prodriguezsosa / conText

An R package for estimating and doing statistical inference on context-specific word embeddings.

default naming convention for `dimnames()$doc` in `dem` object #13

Status: Open · opened by ArthurSpirling 2 years ago

ArthurSpirling commented 2 years ago

The default names in `dimnames()$doc` on a `dem` object are currently text1, text2, etc. That is, e.g.

```r
library(quanteda)  # tokens(), dfm()
library(conText)   # tokens_context(), dem(), and the cr_* sample objects

toks <- tokens(cr_sample_corpus)
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)
immig_dfm <- dfm(immig_toks)
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform,
                 verbose = FALSE)
dimnames(immig_dem)$doc
```

returns

```
[1] "text1"    "text2"    "text3"    "text4"    "text5"    "text6"    "text7"
```

and so on.

I wonder if this should be something other than `text` because it potentially gives the impression that each incidence is a new "text" (a separate document). But, of course, the whole point here is that one can have many instantiations (and thus many embeddings of the same term) within the same document.

Perhaps we could change it to `instance` (or `occurrence`, `observation`, or `incidence`)? I'm open to not doing anything, but just want to avoid confusion for end users.

ArthurSpirling commented 2 years ago

Follow-up: it looks like `dem()` currently inherits `dimnames()$doc` from `quanteda::docid()` (?), which may be undesirable.

prodriguezsosa commented 2 years ago

Thanks @ArthurSpirling.

> Follow-up: it looks like `dem()` currently inherits `dimnames()$doc` from `quanteda::docid()` (?), which may be undesirable.

Yes, it inherits the docid from the tokens object used (i.e. the `x` argument), which in most cases will come from `tokens_context()`. Why is this undesirable? The reasoning was that you'd want to know which docs were actually embedded: e.g., if a doc (or instance) is not embedded because none of its context features appear in the pre-trained embeddings, you can identify it by checking which docs are in `docid(toks_obj)` but not in `docid(dem_obj)`.
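
A minimal sketch of that check, using the objects from the example above and assuming the embedded instance ids are the ones exposed via `dimnames()$doc`:

```r
# instances present in the tokens object but absent from the dem,
# i.e. dropped because none of their context features appear in the
# pre-trained embeddings
not_embedded <- setdiff(docnames(immig_toks), dimnames(immig_dem)$doc)
not_embedded
```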

> I wonder if this should be something other than `text` because it potentially gives the impression that each incidence is a new "text" (a separate document).

I see your point. I'm thinking perhaps we should have one attribute that captures the original text id and one that captures the instance id. That way we can (a) link back to the original document and (b) check which instances were actually embedded (re: my comment above). A toy sketch of the idea follows.
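
To make the two-id scheme concrete, here is a toy sketch (the field names are hypothetical, not conText API): instance ids stay unique, while text ids can repeat across instances drawn from the same document.

```r
# toy data frame standing in for the proposed docvars fields
ids <- data.frame(
  instance_id = c("instance1", "instance2", "instance3", "instance4"),
  text_id     = c("text1", "text1", "text2", "text3")
)
# (a) text_id links each embedded instance back to its source document;
#     here instance1 and instance2 both come from text1
table(ids$text_id)
# (b) instance_id identifies each embedded occurrence, so comparing it
#     against the instances in the dem flags those that were dropped
```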

On an unrelated note: your suggestion for the summary function might also apply to extending quanteda's `docvars()` function to handle `dem` class objects, so that we can run `docvars(dem_obj)` instead of `dem_obj@docvars` as we currently do. Working first on the summary function, though.
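
A minimal sketch of what that could look like, assuming `quanteda::docvars()` dispatches as an S3 generic on the object's class and that `dem` objects keep their document variables in the `@docvars` slot (this is not implemented in conText):

```r
# hypothetical S3 method: quanteda::docvars() is an S3 generic, and
# class() of a dem object is "dem", so this method would be found on
# dispatch and simply unwrap the @docvars slot
docvars.dem <- function(x, field = NULL) {
  dv <- x@docvars
  if (is.null(field)) dv else dv[[field]]
}

# usage (hypothetical): docvars(immig_dem) instead of immig_dem@docvars
```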

ArthurSpirling commented 2 years ago


@prodriguezsosa thanks. Yes, this seems like the right idea. All I meant by "undesirable" was that if, say, there are 300 instances (but only, say, 150 documents), it will return text1 through text300, which suggests these are the same thing quanteda means by documents 1 through 300, which they aren't. It's a new thing that describes the instance of a word for an embedding, etc.
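
For concreteness, one quick way to see the mismatch with the objects from the example at the top (the counts here are the illustrative ones from above, not real output):

```r
ndoc(toks)        # e.g. 150: source documents in the corpus
ndoc(immig_toks)  # e.g. 300: instances of "immigr*" found by tokens_context()
```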