Open ArthurSpirling opened 2 years ago
Follow-up: it looks like `dem()` currently inherits the `dimnames()$doc` from `quanteda::docid` (?), which may be undesirable.
Thanks @ArthurSpirling.
> follow up: looks like `dem()` currently inherits the `dimnames()$doc` from `quanteda::docid` (?) which may be undesirable.
Yes, so it inherits the `docid` from the tokens object used (i.e. the `x` argument), which in most cases will be from `tokens_context`. Why is this undesirable? The reasoning for that was that you'd want to know which docs were actually embedded, so e.g. if a doc (or instance) is not embedded because none of its contexts were in the pre-trained embeddings, then you can identify it by looking at which docs are in `docid(toks_obj)` but aren't in `docid(dem_obj)`.
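
The comparison described here can be sketched with base R's `setdiff()`. The ids below are made-up stand-ins for the output of `docid(toks_obj)` and `docid(dem_obj)`; no conText functions are called, so this only illustrates the check itself, not the package's behaviour.

```r
# Hypothetical ids standing in for docid(toks_obj) and docid(dem_obj).
toks_ids <- c("text1", "text2", "text3", "text4")
# Assumption for illustration: text3 had no context words in the embeddings.
dem_ids <- c("text1", "text2", "text4")

# Ids present in the tokens object but missing from the dem object,
# i.e. the instances that were not embedded.
not_embedded <- setdiff(toks_ids, dem_ids)
print(not_embedded)
```

If the real ids come back as a factor rather than a character vector, converting both sides with `as.character()` before calling `setdiff()` is probably the safer route.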
> I wonder if this should be something other than `text` because it potentially gives the impression that each incidence is a new "text" (a separate document).
I see your point. Am thinking perhaps we should have an attribute which captures the original text id, and one which captures the instance id. That way we can a. link back to the original document and b. check which instances were actually embedded (re. my comment above).
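
A rough sketch of what carrying both ids could look like, using a plain data frame. The column names `instance_id` and `doc_id` are hypothetical and are not existing `dem` attributes; the values are invented.

```r
# Hypothetical metadata: three embedded instances drawn from two documents.
instance_meta <- data.frame(
  instance_id = c("instance1", "instance2", "instance3"),
  doc_id      = c("text1", "text1", "text2"),
  stringsAsFactors = FALSE
)

# a. Link an instance back to its original document.
source_doc <- instance_meta$doc_id[instance_meta$instance_id == "instance2"]

# b. Count how many instances were embedded per original document.
instances_per_doc <- table(instance_meta$doc_id)

print(source_doc)
print(instances_per_doc)
```

With two separate ids, several instances can share one `doc_id` without implying that each instance is a document of its own.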
Unrelated note: your suggestion for the summary function might be applicable to adjusting quanteda's `docvars` function to process `dem` class objects, such that we can run `docvars(dem_obj)` instead of `dem_obj@docvars` as we currently do. Working first on the summary function though.
@prodriguezsosa thanks. Yes, this seems like the right idea. All I meant by "undesirable" was that if, say, there are 300 instances (but, say, 150 documents), it will return `text1` through `text300`, but this suggests these are the same thing that quanteda means by documents 1 through 300, which they aren't. It's a new thing, which describes the instance of a word for an embedding etc.
The default for `dimnames()$doc` on a `dem` object is currently `text1`, `text2` etc. That is e.g. returns etc
I wonder if this should be something other than `text` because it potentially gives the impression that each incidence is a new "text" (a separate document). But, of course, the whole point here is that one can have many instantiations (and thus many embeddings of the same term) in the same document.
Perhaps we could change it to `instance` (or `occurrence` or `observation` or `incidence`)? Open to not doing anything, but just want to avoid confusion for end-users.
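
For illustration only, here is how such a relabelling looks on an ordinary base R matrix. The matrix `emb` is a toy stand-in, not a real `dem` object, and the `instance` prefix is just one of the candidate names above.

```r
# Toy 3 x 2 matrix standing in for an embedding matrix with default row labels.
emb <- matrix(0, nrow = 3, ncol = 2,
              dimnames = list(paste0("text", 1:3), c("dim1", "dim2")))

# Swap the "text" prefix for "instance" so rows no longer read as documents.
rownames(emb) <- sub("^text", "instance", rownames(emb))

print(rownames(emb))
```

Whether something like this belongs inside `dem()` itself or stays a user-side step is a separate design question.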