Closed gojomo closed 8 years ago
String doctags (and the `Doctag`) now remember their position not as an index-from-0, but as an offset from the end of any raw-int index slots (if any). The property `index2doctag` is thus renamed `offset2doctag`, and a `max_rawint` value is also tracked during tag-discovery (the vocab-scan pass).
In the case of all raw-int tags, this difference is irrelevant: the dict `doctags` and list `offset2doctag` remain empty. In the case of all string tags, the `offset2doctag` positions still map as 0-based indexes into the arrays. Only in the mixed case does position 0 of `offset2doctag` now mean index `max_rawint + 1` when accessing the arrays. (Though the mixed case won't be typical, it is likely to be attractive for some semi-supervised Doc2Vec uses.)
hi @gojomo, if I use only integer tags, but not consecutive ints (some int tags are skipped because some of the input docs are considered invalid and thrown away), what should I expect after training? Am I going to see all-zero vectors for those tags the trained model never saw?
The model will allocate slots for those never-seen ints, and those slots will get the same pre-training small-random-vector initialization as other doc-vectors – but then never be further trained. So they will retain small, random initialized values. They'll waste memory, but shouldn't have any other negative effect on other vectors (unless you apply some other non-standard process to 'all' vectors – like calculating the average of all vectors, or applying clustering to all vectors). It'd be best to drop the bad docs before assigning unique consecutive IDs. Or, if memory isn't an issue, use string doc-tags.
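A quick sketch of why skipped int tags waste memory: the doc-vector array is sized to the highest int tag plus one, so the gap rows still get allocated and randomly initialized. (The names and the exact initialization formula here are illustrative, not gensim internals.)

```python
import numpy as np

tags_used = [0, 1, 4, 7]       # non-contiguous: 2, 3, 5, 6 were skipped
vector_size = 5
n_rows = max(tags_used) + 1    # 8 rows allocated, 4 of them never trained

rng = np.random.default_rng(0)
# every row gets a small random initialization before training
doctag_vectors = (rng.random((n_rows, vector_size)) - 0.5) / vector_size

# rows for the skipped tags keep these small random values forever;
# only the rows listed in tags_used would be updated during training
print(doctag_vectors.shape)    # (8, 5)
```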
For typical Doc2Vec use with a large corpus and only a single (ID) tag per document, the best approach is to use contiguous plain ints starting from 0. (It avoids creating a string tag->index dict.)
Docs can, however, have more than one tag, and it's quite natural to think such other tags will be symbolic, such as a category name. However, mixing plain-int and string tags in the same corpus will lead to confusing results: the ints are taken as literal indexes into the array of docvecs-in-training, but any string is assigned the next-available int when it is first encountered. Unless the user has been careful not to use that same int as a plain-int ID, the int and string will inadvertently share the same vector.
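A minimal sketch of the scan behavior that causes the collision (illustrative only; `scan_tags` is a stand-in for the real vocab-scan logic):

```python
def scan_tags(tagged_docs):
    """Naive scan: strings get the next-available int as they appear."""
    doctags = {}       # string tag -> assigned index
    max_index = -1
    for tags in tagged_docs:
        for tag in tags:
            if isinstance(tag, int):
                max_index = max(max_index, tag)
            elif tag not in doctags:
                doctags[tag] = len(doctags)  # next-available int
    return doctags, max_index

# 'SPORTS' is the first string seen, so it gets index 0 -- but the
# plain-int tag 0 also points at index 0: they share one vector
doctags, _ = scan_tags([[0, 'SPORTS'], [1, 'NEWS']])
print(doctags)  # {'SPORTS': 0, 'NEWS': 1}
```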
Using only ints or only strings will work well in the current code, but the scan can probably be adapted to avoid this problem. (First idea: strings are given temporary provisional indexes upon first discovery: perhaps negative indexes. Only at the end of the full corpus scan are all string tags – those in `model.docvecs.doctags` and `model.docvecs.index2doctag` – given real, final, positive int indexes, placed after all other indexes. In the case where only string tags are used, their assigned indexes will wind up as the same starting-from-zero ints as before.)