piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Doc2Vec: mixing plain int and string tags will give unexpected results #491

Closed gojomo closed 8 years ago

gojomo commented 9 years ago

For typical Doc2Vec use with a large corpus and only a single (ID) tag per document, the best approach is to use contiguous plain ints starting from 0. (It avoids creating a string tag->index dict.)

Docs can, however, have more than one tag, and it's quite natural to think such other tags will be symbolic, such as a category name. However, mixing plain-int and string tags in the same corpus will lead to confusing results: the ints are taken as literal indexes into the array of docvecs-in-training, but any string is assigned the next available int when it is first encountered. Unless the user has been careful not to use that same int as a plain-int ID, the int and string will inadvertently share the same vector.
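To make the collision concrete, here is a minimal pure-Python sketch of the slot assignment described above. It is a simplified model, not gensim's actual code: the function names `assign_slots_naively` and `find_collisions` are hypothetical, and it assumes that "next available int" means a counter over the string tags seen so far, starting at 0.

```python
from collections import defaultdict


def assign_slots_naively(corpus_tags):
    """Return (tag, slot) pairs under the simplified pre-fix scheme.

    Assumption: int tags are used as literal slots, and each new string
    tag takes the next value of a counter that starts at 0.
    """
    string_slots = {}  # string tag -> slot assigned on first encounter
    pairs = []
    for tags in corpus_tags:
        for tag in tags:
            if isinstance(tag, int):
                slot = tag  # plain int: taken as a literal index
            else:
                if tag not in string_slots:
                    string_slots[tag] = len(string_slots)
                slot = string_slots[tag]
            pairs.append((tag, slot))
    return pairs


def find_collisions(corpus_tags):
    """Map each slot shared by more than one distinct tag to those tags."""
    by_slot = defaultdict(set)
    for tag, slot in assign_slots_naively(corpus_tags):
        by_slot[slot].add(tag)
    return {slot: tags for slot, tags in by_slot.items() if len(tags) > 1}


# Each doc has a plain-int ID plus a symbolic category tag.
corpus_tags = [[0, "sports"], [1, "politics"], [2, "sports"]]
collisions = find_collisions(corpus_tags)
# Slot 0 is shared by 0 and "sports"; slot 1 by 1 and "politics".
```

Under these assumptions, doc 0 and the "sports" category would train the same vector, which is the confusing result this issue is about.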

Using only ints or only strings will work well in the current code, but the scan can probably be adapted to avoid this problem. (First idea: strings are given temporary provisional indexes upon first discovery: perhaps, negative indexes. Only at the end of the full corpus scan are all string tags, those in model.docvecs.doctags and model.docvecs.index2doctag, given real final positive int indexes, after all other indexes. In the case where only string tags are used, their assigned indexes will wind up as the same starting-from-zero ints as before.)

gojomo commented 9 years ago

String doctags (and the Doctag) now remember their position not as an index-from-0, but an offset from the end of any raw-int index slots (if any). The property index2doctag is thus renamed offset2doctag, and a max_rawint value is also tracked during tag-discovery (vocab-scan pass).

In the case of all raw-int tags, this difference is irrelevant: dict doctags and list offset2doctag remain empty. In the case of all string tags, the offset2doctag positions still map as 0-based indexes into the arrays. Only in the mixed case does position 0 of offset2doctag now mean index max_rawint + 1 when accessing the arrays. (Though the mixed case won't be typical, it is likely to be attractive for some semi-supervised Doc2Vec uses.)
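The offset arithmetic can be sketched in a few lines of plain Python. This is an illustration of the scheme as described in this comment, not the library's implementation: `scan_tags` and `array_index` are hypothetical helper names, and starting `max_rawint` at -1 when no raw ints are seen is an assumption chosen so that the all-string case stays 0-based.

```python
def scan_tags(corpus_tags):
    """One pass over all tags, mimicking the tag-discovery scan."""
    max_rawint = -1      # assumed starting value: no raw-int slots yet
    doctags = {}         # string tag -> offset past the raw-int slots
    offset2doctag = []   # offset -> string tag
    for tags in corpus_tags:
        for tag in tags:
            if isinstance(tag, int):
                max_rawint = max(max_rawint, tag)
            elif tag not in doctags:
                doctags[tag] = len(offset2doctag)
                offset2doctag.append(tag)
    return max_rawint, doctags, offset2doctag


def array_index(tag, max_rawint, doctags):
    """Resolve a tag to its row in the doc-vector arrays."""
    if isinstance(tag, int):
        return tag                          # raw ints index directly
    return max_rawint + 1 + doctags[tag]    # strings sit after all raw ints


# Mixed case: raw ints occupy rows 0..2, strings follow from row 3.
mixed = [[0, "sports"], [1, "politics"], [2, "sports"]]
max_rawint, doctags, offset2doctag = scan_tags(mixed)
sports_row = array_index("sports", max_rawint, doctags)      # 3
politics_row = array_index("politics", max_rawint, doctags)  # 4

# All-string case: max_rawint stays -1, so offsets are plain 0-based rows.
only_strings = [["a"], ["b"]]
s_max, s_doctags, _ = scan_tags(only_strings)
a_row = array_index("a", s_max, s_doctags)  # 0
```

Because string rows are resolved only after the full scan has fixed `max_rawint`, a string tag can no longer land on a row that a raw int also uses.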

rambo-yuanbo commented 6 years ago

Hi gojomo, if I use only integer tags, but not consecutive ints (some int tags are skipped because some of the input docs are considered invalid and thrown away), what should I expect after training? Am I going to see all-zero vectors in the trained model for those unseen tags?

gojomo commented 6 years ago

The model will allocate slots for those never-seen ints, and those slots will get the same pre-training small-random-vector initialization as other doc-vectors, but then never be further trained. So they will have small, randomly initialized values. They'll waste memory, but shouldn't have any other negative effect on other vectors (unless you apply some other non-standard process to 'all' vectors, like calculating the average of all vectors or clustering all vectors). It'd be best to drop the bad docs before assigning unique consecutive IDs. Or, if memory isn't an issue, use string doc-tags.
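A minimal sketch of the "drop first, then number" approach, in plain Python. The names `renumber` and `is_valid` are hypothetical, and the validity check here (non-empty word list) is only a stand-in for whatever filter is actually applied. The resulting `(words, tags)` pairs could then be wrapped in gensim's `TaggedDocument` before training.

```python
def renumber(docs, is_valid):
    """Filter out invalid docs, then assign contiguous int tags from 0.

    docs: iterable of (original_id, words) pairs.
    Returns (tagged, new_to_old), where tagged is a list of (words, [new_id])
    and new_to_old[new_id] gives back the original ID.
    """
    tagged = []
    new_to_old = []
    for original_id, words in docs:
        if not is_valid(words):
            continue  # dropped docs never consume an ID
        new_id = len(new_to_old)
        new_to_old.append(original_id)
        tagged.append((words, [new_id]))
    return tagged, new_to_old


# Original IDs are sparse (1 is invalid, 3 and 4 never existed).
docs = [(0, ["a", "b"]), (1, []), (2, ["c"]), (5, ["d"])]
tagged, new_to_old = renumber(docs, is_valid=bool)
# tagged uses tags 0, 1, 2; new_to_old == [0, 2, 5]
```

Keeping `new_to_old` around lets results be reported against the original IDs while the model itself only ever sees a gap-free range, so no untrained slots are allocated.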