piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Support (> 10000)-token texts in `infer_vector()` #2583

Open gojomo opened 5 years ago

gojomo commented 5 years ago

As infer_vector() uses the same optimized Cython functions as training behind the scenes, it also suffers from training's fixed-token-buffer limit: texts with more than 10000 tokens have all overflow tokens silently ignored.
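
For illustration, a minimal sketch of how that silent truncation shows up at the API level (assuming an already-trained model saved as `my_doc2vec.model`, a hypothetical filename):

```python
from gensim.models.doc2vec import Doc2Vec

# hypothetical: load a previously trained Doc2Vec model from disk
model = Doc2Vec.load("my_doc2vec.model")

long_doc = ["some_token"] * 20000  # 20k tokens, twice the buffer size
vec = model.infer_vector(long_doc)
# everything past the first 10000 tokens is silently ignored, so `vec`
# is effectively the inference result for a 10k-token document
```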

But this might be easier to fix for inference, since inference could simply call the training functions with mini-batches that all reuse the same temporary candidate vector-in-training.

(And in fact, thinking about this makes me wonder if we should just auto-chunk documents during training, too – perhaps with a warning to the user the 1st time this happens. Previously, I'd wanted to fix this limitation by using alloca() to replace our fixed 10000-slot on-stack arrays with variable-length on-stack arrays – which worked in my tests, and perhaps even offered a memory-compactness advantage/speedup for all the cases where texts were smaller than 10000 tokens – but alloca() isn't guaranteed to be available everywhere, even though in practice it seems to be available everywhere we support.)

gojomo commented 10 months ago

Deep in discussion of an SO question (https://stackoverflow.com/q/77371009/130288), a user comment reminded me: while the 10k-token limit in Doc2Vec training can be worked around (by repeating the same 'tag' across multiple texts, closely approximating how a doc of arbitrary size would obtain a single vector), no similar workaround is possible when using infer_vector().
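
A rough sketch of that training-time workaround (the helper name `chunked_tagged_docs` and the `max_len` parameter are mine, not gensim API):

```python
from gensim.models.doc2vec import TaggedDocument

def chunked_tagged_docs(tokens, tag, max_len=10000):
    # split an over-long token list into <=10000-token pieces that all
    # share the same tag, so training accumulates them into one doc-vector
    return [TaggedDocument(words=tokens[i:i + max_len], tags=[tag])
            for i in range(0, len(tokens), max_len)]
```

Feeding these chunks into training approximates one mega-doc with a single vector, but there's no analogous trick on the infer_vector() side.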

Specifically: among other potential uses this impairs, it fouls the crude 'sanity check' of Doc2Vec training consistency where you re-infer vectors for the training docs and ensure they're generally "close" to the precalculated vectors for the same docs. (The pretrained doc-vecs will reflect all words across all the training-text[-chunks] supplied with the same tag, whereas feeding that same mega-doc to infer_vector() will only yield a doc-vec for the first 10k tokens.)
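
For concreteness, a sketch of that sanity check, assuming a gensim 4.x model bound to `model` and the training corpus as a list of TaggedDocument in `train_corpus` (both names hypothetical):

```python
# re-infer a vector for each training doc & check it lands near its
# precalculated counterpart among the trained doc-vectors
for doc in train_corpus:
    inferred = model.infer_vector(doc.words)
    closest = model.dv.most_similar([inferred], topn=3)
    # expect doc.tags[0] among the closest tags most of the time; for
    # >10k-token docs the inferred vector only reflects the first 10k
    # tokens, so this check degrades
```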

Obviously a general removal of the 10k limit (#2880) would resolve this.

The auto-chunking mentioned above might be a simple tweak, achievable in the pure-Python code (or as an alternate helper function), easiest to apply to infer_vector() only. It might mesh well with things mentioned in #515 (like batching inference of many docs or allowing user-supplied starting vectors).
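
One possible shape for such a pure-Python helper (the name `infer_vector_chunked` is hypothetical, and plain averaging of per-chunk vectors is only a crude stand-in for truly reusing one candidate vector across chunks inside the Cython routines):

```python
import numpy as np

def infer_vector_chunked(model, tokens, max_len=10000, **infer_kwargs):
    # split the doc into <=10000-token chunks, infer each separately,
    # then average the resulting vectors into a single doc-vector
    chunks = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
    vecs = [model.infer_vector(chunk, **infer_kwargs) for chunk in chunks]
    return np.mean(vecs, axis=0)
```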

Some sort of auto-chunking deep in the shared train/infer routines could be a more general workaround for #2880.

Biting the bullet of using not-officially-standard-but-practically-everywhere-we-care-about alloca() deep in the Cython would be the most concise & simple fix, with potential spillover performance benefits (smaller allocations, now on-stack, in the usual case of smaller-than-10k docs) but one new outlier risk: users supplying docs larger than stack-allocable memory. (That might be rare enough, & fail legibly enough, that it's no more concern than other places where extremely-atypical usages trigger understandable errors. Or perhaps it's addressable with softer & more-legible warnings or exception-handling.)