piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Fully supporting incremental updation of vocabulary in Word2Vec model #1493

Open chinmayapancholi13 opened 7 years ago

chinmayapancholi13 commented 7 years ago

Updating the vocabulary of a Word2Vec model is experimental right now. Addressing this issue, based on the discussion here, would also be useful elsewhere, for example for adding a partial_fit() function to the sklearn-API class for Word2Vec.

piskvorky commented 7 years ago

Related: #900 (also #700, #775, #435).

gojomo commented 7 years ago

In my opinion, making this non-experimental would require significant research into the kinds of datasets & specific settings where it offers an advantage, and where it just spends time with little or negative benefit. Whether incremental training improves this kind of model is inherently very context-dependent.

(Personally I'd expect a system where all existing words/weights are frozen, and new word-vectors inferred in a process a bit like Doc2Vec inference, to be a more stable/defensible/error-resistant approach.)
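A rough sketch of what such a frozen-weights approach might look like, in plain Python. This is not gensim API and not an implementation anyone in this thread has proposed in detail: the function name infer_new_word_vector, the toy vectors, and all hyperparameters are hypothetical. As a simplification it also uses the same frozen vectors as the output-side weights, whereas a real Word2Vec model keeps separate input and output weights.

```python
import math
import random


def infer_new_word_vector(frozen, contexts, dim, epochs=50, lr=0.05, negatives=2, seed=0):
    """Infer a vector for one new word while leaving `frozen` untouched.

    frozen:   dict mapping known words to their (frozen) vectors
    contexts: known words observed near the new word
    """
    rng = random.Random(seed)
    context_set = set(contexts)
    # Negative samples are drawn only from words that are not observed contexts.
    negative_pool = sorted(w for w in frozen if w not in context_set)
    # Small random start, similar in spirit to how Doc2Vec inference begins.
    new_vec = [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]

    for _ in range(epochs):
        for ctx in contexts:
            targets = [(ctx, 1.0)]
            if negative_pool:
                targets += [(rng.choice(negative_pool), 0.0) for _ in range(negatives)]
            for word, label in targets:
                out = frozen[word]  # read-only: never modified
                score = sum(a * b for a, b in zip(new_vec, out))
                pred = 1.0 / (1.0 + math.exp(-score))
                g = lr * (label - pred)
                # Only the new word's vector receives a gradient step.
                new_vec = [v + g * o for v, o in zip(new_vec, out)]
    return new_vec


# Toy frozen vectors, made up for illustration.
frozen = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [-1.0, 0.0],
    "road": [-0.9, 0.1],
}
new_vec = infer_new_word_vector(frozen, contexts=["cat", "dog"], dim=2)
print(new_vec)
```

Because the existing vectors never change, previously computed similarities between old words stay valid, which is the stability argument made above.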

piskvorky commented 7 years ago

Yes, two directions here -- 1) making it possible 2) determining whether it makes sense.

#900 deals with 1); @gojomo is talking about 2).

If we have 1), we could outsource 2) to all the people who are asking (perhaps mistakenly) for this feature. It's one of the most requested properties of *2vec, which probably reflects some common underlying need across many applications of *2vec.

gojomo commented 7 years ago

We have (1), that's why my focus is on (2). And (2) is only possible after we either get a bunch of research/experimentation done, or manage to collect such results from other people. Until then, I believe the existing (1) "it's possible" feature needs lots of caveats/disclaimers that effectively discourage beginners from relying upon it.