piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Sample weights. #2701

Open csbrown opened 4 years ago

csbrown commented 4 years ago

Problem description

I need to weight my data in training. The Gensim API does not currently provide this functionality.

Proposed solution

I'm pretty sure adding a sample_weight arg to train_cbow_pair, train_sg_pair, Word2Vec.__init__, and Word2Vec.train will get the argument everywhere it needs to be. Also, adding neu1e *= sample_weight to train_cbow_pair (line 375) and train_sg_pair (line 280) will accomplish a naive weighting scheme by simply weighting the individual loss terms. I'm happy to implement this and write some tests if anyone seconds this motion.
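A minimal sketch of what such a naive weighting could look like, written in pure Python rather than against Gensim's actual code. The function name `train_pair_weighted` and its signature are hypothetical; only `neu1e` and `sample_weight` come from the proposal above. It performs one negative-sampling update for a single (input, target) pair and scales the gradient by `sample_weight`. In this sketch the weight scales both the input-vector and output-vector updates; multiplying only `neu1e` would leave the output-vector updates unweighted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_pair_weighted(input_vec, output_vecs, labels, alpha=0.025, sample_weight=1.0):
    """One SGD step for a single training pair with negative sampling.

    input_vec     -- vector of the input word (list of floats), updated in place
    output_vecs   -- output-layer vectors: the true target plus the negatives,
                     each updated in place
    labels        -- 1.0 for the true target, 0.0 for each negative
    sample_weight -- hypothetical per-example weight applied to the gradient
    Returns neu1e, the accumulated update applied to input_vec.
    """
    dim = len(input_vec)
    neu1e = [0.0] * dim
    for out_vec, label in zip(output_vecs, labels):
        score = sigmoid(sum(a * b for a, b in zip(input_vec, out_vec)))
        # Weighting the loss term is equivalent to scaling its gradient.
        g = (label - score) * alpha * sample_weight
        for i in range(dim):
            neu1e[i] += g * out_vec[i]      # uses the pre-update output value
            out_vec[i] += g * input_vec[i]  # output-layer update
    for i in range(dim):
        input_vec[i] += neu1e[i]            # input-vector update
    return neu1e


if __name__ == "__main__":
    vec = [0.1, -0.2]
    outs = [[0.3, 0.4], [-0.5, 0.6]]
    print(train_pair_weighted(vec, outs, [1.0, 0.0], sample_weight=2.0))
```

With `sample_weight=0.0` the example contributes nothing, and with `sample_weight=1.0` the update is the ordinary unweighted one.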

gojomo commented 4 years ago

Note that the train_cbow_pair and train_sg_pair pure-Python paths are no longer live in the develop branch, and only the Cython code paths will be maintained in the future.

To add new options to such core paths, it's most important to have (1) a clear rationale, ideally with demonstrations of its usage and the unique benefit it provides; and (2) evidence that it causes minimal or no added complexity or slowdown for the common cases. So there really needs to be a working demo before anyone can make a truly informed judgement on whether it's worth integrating.

One simple way of overweighting some training examples in these models is to repeat them – ideally not in a consecutive clump, but throughout the training data.
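As a rough sketch of that repetition approach, assuming integer weights: the helper name `repeat_by_weight` is hypothetical and not part of Gensim. It repeats each sentence according to its weight, then shuffles so the copies are spread through the corpus instead of sitting in one clump.

```python
import random


def repeat_by_weight(sentences, weights, seed=0):
    """Return a shuffled corpus in which sentences[i] appears weights[i] times.

    sentences -- list of tokenized sentences (lists of str)
    weights   -- non-negative integers, one per sentence
    seed      -- seed for a reproducible shuffle
    """
    if len(sentences) != len(weights):
        raise ValueError("sentences and weights must have the same length")
    expanded = []
    for sentence, weight in zip(sentences, weights):
        expanded.extend([sentence] * weight)
    # Shuffle so repeated examples are not consecutive in the training data.
    random.Random(seed).shuffle(expanded)
    return expanded


if __name__ == "__main__":
    corpus = [["human", "interface"], ["graph", "trees"], ["user", "survey"]]
    for sentence in repeat_by_weight(corpus, [3, 1, 2]):
        print(sentence)
```

The resulting list can be passed as the training corpus in place of the original sentences. Non-integer weights would need rounding or some sampling scheme, which this sketch does not attempt.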

Also, note the experimental _lockf arrays provide a per-word multiplier factor for adjustments to the corresponding word. While the original motivation was to provide a way to soften (or eliminate) updates to certain words, it might also be useful for overweighting updates to chosen words.
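To illustrate the idea only (this is not Gensim's internal code, and `apply_update` is a hypothetical name): a per-word multiplier array scales whatever update training would apply to that word's vector. A factor of 0.0 freezes the word, 1.0 leaves training unchanged, and a value above 1.0 would overweight the update, assuming the implementation does not clip the factor.

```python
def apply_update(vectors, lockf, word_index, update):
    """Apply a training update to one word vector, scaled by its lock factor.

    vectors    -- list of word vectors (lists of floats), modified in place
    lockf      -- per-word multipliers, one float per word
    word_index -- index of the word being updated
    update     -- the raw update the training step computed for that word
    """
    factor = lockf[word_index]
    vectors[word_index] = [v + factor * u for v, u in zip(vectors[word_index], update)]


if __name__ == "__main__":
    vectors = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
    lockf = [0.0, 1.0, 2.0]  # frozen, normal, overweighted
    for index in range(len(vectors)):
        apply_update(vectors, lockf, index, [0.5, -0.5])
    print(vectors)
```

Note that this weights words rather than training examples, so it only approximates per-sample weighting.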

aduriseti commented 3 years ago

hey @csbrown - did you ever implement this?

did not see it on your fork

may end up doing this myself if u have not

csbrown commented 3 years ago

@aduriseti I did not implement this, no. My use case was... unusual... so I got the impression that they didn't want such a feature here. I ended up just making a W2V model in TensorFlow. My stuff is here. Most relevant to this are tf_w2v_models.py, which depends on tensorflow_models.py for pickling support.