csbrown opened this issue 4 years ago
Note that the `train_cbow_pair` and `train_sg_pair` pure-Python paths are no longer live in the `develop` branch; only the Cython code paths will be maintained in the future.
To add new options to such core paths, it's most important to have (1) a clear rationale, ideally with demonstrations of its usage & the unique benefit it provides; (2) evidence it causes minimal-or-no complexity/slowdown for the common cases. So there really needs to be a working-demo before making a truly informed judgement on whether it's worth integrating.
One simple way of overweighting some training examples in these models is to repeat them – ideally not in a consecutive clump, but throughout the training data.
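As a rough sketch of that repetition approach, the helper below (a hypothetical name, not part of gensim) repeats each sentence in proportion to its weight and then shuffles, so the repeats are dispersed throughout the corpus rather than clumped together:

```python
import random

def overweight_by_repetition(sentences, weights, seed=0):
    """Repeat each sentence round(weight) times, then shuffle so the
    repeats are spread throughout the training data rather than
    appearing as a consecutive clump."""
    expanded = []
    for sent, w in zip(sentences, weights):
        expanded.extend([sent] * max(1, round(w)))
    rng = random.Random(seed)
    rng.shuffle(expanded)  # disperse the repeats across the corpus
    return expanded

corpus = [["cat", "sat"], ["dog", "ran"], ["fish", "swam"]]
weights = [3, 1, 1]
expanded = overweight_by_repetition(corpus, weights)
print(len(expanded))  # 5 sentences: ["cat", "sat"] now appears 3 times
```

The shuffled list can then be passed to `Word2Vec` as the training corpus; fractional weights are only approximated here, since repetition can only overweight by integer factors.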
Also, note that the experimental `_lockf` arrays provide a per-word multiplier factor for adjustments to the corresponding word. While the original motivation was to provide a way to soften (or eliminate) updates to certain words, it might also be useful for overweighting updates to chosen words.
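To illustrate the mechanism (without depending on gensim internals, whose exact `_lockf` layout has changed between versions), here is a minimal numpy simulation of a per-word lock factor: each word's raw gradient step is multiplied by its entry in a lockf array before being applied:

```python
import numpy as np

# Simulated word-vector table and per-word lock factors.
vocab_size, dim = 4, 3
vectors = np.zeros((vocab_size, dim))
lockf = np.ones(vocab_size)
lockf[0] = 0.0   # freeze word 0: its updates are zeroed out
lockf[1] = 2.0   # overweight word 1: its updates are doubled

raw_update = np.full(dim, 0.1)  # stand-in for one SGD step
for word_idx in range(vocab_size):
    # The lock factor scales the update for that word only.
    vectors[word_idx] += lockf[word_idx] * raw_update

print(vectors[0])  # [0. 0. 0.]      -- frozen
print(vectors[1])  # [0.2 0.2 0.2]   -- doubled update
```

This is a per-*word* multiplier, not a per-*example* one, so it only approximates sample weighting when the words of interest and the examples of interest coincide.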
hey @csbrown - did you ever implement this?
did not see it on your fork. may end up doing this myself if u have not
@aduriseti I did not implement this, no. My use case was... unusual... so I got the impression that they didn't want such a feature here. I ended up just making a W2V model in TensorFlow. My stuff is here. Most relevant to this are `tf_w2v_models.py`, which depends on `tensorflow_models.py` for pickling support.
Problem description
I need to weight my data in training. The Gensim API does not currently provide this functionality.
Proposed solution
I'm pretty sure adding a `sample_weight` arg to `train_cbow_pair`, `train_sg_pair`, `Word2Vec.__init__`, and `Word2Vec.train` will get the argument everywhere it needs to be. Also, adding `neu1e *= sample_weight` to `train_cbow_pair` (line 375) and `train_sg_pair` (line 280) will accomplish a naive weighting scheme by simply weighting the individual loss terms. I'm happy to implement this and write some tests if anyone seconds this motion.
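To make the proposal concrete, here is a self-contained numpy sketch of a single skip-gram pair update with the suggested scaling. The function name and signature are hypothetical (gensim's actual `train_sg_pair` differs); the point is only that scaling the error term by `sample_weight` scales the whole update for that example linearly:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_sg_pair_weighted(center_vec, context_vec, label, alpha,
                           sample_weight=1.0):
    """One weighted skip-gram pair update (illustrative sketch only).
    The error term is scaled by sample_weight, mirroring the proposed
    `neu1e *= sample_weight` change."""
    score = sigmoid(np.dot(center_vec, context_vec))
    g = (label - score) * alpha           # standard gradient scalar
    neu1e = g * context_vec               # error to apply to center word
    neu1e *= sample_weight                # proposed per-example weighting
    context_vec += g * center_vec * sample_weight
    center_vec += neu1e
    return center_vec, context_vec

rng = np.random.default_rng(0)
c = rng.normal(size=5)
ctx = rng.normal(size=5)
c2, ctx2 = train_sg_pair_weighted(c.copy(), ctx.copy(),
                                  label=1, alpha=0.025, sample_weight=2.0)
```

With this scheme a weight of 2.0 produces exactly twice the single-pair update of a weight of 1.0, which is the "naive weighting of individual loss terms" described above.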