piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Allow asymmetrical windows for word2vec. #2172

Open generall opened 6 years ago

generall commented 6 years ago

I found that the WordRank model can be trained with a symmetric parameter. It allows predicting the next word in a sequence based only on the words to its left. I think this option would also be helpful for other models like Word2Vec and FastText.

gojomo commented 6 years ago

This idea has come up before; is there any evidence that asymmetric windows perform better for certain tasks, and if so, which tasks?

gojomo commented 6 years ago

Compare also the occasional request for custom weighting: #2114

Does the in-progress PR #2173 still apply a reduced_window of up to the window size on each side, so that an even-smaller-than-specified window is often used?

generall commented 6 years ago

This idea has come up before; is there any evidence that asymmetric windows perform better for certain tasks, and if so, which tasks?

I used a left-only window for transaction prediction. I can't share any specific details, but this approach gave me a significant improvement. GloVe also has a similar parameter, so I guess I'm not the only one who uses it.

MatthieuMontecot commented 3 years ago

Hi, I would like to use asymmetric windows for IP embedding (in order to replicate the IP2Vec implementation). This approach requires specifying pairs of words as the input/output of the W2V model (let's say skip-gram). Is there a way to force an asymmetric role (context/prediction) within pairs of words? I saw on blogs that I could modify the libraries myself, but I'm not comfortable with messing with the license.

gojomo commented 3 years ago

There's no built-in support for varying the context windows from the classic 'symmetric around a target word' - so you'd have to modify the code for most such customization. (If such modifications were clean & flexible, with demonstrations of situations where they offer a clear benefit, they could be welcome contributions to the project's main codebase.)

I'm not familiar with the specific 'IP2Vec' approach you've mentioned, but in some cases, corpus preprocessing may be able to simulate other policies, especially with regard to skip-gram training.

For example, the 'sentence' ['a', 'b', 'c', 'd', 'e'] with window=5 would normally result in a particular set of skip-gram pairs being used for training (which happens to include every 'word' with every other, though with some probabilistic weighting where the nearer-pairs are trained more often).

If you were to, completely outside of Word2Vec, split that 'sentence' into your own smaller sentences before passing them to Word2Vec, you could effectively overweight or remove unwanted pairings. For example, if ['a', 'd'] is known to be significant despite the 3-token distance, while ['a', 'c'] is known to be irrelevant, your expansion of the original 5-token sentence into N 2-token sentences could repeat ['a', 'd'] more than once and leave out ['a', 'c'] completely, giving you more influence over the actual skip-grams used without needing to modify Gensim's Word2Vec code at all.
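As a concrete illustration (a minimal sketch only; the pair choices, repetition counts, and parameter values below are made-up assumptions, not from this thread), such a preprocessing step might look like:

```python
from gensim.models import Word2Vec

# Original 5-token 'sentence' from the example above.
original = ['a', 'b', 'c', 'd', 'e']

# Hand-built 2-token 'sentences': each one yields exactly one skip-gram pair
# in each direction, so repeating or omitting a pair controls its weight.
pair_sentences = [
    ['a', 'b'],
    ['a', 'd'], ['a', 'd'],   # repeated: overweight a known-significant pair
    ['b', 'c'],
    ['c', 'd'],
    ['d', 'e'],
    # ['a', 'c'] omitted entirely: an irrelevant pairing never gets trained
]

# window=1 is enough since every 'sentence' is only two tokens long.
# (vector_size is the gensim 4.x parameter name; older versions call it size.)
model = Word2Vec(pair_sentences, sg=1, window=1, min_count=1, vector_size=32)
```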

MatthieuMontecot commented 3 years ago

Thanks for this reply. In the case you're describing, what I want is for the network to be trained with 'a' as input and 'b' as target but NOT the opposite; I don't want it trained with 'b' as input and 'a' as target, which I think it still does if I give it a sentence ['a', 'b'] in the current implementation. (In IP2Vec, this asymmetry helps distinguish between sources and destinations, and is a key feature for detecting botnet IPs after the W2V-based embedding part.) Also, do you know if I'm allowed to modify the code if it's not part of it yet?

gojomo commented 3 years ago

Thanks for this reply. In the case you're describing, what I want is for the network to be trained with 'a' as input and 'b' as target but NOT the opposite; I don't want it trained with 'b' as input and 'a' as target, which I think it still does if I give it a sentence ['a', 'b'] in the current implementation. (In IP2Vec, this asymmetry helps distinguish between sources and destinations, and is a key feature for detecting botnet IPs after the W2V-based embedding part.)

I can't think of a way to simulate that in Word2Vec without code changes. (But also, I'm not sure that in practice it'd make much difference. If it trains 'b' to be good at skip-gram-predicting 'a', but then you never really care about the 'b' input-vector... your results might still be fine in the 'a'->'b' direction. You'd just be doing some unnecessary work. I'd be tempted to try it, just in case it works.)

Depending on other details, you might be able to simulate something more like that in Doc2Vec, especially its PV-DBOW mode (dm=0) - where the single 'floating' full-document vector (tag) is used to predict each document-word in turn, in skip-gram-like fashion, but is never conversely predicted by the word-vectors. But again, not familiar enough with IP2Vec to know for sure.
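Not IP2Vec itself, but as a rough sketch of that PV-DBOW idea (the tags, words, and parameter values below are purely illustrative assumptions), one might map each 'source' to a document tag and the tokens it should predict to that document's words:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical data: each source IP becomes a document tag, and the
# destination IPs/ports it contacted become the 'words'. In PV-DBOW (dm=0)
# the tag vector is trained to predict the words, but the words are never
# trained to predict the tag, giving a one-directional input->target setup.
docs = [
    TaggedDocument(words=['10.0.0.5', '443'], tags=['192.168.1.2']),
    TaggedDocument(words=['10.0.0.7', '80'],  tags=['192.168.1.2']),
    TaggedDocument(words=['10.0.0.5', '22'],  tags=['192.168.1.9']),
]

model = Doc2Vec(docs, dm=0, vector_size=32, min_count=1, epochs=40)
src_vec = model.dv['192.168.1.2']   # embedding of the 'source' tag (gensim 4.x API)
```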

Also, do you know if I'm allowed to modify the code if it's not part of it yet?

Not sure what you mean by this. All the Gensim source code is available & free-to-use & free-to-modify. (Its LGPL license would only block you from distributing your own modification while claiming more restrictions on users.)

MatthieuMontecot commented 3 years ago

Thank you for your answer, I'm gonna try the Doc2Vec approach!

charanrajt commented 3 years ago

@MatthieuMontecot did Doc2Vec approach work for IP2Vec?

charanrajt commented 3 years ago

@gojomo where can I find the skip-gram and CBOW implementation in the code base?

gojomo commented 3 years ago

@gojomo where can I find the skip-gram and CBOW implementation in the code base?

@charanrajt Search for cbow and sg in the word2vec.py file, and especially the optimized Cython word2vec_inner.pyx file (where the serious tight loops are done).
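For reference (a minimal sketch; the toy sentences are made up), the sg parameter on the Word2Vec constructor is what selects between those two code paths:

```python
from gensim.models import Word2Vec

sentences = [['a', 'b', 'c'], ['b', 'c', 'd']]

# sg=1 selects the skip-gram training path; sg=0 (the default) selects CBOW.
# Both end up calling the optimized routines in word2vec_inner.pyx.
skipgram_model = Word2Vec(sentences, sg=1, min_count=1)
cbow_model = Word2Vec(sentences, sg=0, min_count=1)
```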