piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Feature Suggestion: Wang2Vec #547

Open shirish93 opened 8 years ago

shirish93 commented 8 years ago

Hello all,

I have looked around extensively to see if anyone's worked on porting the wang2vec implementation to gensim, but haven't been able to find anything. Is there any interest in that direction, or is that an option not being considered? A version of Word2Vec that considers word ordering would be a kick-ass feature to have, particularly for tasks that rely on syntactic features. I've noticed that several of the more recent pre-published papers on meaning disambiguation tend to lean in that direction.

As a side note, I see that the 'retrofitting Word2Vec' approach by Faruqui et al. would be a particularly easy port, so would there be interest in including that as a feature? I figure that would be an easy pull request for someone like myself, who's never contributed to any project before. On the other hand, I see that the standards for gensim are particularly high, so it's not clear it's desirable.

Edited: Grammar.

piskvorky commented 8 years ago

Hello @shirish93! Thanks for the suggestion, but let me step back a little -- are these two separate algorithms, i.e. two separate feature requests?

I've never heard of wang2vec, but if it's useful and fits gensim's scope (unsupervised text semantics + text similarity), then it would be a welcome addition!

Generally speaking, we try to implement only things that have proven to be useful, either by having a clear use case or multiple people asking for it. There are just too many papers coming out to "properly" implement everything.

Can you explain a little more about how either of these two features would be used, what is the benefit to gensim users? Cheers.

shirish93 commented 8 years ago

Thanks for the quick response @piskvorky

Wang2Vec is based on "Two/Too Simple Adaptations of Word2Vec for Syntax Problems" (and implemented here), and its algorithm seems to preserve word-order information (according to the paper). The paper argues that preserving order is important for syntactic tasks.

From the paper:

However, as these models are insensitive to word order, embeddings built using these models are suboptimal for tasks involving syntax, such as part-of-speech tagging or dependency parsing.

I realize pre-published papers aren't the best source of recommendations, but the Sense2Vec paper that has recently been getting some attention seems to rely heavily on Wang2Vec. As to performance, I've seen anecdotal evidence in online forums (besides the paper itself) of its greater usefulness in syntax-related tasks. Regardless, I suspect that as more work is done on combining syntax with word-embedding models, this will become increasingly relevant. It might be worth keeping in mind for the longer term? I get the point of not implementing every popular paper out there, but preserving word-order information would be a significant addition to gensim.

The second request (retrofitting Word2Vec) is a different feature/algorithm; I didn't want to spam the issues section with multiple posts. It seems particularly interesting because it doesn't involve the actual corpus or training per se. It suggests the possibility of combining existing trained models with available lexicon data to improve performance. A use case would be: a user downloads a pre-trained model, fiddles around with the lexicon data, and then uses it to 'retrofit' the model. This would give users the ability to tweak their models post-training and without the corpus, something that is not possible currently. An exciting possibility here could be a method to 'merge' different models in some meaningful way. @gojomo has answered an SO question asking for a method to 'combine' different models. If the word-word relationships generated by one model were turned into a 'lexicon', and that information were used to 'retrofit' a different model, that might be a theoretically sounder way of combining them? This is just one use-case scenario, to suggest that adding the ability to retrofit would open a whole new world of post-training exploration for users.
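To make the idea concrete, here's a minimal numpy sketch of the iterative update from the retrofitting paper (the names `word_vecs` and `lexicon` are just illustrative inputs -- a word-to-vector dict pulled out of a trained model and a word-to-related-words dict from a lexicon such as PPDB or WordNet -- not gensim API):

```python
import numpy as np

def retrofit(word_vecs, lexicon, iterations=10):
    """Nudge each vector towards its lexicon neighbours while staying
    anchored to its original, corpus-trained position."""
    new_vecs = {word: vec.copy() for word, vec in word_vecs.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            neighbours = [n for n in neighbours if n in new_vecs]
            if word not in new_vecs or not neighbours:
                continue
            # weighted combination: the original vector (weighted by the
            # number of neighbours) plus the current neighbour vectors
            new_vec = len(neighbours) * word_vecs[word]
            for n in neighbours:
                new_vec = new_vec + new_vecs[n]
            new_vecs[word] = new_vec / (2 * len(neighbours))
    return new_vecs
```

The original vectors are left untouched and the update converges in a handful of iterations (the paper uses 10), so it really is a pure post-training step.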

For full disclosure: the 'retrofit' algorithm would be easy to port (it's already implemented using numpy), and I wanted to contribute to gensim somehow, so I figured that would be an easy and useful way to make a contribution; not emotionally attached to either :)

gojomo commented 8 years ago

From a quick look at the Wang et al "Too/Two..." paper, the "structured skip-ngram" approach looks very easy to adapt: just another dimension to syn1 and an extra array-index (based on window-offset) during training. (That is to say: probably implementable by an extra parameter slightly changing the skip-gram paths, rather than a whole new path.)
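Purely as an illustration of the shape change (not actual gensim internals; the names here are made up), the structured variant is roughly:

```python
import numpy as np

vocab_size, layer1_size, window = 10000, 100, 5  # illustrative sizes

# plain skip-gram: one output weight matrix, shared by every context position
syn1 = np.zeros((vocab_size, layer1_size))

# structured variant (Wang et al.): an extra leading dimension, i.e. one
# output matrix per relative window position
syn1_structured = np.zeros((2 * window, vocab_size, layer1_size))

def position_weights(offset):
    # map window offsets -window..-1 and 1..window onto indices 0..2*window-1
    idx = offset + window if offset < 0 else offset + window - 1
    return syn1_structured[idx]
```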

The "CWindow" approach appears a little more complicated – requiring a new training-mode – but very similar to the dm_concat mode experimentally implemented in Doc2Vec. So that dm_concat code could be used as a model (or maybe, with a really inspired bit of refactoring, be shared for both uses).

For these features, modifications to the existing Word2Vec (and maybe Doc2Vec) classes make sense.

From an even quicker look at the Faruqui et al. 'retrofit' paper, it seems a little like the learning-a-projection-to-a-new-space technique of the machine-translation word2vec papers, or the 'vocabulary expansion' of 'Skip-Thoughts' section 2.2. However, the training goal isn't correlation with some other word-space, but conformance with the desired distances implied by the lexicons. (Is that right?)

I don't have a strong sense of whether this would fit better as a method on the existing classes that applies the retrofit, or as some external utility that works on Word2Vec-model-like objects. Features like this suggest to me that we might want to refactor the sets-of-string-named-vectors to stand outside the full Word2Vec/Doc2Vec models, so I've made an issue (#549) to discuss that idea.

iamtrask commented 8 years ago

If you're considering embeddings that model order (such as wang2vec), I'll shamelessly plug the approaches in http://arxiv.org/pdf/1506.02338.pdf. The primary advantage is a 40% reduction in word-analogy error over word2vec.

piskvorky commented 8 years ago

Nice! The wishlist is getting bigger... Christmas is coming, time for a pull request? :)

WenchenLi commented 7 years ago

Just a quick catch-up: has anyone ported wang2vec to gensim?