maxoodf / word2vec

word2vec++ is an implementation of Distributed Representations of Words (word2vec), as a library and a set of tools, written in C++11 from scratch
Apache License 2.0

[question] doc2vec results + merging/updating models #1

Closed · NicolasWinckler closed this issue 7 years ago

NicolasWinckler commented 7 years ago

Hello, thank you very much for sharing your code.

I have tried the doc2vec example with several models, including the four pre-trained English models available on your GitHub and the one obtained from the original Google code and data, and I could not reproduce the results reported in

https://github.com/maxoodf/word2vec/blob/master/examples/doc2vec/main.cpp

4: 0.976313 6: 0.971176 3: 0.943542 7: 0.850593 1: 0.749066 2: 0.724662 5: 0.587743

The order is the same, but the cosine similarity values are much closer to each other (and much less discriminative). For example, this is what I obtain with the sg_hs_500_10.w2v model:

4: 0.995932 6: 0.995018 3: 0.992355 7: 0.981416 1: 0.969636 2: 0.969345 5: 0.953782

Do you know the reason for this difference?
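Just to check my understanding of what the numbers measure: as far as I can tell, doc2vec here builds a document vector from the word vectors (roughly an average) and compares documents by cosine similarity. A minimal self-contained C++11 sketch of that computation, with illustrative names rather than the library's actual API:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Build a document vector by averaging its word vectors
// (wordVecs is assumed non-empty, all vectors the same size).
std::vector<float> docVector(const std::vector<std::vector<float>> &wordVecs) {
    std::vector<float> doc(wordVecs.front().size(), 0.0f);
    for (const auto &w : wordVecs) {
        for (std::size_t i = 0; i < doc.size(); ++i) {
            doc[i] += w[i];
        }
    }
    for (auto &x : doc) {
        x /= static_cast<float>(wordVecs.size());
    }
    return doc;
}

// Cosine similarity between two document vectors of equal size.
float cosine(const std::vector<float> &a, const std::vector<float> &b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```

If that matches what the example does, the numbers above are plain cosine similarities between averaged document vectors.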

I had another question: is it possible to merge word2vec models, or to update a word2vec model with a new corpus?

Thanks for your help

maxoodf commented 7 years ago

Hi Nicolas, the reason is that skip-gram models do not work well for doc2vec. Try CBOW instead (negative sampling is preferred), for example cb_ns_500_10.w2v. In my experience, the best models for English texts are:

  1. CBOW/Negative sampling, vector size = 300, window = 10, stop words removed ('the', 'of', 'and', 'to', 'in', 'a', 's', 'for', 'is', 'that', 'was', 'on', 'with', etc.; see the filtering sketch after this list)
  2. CBOW/Negative sampling, vector size = 500, window = 10, stop words removed
  3. CBOW/Negative sampling, vector size = 750, window = 5, stop words removed
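
A minimal C++11 sketch of the stop-word filtering step, assuming tokenized input; the stop list here is only the illustrative subset above, not the full list the pre-trained models used:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Remove common stop words from a tokenized document before
// training or before building a document vector.
std::vector<std::string> removeStopWords(const std::vector<std::string> &tokens) {
    // Illustrative subset; the actual training used a larger stop list.
    static const std::unordered_set<std::string> stopWords = {
        "the", "of", "and", "to", "in", "a", "s",
        "for", "is", "that", "was", "on", "with"
    };
    std::vector<std::string> filtered;
    filtered.reserve(tokens.size());
    for (const auto &t : tokens) {
        if (stopWords.find(t) == stopWords.end()) {
            filtered.push_back(t);
        }
    }
    return filtered;
}
```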

Regarding model updates: unfortunately, that is a weak point of word2vec - the model must be completely retrained on the new/merged corpus.

Regards, Max

NicolasWinckler commented 7 years ago

Hi Max, thanks a lot for your prompt reply! Ah, OK, I didn't know skip-gram was not good for doc2vec. Indeed, the results are better using cb_ns_500_10.w2v. Here is the output, for information:

4: 0.987355 6: 0.984883 3: 0.970437 7: 0.921851 1: 0.865788 2: 0.850784 5: 0.767465

Thanks again. Regards,

Nicolas