Closed: NicolasWinckler closed this issue 7 years ago
Hi Nicolas, the reason is that skip-gram models do not work well for doc2vec. Try a CBOW model instead (negative sampling is preferred), for example cb_ns_500_10.w2v. From my experience, the best models for English texts are:
Regarding model updates: unfortunately this is a weak point of word2vec - the model must be completely retrained on a new/merged corpus.
Regards, Max
Hi Max, thanks a lot for your prompt reply! Ah OK, I didn't know skip-gram was not good for doc2vec. Yes, indeed the results are better using cb_ns_500_10.w2v. Here is the output for information:
4: 0.987355 6: 0.984883 3: 0.970437 7: 0.921851 1: 0.865788 2: 0.850784 5: 0.767465
thanks again, Regards,
Nicolas
Hello, thank you very much for sharing your code.
I have tried the doc2vec example with several models, including the four pre-trained English models available on your GitHub and the one obtained from the original Google code and data, but I could not reproduce the results reported in
https://github.com/maxoodf/word2vec/blob/master/examples/doc2vec/main.cpp
4: 0.976313 6: 0.971176 3: 0.943542 7: 0.850593 1: 0.749066 2: 0.724662 5: 0.587743
The order is the same, but the cosine similarity values are much closer to each other (and much less discriminative). For example, this is what I obtain with the sg_hs_500_10.w2v model:
4: 0.995932 6: 0.995018 3: 0.992355 7: 0.981416 1: 0.969636 2: 0.969345 5: 0.953782
Do you know the reason for this difference?
I had another question: is there a possibility to merge word2vec models, or to update word2vec models with a new corpus?
Thanks for your help