Closed jwijffels closed 4 years ago
Hello @jwijffels The first is vector normalisation; I have to do it on data loading for compatibility with the original data format. Normalising on saving would be more efficient, but Mikolov's approach is to perform the normalisation on loading. The second is a distance calculation between two vectors, which is not related to the normalisation.
My point is that there will basically be discrepancies between the following 2 approaches.
Approach 1 does the distance calculation on the non-normalised embeddings, while approach 2 does it on the normalised embeddings, yielding different results.
You are right, but the 1st approach is not intended to be used.
Good to know. Unfortunately, that's how I implemented it at https://github.com/bnosac/word2vec, going blindly by the headers. I'm basically returning a pointer to the model when running word2vec so that users can work with it directly instead of doing a round trip of saving and loading it back in to get normalised vectors. So your advice is to first save the model and then load it back before doing any distance calculation.
In that case, I'll write something myself to perform this normalisation directly after training, skip the normalisation when loading the model by default, but leave the option in there to normalise if a model was not generated by your version of word2vec. Something similar to commit https://github.com/bnosac/word2vec/commit/19b9603ec3d2788bb9b63c2f3b3a510cac83ea2e That will indeed be faster when loading as well. Is there a reason why normalisation is not happening in the d2vModel_t?
d2vModel_t is a set of average vectors calculated from normalised vectors; I do not see any reason to normalise the d2vModel_t vectors again.
I'll go for option 2 and implement that as part of the R package. Thanks for the input on d2vModel_t.
Hello @maxoodf I'm checking the package alongside the models available at https://nlp.h-its.org/bpemb I noticed that when loading the model, you basically normalise the embeddings, as in https://github.com/maxoodf/word2vec/blob/master/lib/word2vec.cpp#L187 But when computing the nearest neighbours based on a distance, https://github.com/maxoodf/word2vec/blob/master/include/word2vec.hpp#L178 you do this again. This seems to do things twice. Is this expected behaviour?