maxoodf / word2vec

word2vec++ is a Distributed Representations of Words (word2vec) library and tools implementation, written in C++11 from scratch
Apache License 2.0

why normalisation when loading the model #10

Closed jwijffels closed 4 years ago

jwijffels commented 4 years ago

Hello @maxoodf I'm checking the package a bit alongside the models available at https://nlp.h-its.org/bpemb. I noticed that when loading the model, you basically standardise the embeddings, as in https://github.com/maxoodf/word2vec/blob/master/lib/word2vec.cpp#L187. But when computing the nearest words based on a distance, https://github.com/maxoodf/word2vec/blob/master/include/word2vec.hpp#L178, you do this again. This seems to do things twice. Is this expected behaviour?

maxoodf commented 4 years ago

Hello @jwijffels The first is vector normalisation; I have to do it on data loading for compatibility with the original data format. I think performing the normalisation on data saving would be more efficient, but Mikolov's approach is to perform the normalisation on loading. The second is a distance calculation between two vectors and is not related to the normalisation.
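In rough terms, the two operations look like this (a minimal sketch with hypothetical helper names, not the library's actual code):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// L2-normalise a vector in place, as the loader does once per word vector.
void normalize(std::vector<float> &v) {
    float norm = 0.0f;
    for (float x : v) norm += x * x;
    norm = std::sqrt(norm);
    if (norm > 0.0f) {
        for (float &x : v) x /= norm;
    }
}

// Once both vectors are unit-length, their dot product is already the
// cosine similarity, so this "distance" step is not a second normalisation.
float distance(const std::vector<float> &a, const std::vector<float> &b) {
    float dot = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) dot += a[i] * b[i];
    return dot;
}
```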

jwijffels commented 4 years ago

My point is basically that there will be discrepancies between the following two approaches:

  1. build a model and use it directly to do distance calculation using the nearest functionalities
  2. build the model, next save the model, load it back in and next perform distance calculation using the model

Approach 1 will do the distance calculation on the non-normalised embeddings, while approach 2 will do it on the normalised embeddings, yielding different results.
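A small example of why the two approaches can disagree (a sketch, not the library's code): vector length affects a raw dot product but not the cosine similarity that normalisation produces, so nearest-neighbour rankings can differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

float dot(const std::vector<float> &a, const std::vector<float> &b) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Cosine similarity: the dot product after dividing out both lengths,
// i.e. what you get when comparing normalised embeddings.
float cosine(const std::vector<float> &a, const std::vector<float> &b) {
    return dot(a, b) / (std::sqrt(dot(a, a)) * std::sqrt(dot(b, b)));
}
```

With a query `q = (1, 0)`, a long candidate `u = (10, 10)` beats a short same-direction candidate `v = (1, 0)` on the raw dot product, but loses on cosine similarity.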

maxoodf commented 4 years ago

You are right, but the first approach is not intended to be used.

jwijffels commented 4 years ago

Good to know. Unfortunately, that's how I implemented it at https://github.com/bnosac/word2vec, going blindly by the headers. I basically return a pointer to the model when running word2vec so that users can work directly with it, instead of doing a round trip of saving the model and loading it back in to get normalised vectors. So your advice is to first save the model and then load it back before doing distance calculations.

jwijffels commented 4 years ago

In that case, I'll write something myself to perform this normalisation directly after training, skip the normalisation when loading the model by default, and leave the option in there to normalise if a model was not generated by your version of word2vec. Something similar to commit https://github.com/bnosac/word2vec/commit/19b9603ec3d2788bb9b63c2f3b3a510cac83ea2e. That will be faster when loading as well. Is there a reason why normalisation does not happen in d2vModel_t?
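A minimal sketch of that post-training normalisation pass (hypothetical names; assuming the embeddings sit in one flat row-major array, which may differ from the library's actual storage):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalise every row of a flat [words x dim] embedding matrix in place,
// once, right after training, so the loader can skip its per-vector pass.
void normalizeAll(std::vector<float> &embeddings, std::size_t dim) {
    for (std::size_t off = 0; off + dim <= embeddings.size(); off += dim) {
        float norm = 0.0f;
        for (std::size_t j = 0; j < dim; ++j)
            norm += embeddings[off + j] * embeddings[off + j];
        norm = std::sqrt(norm);
        if (norm > 0.0f)
            for (std::size_t j = 0; j < dim; ++j)
                embeddings[off + j] /= norm;
    }
}
```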

maxoodf commented 4 years ago

d2vModel_t is a set of average vectors calculated from already-normalised vectors, so I do not see any reason to normalise the d2vModel_t vectors once again.

jwijffels commented 4 years ago

I'll go for option 2 and implement that as part of the R package. Thanks for the input on d2vModel_t.