rotmanmi / word2vec.torch

word2vec implementation for torch7

New models trained with Google Word2Vec not processed correctly #2

Open cjmcmurtrie opened 8 years ago

cjmcmurtrie commented 8 years ago

Hi there, thanks for this very useful tool.

This seems to work perfectly with the pre-trained Google Word2Vec model, but I am having issues processing new models that I trained using that code.

The models trained with word2vec.c (saved as binary) work correctly in the demos Mikolov implements and provides in the package, e.g.:

Enter word or sentence (EXIT to break): hello

Word: hello  Position in vocabulary: 3560

                                              Word       Cosine distance
------------------------------------------------------------------------
                                                hi      0.538164
                                               hey      0.469036
                                               *(?      0.401341
                                           pedants      0.396846

However, when I try to port the models into my Torch programs, I get a dictionary of vectors such as the following:

  D?+?<u?? : FloatTensor - size: 200
  xT?<? : FloatTensor - size: 200
  ????G>???? : FloatTensor - size: 200

It seems to me that bintot7.lua is decoding the binary strings as ASCII rather than UTF-8. In your code, are you explicitly decoding the binary strings to ASCII rather than UTF-8/Unicode? Do you know anything about this and how we could fix it?
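For what it's worth, the fix may just be a matter of treating each vocabulary entry as raw bytes and only decoding them as UTF-8 once the separating space is found. A minimal Python sketch of the binary format as I understand it (read_word2vec_bin is a hypothetical helper, not part of this repo):

import struct

def read_word2vec_bin(path):
    # Header is an ASCII line "<vocab_size> <dim>\n"; each entry is the
    # word's raw bytes up to a space, then dim little-endian float32s.
    vectors = {}
    with open(path, 'rb') as f:
        vocab_size, dim = map(int, f.readline().split())
        for _ in range(vocab_size):
            chunk = b''
            while True:
                ch = f.read(1)
                if ch == b' ':
                    break
                if ch != b'\n':  # skip the newline ending the previous entry
                    chunk += ch
            # Decoding as UTF-8 (not ASCII) is what keeps non-English
            # vocabularies readable.
            word = chunk.decode('utf-8', errors='replace')
            vectors[word] = struct.unpack('<%df' % dim, f.read(4 * dim))
    return vectors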

rotmanmi commented 8 years ago

Hi,

Thank you for the input; it seems the loader only handles ASCII. I'll try to detect UTF-8 encoding and support it as well in the next few days.
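A cheap way to test for that in Python would be to attempt the decode and catch failures, something like this (looks_like_utf8 is just an illustrative name):

def looks_like_utf8(raw):
    # True when the raw bytes form valid UTF-8.
    try:
        raw.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

Since ASCII is a subset of UTF-8, decoding everything as UTF-8 should also keep plain-English vocabularies working.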

Best,

Michael.


cjmcmurtrie commented 8 years ago

Hey, thanks for getting back, Michael. I'll also be working on this and will let you know if I make any progress.

cjmcmurtrie commented 8 years ago

I have an update regarding this.

The models trained straight from the Google C codebase did not read correctly.

However, the following steps made it possible to load them into Torch using your code, with a UTF-8 Unicode vocabulary:

1. Train the model with the Google C script word2vec.c and save it as binary.
2. Load the model with the Python package Gensim and save it again from Gensim:

from gensim.models import Word2Vec
model = Word2Vec.load_word2vec_format('full-russian//test-russian-vectors.bin', binary=True)
model.save_word2vec_format('full-russian//test-russian-vectors-gensimsaved.bin', binary=True)

3. Load the model in Torch with bintot7.lua.
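(Note for anyone trying this with a newer Gensim release: the word2vec load/save methods later moved to KeyedVectors, so the equivalent round-trip would look roughly like this.)

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('full-russian//test-russian-vectors.bin', binary=True)
model.save_word2vec_format('full-russian//test-russian-vectors-gensimsaved.bin', binary=True)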

Following this procedure loads the word vectors correctly, for example:

  второму : FloatTensor - size: 200
  уклонялся : FloatTensor - size: 200
  прозаического : FloatTensor - size: 200
  горьких : FloatTensor - size: 200

Furthermore, inspecting the contents of the Google-trained model shows that the vocabulary is stored as Unicode strings (displayed by Python as escaped character codes) rather than raw byte strings:

print model.most_similar(['софьи'.decode('utf8')])
>>> [(u'\u0446\u0430\u0440\u0435\u0432\u043d\u044b', 0.5405951738357544), (u'\u0435\u0432\u0434\u043e\u043a\u0438\u044f', 0.4162743091583252), ...
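(Those \u escapes are just Python's repr of the strings; printed, they render as ordinary Cyrillic:)

print(u'\u0446\u0430\u0440\u0435\u0432\u043d\u044b')  # -> царевны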

What do you think? Does this clarify anything at all?

rotmanmi commented 8 years ago

Hi cjmcmurtrie,

This is great. Do you mind if I add your solution to the README?

Thanks,

Michael.