Open cjmcmurtrie opened 8 years ago
Hi,
Thank you for the input; it seems that it only loads ASCII. I'll try to detect UTF-8 encoding and implement support for it in the next few days.
Best,
Michael.
On Tue, Feb 9, 2016 at 11:52 AM, cjmcmurtrie notifications@github.com wrote:
Hi there, thanks for this very useful tool.
This seems to work perfectly with the pre-trained Google Word2Vec model, but I am having issues processing new models that I trained using that code.
The (saved as binary) models trained with word2vec.c work correctly in the demos implemented and provided by Mikolov in the package, eg:
Enter word or sentence (EXIT to break): hello
Word: hello Position in vocabulary: 3560
Word        Cosine distance
hi          0.538164
hey         0.469036
*(?         0.401341
pedants     0.396846
However, when I try to port the models into my Torch programs, I get a dictionary of vectors such as the following:
D?+?<u?? : FloatTensor - size: 200
xT?<? : FloatTensor - size: 200
????G>???? : FloatTensor - size: 200
It seems to me that the code in bintot7.lua is decoding the binary strings as ASCII rather than UTF-8. In your code, are you explicitly decoding the binary strings as ASCII rather than UTF-8/Unicode? Do you know anything about this and how we could fix it?
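For reference, the .bin format itself looks encoding-agnostic: an ASCII header line with the vocabulary size and dimensionality, then, for each entry, the word as raw bytes terminated by a space, followed by the float32 vector. A minimal Python sketch (the path 'vectors.bin' is just a placeholder) that keeps those bytes and decodes them as UTF-8 rather than ASCII would be something like:
# -*- coding: utf-8 -*-
# Sketch of reading the word2vec C binary format while keeping each word's
# raw bytes and decoding them as UTF-8 instead of ASCII.
import numpy as np

def read_word2vec_bin(path):
    with open(path, 'rb') as f:
        vocab_size, dim = map(int, f.readline().split())  # ASCII header: "<vocab_size> <dim>\n"
        vectors = {}
        for _ in range(vocab_size):
            word = b''
            while True:
                ch = f.read(1)
                if ch == b' ':
                    break
                if ch != b'\n':  # word2vec.c writes a newline after each vector; skip it
                    word += ch
            vec = np.frombuffer(f.read(4 * dim), dtype=np.float32)
            vectors[word.decode('utf-8')] = vec
        return vectors

vectors = read_word2vec_bin('vectors.bin')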
Hey, thanks for getting back Michael. I'll also be working on this and will let you know if I make any progress.
I have an update regarding this.
The models trained straight from the Google C codebase did not read correctly.
However, the following steps made it possible to load them into Torch using your code, with a utf-8 unicode vocabulary:
1. Train the model with the Google C script word2vec.c and save it as binary.
2. Load the model with the Python package Gensim and save it again with Gensim:
from gensim.models import Word2Vec
model = Word2Vec.load_word2vec_format('full-russian//test-russian-vectors.bin', binary=True)
model.save_word2vec_format('full-russian//test-russian-vectors-gensimsaved.bin', binary=True)
3. Load the model in Torch with bintot7.lua.
Following this procedure loads the word vectors correctly, for example:
второму : FloatTensor - size: 200
уклонялся : FloatTensor - size: 200
прозаического : FloatTensor - size: 200
горьких : FloatTensor - size: 200
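As a quick sanity check along these lines (using the same hypothetical paths as in step 2), the Gensim-resaved binary can be re-opened and one of the Cyrillic words above looked up directly:
from gensim.models import Word2Vec

# Re-open the re-saved binary and confirm a Cyrillic word survived the round trip.
model = Word2Vec.load_word2vec_format('full-russian//test-russian-vectors-gensimsaved.bin', binary=True)
word = 'второму'.decode('utf8')
print word in model.vocab   # True
print model[word].shape     # (200,)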
Furthermore, inspecting the contents of the Google-trained model shows that the vocabulary consists of unicode strings rather than byte strings:
print model.most_similar(['софьи'.decode('utf8')])
>>> [(u'\u0446\u0430\u0440\u0435\u0432\u043d\u044b', 0.5405951738357544), (u'\u0435\u0432\u0434\u043e\u043a\u0438\u044f', 0.4162743091583252), ...
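The u'\uXXXX' escapes are just how Python 2 displays unicode strings inside a list; printing an entry directly gives readable Cyrillic:
# The escapes above are only Python 2's repr of unicode strings.
word, score = model.most_similar(['софьи'.decode('utf8')])[0]
print word                  # царевны (assuming a UTF-8 terminal)
print word.encode('utf8')   # the raw UTF-8 bytes, as they appear in the .bin file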
What do you think? Does this clarify anything at all?
Hi cjmcmurtrie,
This is great. Do you mind if I add your solution to the README?
Thanks,
Michael.