maxoodf / word2vec

word2vec++ is a Distributed Representations of Words (word2vec) library and tools implementation, written in C++11 from scratch
Apache License 2.0

No Compatibility with standard word2vec formats #7

Closed oleksii-sl closed 6 years ago

oleksii-sl commented 6 years ago

It seems that it's not actually compatible. I took text-format vectors generated by gensim: https://drive.google.com/file/d/1fbrVJeVlkrA8r4J-LHyjEMY2vmb3OliE/view?usp=sharing and got this:

$ ./king2queen word2vec_format_sm.txt 
.124246: 0.763318
000148: 0.728877
.007798: 0.699526
124530: 0.696778
-0.008152: 0.696077
0.070740: 0.693602
40459: 0.67723
03326: 0.672958
38744: 0.669238

I also took the pre-trained vectors (GoogleNews-vectors-negative300.bin.gz) from here: https://code.google.com/archive/p/word2vec/

Gensim works fine with the Google binary format:

In [1]: from gensim.models.keyedvectors import KeyedVectors

In [2]: word_vectors = KeyedVectors.load_word2vec_format('/home/oleksii/Downloads/GoogleNews-vectors-negative300.bin', binary=True)

In [3]: word_vectors.most_similar('king')
Out[3]: 
[('kings', 0.7138045430183411),
 ('queen', 0.6510956287384033),
 ('monarch', 0.6413194537162781),
 ('crown_prince', 0.6204219460487366),
 ('prince', 0.6159993410110474),
 ('sultan', 0.5864823460578918),
 ('ruler', 0.5797567367553711),
 ('princes', 0.5646552443504333),
 ('Prince_Paras', 0.5432944297790527),
 ('throne', 0.5422105193138123)]

but this command doesn't work:

$ ./king2queen ~/Downloads/GoogleNews-vectors-negative300.bin
model: wrong model file format

maxoodf commented 6 years ago

I think you are using word2vec models in the text format, but word2vec++ does not support it. Try using the binary format instead.
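For readers who only have a text-format model, a conversion to the binary layout can be sketched as follows. This is a minimal illustration, not part of word2vec++ or gensim: the function name `text_to_binary` and the `trailing_newline` flag are hypothetical, and it assumes the binary layout is the same header line followed by, per record, the word, a space, the raw 32-bit little-endian floats, and (optionally) a newline.

```python
import struct


def text_to_binary(text_path, bin_path, trailing_newline=True):
    """Convert a word2vec text-format file to the assumed binary layout.

    Hypothetical helper: assumes little-endian 32-bit floats and a
    "<word count> <vector size>" header line in both formats.
    """
    with open(text_path, "r", encoding="utf-8") as src, open(bin_path, "wb") as dst:
        # The header line is plain text in both formats.
        count, size = map(int, src.readline().split())
        dst.write(f"{count} {size}\n".encode("utf-8"))
        for _ in range(count):
            parts = src.readline().rstrip("\n").split(" ")
            word = parts[0]
            values = [float(v) for v in parts[1:] if v]
            if len(values) != size:
                raise ValueError(f"expected {size} values for {word!r}, got {len(values)}")
            # Record: word, one space, then the raw float bytes.
            dst.write(word.encode("utf-8") + b" ")
            dst.write(struct.pack(f"<{size}f", *values))
            if trailing_newline:
                dst.write(b"\n")  # assumed record separator; some writers omit it
    return count, size
```

Whether the reader expects the newline after each vector depends on the tool, so the flag is left to the caller.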

oleksii-sl commented 6 years ago

Please read my message more carefully. I tried both the binary format (from Google) and the text format (my own), and both fail.

maxoodf commented 6 years ago

I've downloaded the GoogleNews-vectors-negative300.bin model, and it looks like its format differs from the original one:

[words_number][sp][vector_size][nl][word1][sp][vector1][nl][word2][sp][vector2][nl]...[wordN][sp][vectorN][nl]

where [sp] is the space char (0x20) and [nl] is the newline char (0x0A). But I do not see [nl] chars between [vector] and [word] in the downloaded model, although the original code writes these [nl] chars. You can make word2vec++ compatible with the downloaded model format by changing the following line of code:

offset += m_vectorSize * sizeof(float) + sizeof(char); // vector size + '\n' char

to

offset += m_vectorSize * sizeof(float); // vector size

and the line

if (static_cast<off_t>(++offset + m_vectorSize * sizeof(float) + sizeof(char)) > input.size()) {

to

if (static_cast<off_t>(++offset + m_vectorSize * sizeof(float)) > input.size()) {
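The offset arithmetic discussed above can be illustrated with a small reader that accepts both variants, with or without the newline after each vector. This is only a sketch under stated assumptions, not the word2vec++ implementation: the function name `read_word2vec_binary` is hypothetical, floats are assumed to be 4-byte little-endian, and the whole file is assumed to fit in memory as a bytes object.

```python
import struct


def read_word2vec_binary(data):
    """Parse word2vec binary bytes into {word: tuple_of_floats}.

    Hypothetical helper: tolerates an optional '\n' after each vector.
    """
    # Header: "<word count> <vector size>\n" in plain text.
    header_end = data.index(b"\n")
    count, size = map(int, data[:header_end].split())
    offset = header_end + 1
    vec_bytes = size * 4  # assumes sizeof(float) == 4

    vectors = {}
    for _ in range(count):
        # Skip the optional newline left over from the previous record.
        if data[offset:offset + 1] == b"\n":
            offset += 1
        # The word runs up to the next space char (0x20).
        space = data.index(b" ", offset)
        word = data[offset:space].decode("utf-8", errors="replace")
        offset = space + 1
        # Same bounds check idea as in the thread, without the extra char.
        if offset + vec_bytes > len(data):
            raise ValueError("wrong model file format: truncated vector")
        vectors[word] = struct.unpack_from(f"<{size}f", data, offset)
        offset += vec_bytes
    return vectors
```

Checking for the newline at the start of each record, instead of always adding `sizeof(char)` to the offset, is what lets one reader handle both layouts.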

maxoodf commented 6 years ago

These changes are committed.