Closed oleksii-sl closed 6 years ago
I think you are using word2vec models in the text format, but word2vec++ does not support it. Try using the binary format instead.
Please read my message more carefully. I tried both the binary format (from Google) and the text format (my own), and both fail.
I've downloaded the GoogleNews-vectors-negative300.bin model, and it looks like its format differs from the original one:
[words_number][sp][vector_size][nl][word1][sp][vector1][nl][word2][sp][vector2][nl]...[wordN][sp][vectorN][nl]
where [sp] is the space char (0x20) and [nl] is the new line char (0x0A).
But I do not see [nl] chars between [vector] and [word] in the downloaded model.
The original code line with these [nl] chars.
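For illustration, here is a minimal sketch of a reader that accepts both layouts by skipping an optional [nl] before each word instead of assuming one after each vector. It is not the word2vec++ implementation: Entry and parseWord2VecBinary are hypothetical names, the whole file is assumed to be already in memory as a std::string, and the vectors are assumed to be stored as native floats.

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for one parsed record; not a word2vec++ type.
struct Entry {
    std::string word;
    std::vector<float> vec;
};

// Parses "<count> <dim>\n" followed by <count> records of
// "<word> <dim raw floats>", each optionally followed by '\n'.
std::vector<Entry> parseWord2VecBinary(const std::string &buf) {
    const std::size_t nl = buf.find('\n');
    const std::size_t sp = buf.find(' ');
    if (nl == std::string::npos || sp == std::string::npos || sp > nl) {
        throw std::runtime_error("malformed header");
    }
    const std::size_t count = std::stoull(buf.substr(0, sp));
    const std::size_t dim = std::stoull(buf.substr(sp + 1, nl - sp - 1));
    const std::size_t bytes = dim * sizeof(float);

    std::size_t pos = nl + 1;  // first byte after the header line
    std::vector<Entry> out;
    for (std::size_t i = 0; i < count; ++i) {
        // Tolerate the optional '\n' left over from the previous record.
        while (pos < buf.size() && buf[pos] == '\n') {
            ++pos;
        }
        const std::size_t end = buf.find(' ', pos);
        if (end == std::string::npos) {
            throw std::runtime_error("truncated word");
        }
        Entry e;
        e.word = buf.substr(pos, end - pos);
        pos = end + 1;  // skip the [sp] separator
        if (bytes > buf.size() - pos) {
            throw std::runtime_error("truncated vector");
        }
        e.vec.resize(dim);
        if (bytes > 0) {
            std::memcpy(e.vec.data(), buf.data() + pos, bytes);
        }
        pos += bytes;
        out.push_back(std::move(e));
    }
    return out;
}
```

Because the vector length is known from the header, the raw float bytes are never scanned for separators; only the byte right after a vector is checked for [nl].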
You can make word2vec++ compatible with the downloaded model format by changing the following line of code:
offset += m_vectorSize * sizeof(float) + sizeof(char); // vector size + '\n' char
to
offset += m_vectorSize * sizeof(float); // vector size
and line
if (static_cast<off_t>(++offset + m_vectorSize * sizeof(float) + sizeof(char)) > input.size()) {
to
if (static_cast<off_t>(++offset + m_vectorSize * sizeof(float)) > input.size()) {
These changes are committed.
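Both edits amount to dropping sizeof(char) from the per-record size. A minimal sketch of the same arithmetic with the trailing [nl] made a switch rather than hard-coded could look like this; vectorBytes and fitsInInput are hypothetical helper names, not part of word2vec++, and offset is assumed to point at the first byte of a vector:

```cpp
#include <cstddef>

// Bytes occupied by one vector, plus the optional trailing '\n'.
constexpr std::size_t vectorBytes(std::size_t vectorSize, bool trailingNewline) {
    return vectorSize * sizeof(float) + (trailingNewline ? sizeof(char) : 0);
}

// True when a vector starting at `offset` lies fully inside the input.
// Written as a subtraction so the check cannot overflow.
constexpr bool fitsInInput(std::size_t offset, std::size_t vectorSize,
                           bool trailingNewline, std::size_t inputSize) {
    return offset <= inputSize &&
           vectorBytes(vectorSize, trailingNewline) <= inputSize - offset;
}
```

With 300-dimensional vectors and 4-byte floats, this gives 1200 bytes per vector for the downloaded model and 1201 for the original layout.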
It seems that it's not actually compatible. I took text-format vectors generated by gensim: https://drive.google.com/file/d/1fbrVJeVlkrA8r4J-LHyjEMY2vmb3OliE/view?usp=sharing and I'm getting this:
I also took the pre-trained vectors from here: https://code.google.com/archive/p/word2vec/ (GoogleNews-vectors-negative300.bin.gz)
Gensim works fine with the Google binary format:
but this command doesn't work: