koreyou / word_embedding_loader

Loaders and savers for different implementations of word embedding
MIT License

Loading binary word2vec format fails on python3 #6

Closed. koreyou closed this issue 7 years ago

koreyou commented 7 years ago

Loading GoogleNews-vectors-negative300.bin from the original word2vec website fails. The loaded word embedding has a shape of (0, 4687957).

    from word_embedding_loader import WordEmbedding
    wv = WordEmbedding.load("GoogleNews-vectors-negative300.bin")
    print(wv.vectors.shape)
    # (0, 4687957)

This reproduces on at least Python 3.4.5 and 3.5.2.

koreyou commented 7 years ago

The problem seems to lie here:

https://github.com/koreyou/word_embedding_loader/blob/develop/word_embedding_loader/loader/word2vec_bin.pyx#L107-L109

    cdef long long words, size
    fscanf(f, '%lld', &words)
    fscanf(f, '%lld', &size)

Inspecting words and size after these calls prints 0 and 4687957, respectively.
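
For reference, the word2vec binary format starts with a plain-text header line holding the vocabulary size and the vector dimension, which is what the two fscanf calls are meant to parse. Below is a minimal Python sketch of that header read, useful as a cross-check on what the file really contains. The helper name read_word2vec_header is hypothetical and not part of word_embedding_loader, and the example uses a small in-memory stand-in rather than the real GoogleNews file.

```python
import io


def read_word2vec_header(f):
    """Read the '<words> <size>' header line from a file opened in binary mode.

    Hypothetical helper for illustration; not part of word_embedding_loader.
    """
    header = f.readline()  # bytes up to and including the first newline
    words, size = (int(x) for x in header.decode("ascii").split())
    return words, size


# Synthetic stand-in for a word2vec .bin file: header line, then raw payload.
fake_file = io.BytesIO(b"2 3\n" + b"\x00" * 32)
words, size = read_word2vec_header(fake_file)
print(words, size)
```

Running the same helper on the real file, opened with open(path, "rb"), and comparing the result with the values the Cython loader reports should indicate whether the file itself or the fscanf parsing is at fault.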