pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

Binary word2vec (GoogleNews-vectors-negative300.bin) Decoding Error #338

Open tuzhucheng opened 6 years ago

tuzhucheng commented 6 years ago

I am getting an exception when I use GoogleNews-vectors-negative300.bin with torchtext v0.2.3 and Python 3.6.

from torchtext.vocab import Vectors
vectors = Vectors(name='GoogleNews-vectors-negative300.bin', cache='/directory/to/word2vec')

The exception is ValueError: could not convert string to float. For each line of the utf-8 binary word2vec file, torchtext currently splits the line into the word and its vector like this:

entries = line.rstrip().split(b" " if binary_lines else " ")
word, entries = entries[0], entries[1:]

However, after this block of code executes, entries has a length of 3 rather than the expected 300 for the first non-header line in GoogleNews-vectors-negative300.bin, which corresponds to </s>.
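
The effect can be reproduced on a synthetic record. The sketch below is not torchtext code. It assumes the binary word2vec layout (the word, one space, then the vector as raw little-endian float32 bytes) and uses made-up 3-dimensional values. Because a raw float byte can equal 0x20, the space character, splitting on b" " cuts the vector at arbitrary places:

```python
import struct

# Synthetic record, assuming the binary word2vec layout:
# word, one space, raw little-endian float32 values, newline.
# 2.5 is chosen because its float32 encoding contains the byte 0x20 (space).
line = b"</s> " + struct.pack("<3f", 2.5, 1.0, 2.5) + b"\n"

# Same split that is quoted above from torchtext (binary_lines=True case).
entries = line.rstrip().split(b" ")
word, entries = entries[0], entries[1:]

print(word, len(entries))

# The pieces are raw bytes, not decimal text, so float() rejects them.
try:
    [float(x) for x in entries]
    conversion_failed = False
except ValueError:
    conversion_failed = True

print("float conversion failed:", conversion_failed)
```

The number of pieces depends only on how many 0x20 bytes happen to occur in the raw vector, so it generally does not match the vector dimension.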

I propose we first decode each line and then split by " ". What do you think? Thanks!
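
For comparison, here is a minimal sketch of an alternative that avoids line splitting altogether and reads each record by length. It is an assumption-laden illustration, not torchtext's implementation: read_word2vec_binary is a hypothetical helper, the header is assumed to be "vocab_size dim" as text, and each vector is assumed to be dim little-endian float32 values following the word and a single space.

```python
import io
import struct


def read_word2vec_binary(stream):
    """Hypothetical reader for the assumed binary word2vec layout."""
    # Header line is text: "<vocab_size> <dim>\n".
    vocab_size, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        # Read the word byte by byte up to the separating space,
        # skipping any newline left over from the previous record.
        word = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of file while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":
                word.extend(ch)
        # Read exactly dim float32 values (4 bytes each), whatever bytes they contain.
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of file while reading a vector")
        vectors[word.decode("utf-8", errors="replace")] = struct.unpack(f"<{dim}f", raw)
    return vectors


# Tiny in-memory file with made-up values, standing in for the real .bin file.
sample = (
    b"2 3\n"
    + b"</s> " + struct.pack("<3f", 2.5, 1.0, 2.5) + b"\n"
    + b"in " + struct.pack("<3f", 0.5, -1.0, 4.0) + b"\n"
)
vectors = read_word2vec_binary(io.BytesIO(sample))
print(vectors["</s>"])
```

Reading a fixed number of bytes per vector means that space or newline bytes inside the float data are never treated as separators.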

zhangguanheng66 commented 5 years ago

@tuzhucheng do you still have a similar issue under 0.4.0? If so, please attach a script to reproduce the error and I'm happy to take a look.