I am getting an exception when I use GoogleNews-vectors-negative300.bin with torchtext v0.2.3 and Python 3.6. The exception is ValueError: could not convert string to float. For each line of the utf-8 binary word2vec file, torchtext currently splits it into the word and the word vector like this:
entries = line.rstrip().split(b" " if binary_lines else " ")
word, entries = entries[0], entries[1:]
However, after this block of code executes, entries has a length of 3 for the first non-header line in GoogleNews-vectors-negative300.bin, which corresponds to </s>, rather than the 300 values expected for these 300-dimensional vectors.
I propose we first decode each line and then split by " ". What do you think? Thanks!
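A minimal sketch of the decode-then-split proposal, for discussion only. The helper name `split_entry` is hypothetical and is not part of torchtext; `binary_lines` mirrors the flag in the snippet quoted in this issue. The sketch assumes every line is UTF-8 text of the form `word v1 v2 ...`. Whether that holds for GoogleNews-vectors-negative300.bin is not confirmed here: a file that stores vector components as packed binary floats would fail at the decode step and would need a different reader.

```python
def split_entry(line, binary_lines):
    """Hypothetical helper: decode a line first, then split on a single space.

    Assumes the line is UTF-8 text such as "word 0.1 0.2 0.3".
    """
    if binary_lines:
        # Lines read in binary mode arrive as bytes; turn them into str first.
        line = line.decode("utf-8")
    entries = line.rstrip().split(" ")
    word, entries = entries[0], entries[1:]
    # Convert the remaining fields to floats; raises ValueError on bad input.
    return word, [float(x) for x in entries]


# Example usage with a made-up three-dimensional vector.
word, vector = split_entry(b"hello 0.1 0.2 0.3\n", binary_lines=True)
print(word, len(vector))
```

With this shape the rest of the loading code would always work with str values, regardless of how the file was opened.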