Open gerryhocks opened 7 years ago
Hello,
The code as it stands won't read a UTF-8 vocab from a word2vec binary model created using the C version of word2vec.
This is because each byte of a vocab word is appended to a string buffer as if it were a complete character, which breaks multi-byte UTF-8 sequences.
A workaround/hack like this in Word2VecModel.java's fromBinFile() method gets around this issue and probably still works for single-byte characters:
    byte[] buff = new byte[1024];
    for (int lineno = 0; lineno < vocabSize; lineno++) {
        // read vocab
        int bpos = 0;
        byte b = buffer.get();
        while (b != ' ') {
            if (b != '\n') {
                buff[bpos++] = b;
            }
            b = buffer.get();
        }
        vocabs.add(new String(buff, 0, bpos, "UTF-8"));
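For anyone who wants to reproduce the difference outside the library, here is a minimal standalone sketch. The class name `Utf8VocabSketch` and both method names are made up for illustration and are not part of the project. `readTokenPerByte` is only my assumption of what per-byte appending looks like, not a copy of the existing code. `readTokenUtf8` follows the same idea as the workaround above, but collects the bytes in a `ByteArrayOutputStream` so there is no fixed 1024-byte limit.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class Utf8VocabSketch {

    // Assumed shape of the problematic approach: every byte becomes one char,
    // so a multi-byte UTF-8 sequence turns into several wrong characters.
    static String readTokenPerByte(ByteBuffer buffer) {
        StringBuilder sb = new StringBuilder();
        byte b = buffer.get();
        while (b != ' ') {
            if (b != '\n') {
                sb.append((char) b);
            }
            b = buffer.get();
        }
        return sb.toString();
    }

    // Collect the raw bytes up to the separating space, then decode them
    // once as UTF-8. Newlines left over from the previous entry are skipped.
    static String readTokenUtf8(ByteBuffer buffer) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        byte b = buffer.get();
        while (b != ' ') {
            if (b != '\n') {
                bytes.write(b);
            }
            b = buffer.get();
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u00fcber" is a word with a two-byte UTF-8 character, followed by
        // the space that separates the word from its vector in the file.
        byte[] raw = "\n\u00fcber ".getBytes(StandardCharsets.UTF_8);
        System.out.println(readTokenUtf8(ByteBuffer.wrap(raw)));
        System.out.println(readTokenPerByte(ByteBuffer.wrap(raw)).length());
    }
}
```

Pure ASCII words come out identical from both methods, which matches the expectation that the workaround still works for single-byte characters.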