medallia / Word2VecJava

Word2Vec Java Port

Won't read data from UTF-8 model created by C version of word2vec #44

Open gerryhocks opened 7 years ago

gerryhocks commented 7 years ago

Hello,

As it stands, the code cannot read a UTF-8 vocab from a word2vec binary model created with the C version of word2vec.

This is because each byte of a vocab word is appended to a string buffer as if it were a complete character, so any multi-byte UTF-8 sequence ends up as several wrong characters.
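To illustrate the difference, here is a minimal standalone sketch. The class and method names (`Utf8VocabDemo`, `decodeByteAsChar`, `decodeUtf8`) are made up for this example and are not part of Word2VecJava, and it assumes the reader casts each byte to a `char`, which is how I read the existing code:

```java
import java.nio.charset.StandardCharsets;

public class Utf8VocabDemo {
    // Hypothetical stand-in for the current behaviour: one byte becomes one char.
    public static String decodeByteAsChar(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) b); // bytes >= 0x80 are negative and turn into unrelated chars
        }
        return sb.toString();
    }

    // Collect the raw bytes first, then decode the whole word as UTF-8.
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "naïve": the ï (U+00EF) takes two bytes in UTF-8, so the word is 6 bytes long.
        byte[] raw = "na\u00efve".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8(raw).equals("na\u00efve"));       // true
        System.out.println(decodeByteAsChar(raw).equals("na\u00efve")); // false
        System.out.println(decodeByteAsChar(raw).length());             // 6, not 5
    }
}
```

Words made up only of ASCII characters come out the same either way, which would explain why the problem only shows up with non-English vocabularies.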

A workaround/hack like this in the fromBinFile() method of Word2VecModel.java gets around the issue and probably still works for single-byte characters:

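            // scratch buffer for the raw bytes of one vocab word (assumes no word exceeds 1024 bytes)
            byte[] buff = new byte[1024];
            for (int lineno = 0; lineno < vocabSize; lineno++) {
                // read vocab
                int bpos = 0;
                byte b = buffer.get();
                // a word ends at the first space
                while (b != ' ') {
                    // skip newlines left over from the previous entry
                    if (b != '\n') {
                        buff[bpos++] = b;
                    }
                    b = buffer.get();
                }
                // decode the collected bytes as a single UTF-8 string
                vocabs.add(new String(buff, 0, bpos, "UTF-8"));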