medallia / Word2VecJava

Word2Vec Java Port
MIT License
186 stars 81 forks

Split large model files across multiple buffers. #29

Open jkinkead opened 9 years ago

jkinkead commented 9 years ago

Fixes #28 by splitting large models into multiple buffers in memory.
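As a rough illustration of the idea only (not the code in this pull request), splitting storage across buffers could look like the sketch below. The class name `SplitBuffers`, the method `allocate`, and its parameters are hypothetical; only `java.nio.DoubleBuffer` is a real API. It assumes each buffer holds a whole number of vectors, so that no vector straddles two buffers.

```java
import java.nio.DoubleBuffer;

public class SplitBuffers {
    /**
     * Allocates enough buffers to hold vocabSize vectors of layerSize doubles,
     * with at most vectorsPerBuffer vectors in any single buffer.
     * (Hypothetical helper for illustration; not part of Word2VecJava.)
     */
    static DoubleBuffer[] allocate(int vocabSize, int layerSize, int vectorsPerBuffer) {
        // Ceiling division: the last buffer may be only partially filled.
        int bufferCount = (vocabSize + vectorsPerBuffer - 1) / vectorsPerBuffer;
        DoubleBuffer[] buffers = new DoubleBuffer[bufferCount];
        for (int b = 0; b < bufferCount; b++) {
            int firstVector = b * vectorsPerBuffer;
            int vectorsHere = Math.min(vectorsPerBuffer, vocabSize - firstVector);
            buffers[b] = DoubleBuffer.allocate(vectorsHere * layerSize);
        }
        return buffers;
    }
}
```

With 5 vectors of 3 doubles each and at most 2 vectors per buffer, this yields three buffers of capacity 6, 6, and 3. When the per-buffer limit exceeds the vocabulary size, it yields a single buffer.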

Note on formatting: I used tabs throughout, as that seemed to be more common in the files; SearcherImpl mixed tabs and spaces. I also tried to keep for-loop spacing (for( vs. for () and naming consistent with the surrounding code, which was mixed as well.

dirkgr commented 9 years ago

Ping?

krishnad commented 9 years ago

This fork breaks in the toBinFile method:

    for(int i = 0; i < vocab.size(); ++i) {
        out.write(String.format("%s ", vocab.get(i)).getBytes(cs));

        DoubleBuffer vectorBuffer = vectors[i / vectorsPerBuffer];

vectors (note the plural) is a DoubleBuffer[] and is unlikely to have as many elements as the vocab. In most of my test cases, vectors is a DoubleBuffer array of length 1.
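For reference, the arithmetic behind an expression like `vectors[i / vectorsPerBuffer]` can be sketched in isolation. The class `VectorLookup` and its methods are hypothetical names for illustration, not code from the fork. The sketch assumes word `i` is stored in buffer `i / vectorsPerBuffer` at an offset of `(i % vectorsPerBuffer) * layerSize` doubles.

```java
public class VectorLookup {
    /** Index into the DoubleBuffer[] for word i (hypothetical helper). */
    static int bufferIndex(int i, int vectorsPerBuffer) {
        return i / vectorsPerBuffer;
    }

    /** Position of word i's first double inside its buffer (hypothetical helper). */
    static int offsetInBuffer(int i, int vectorsPerBuffer, int layerSize) {
        return (i % vectorsPerBuffer) * layerSize;
    }
}
```

Under that assumption, the buffer index stays in range for an array of length 1 only when `vectorsPerBuffer` is at least the vocab size. If the value used for lookup is smaller than the one used when the buffers were filled, `i / vectorsPerBuffer` reaches 1 or more and the access goes out of bounds.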