medallia / Word2VecJava

Word2Vec Java Port
MIT License

Large Bin File Error #28

Open stanlivshin opened 9 years ago

stanlivshin commented 9 years ago

DoubleBuffer vectors = ByteBuffer.allocateDirect(vocabSize * layerSize * 8).asDoubleBuffer();

This line was throwing an error: the int multiplication vocabSize * layerSize * 8 overflows once the product exceeds Integer.MAX_VALUE, so a negative number was passed into the method.

As a dirty fix I changed it to the following:

DoubleBuffer vectors = DoubleBuffer.allocate(1000000000);
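
For illustration, a minimal sketch of the overflow and the long-based alternative; the vocab and layer sizes here are hypothetical (roughly the Google News model), not taken from this issue:

// Hypothetical sizes, roughly the Google News model.
int vocabSize = 3_000_000;
int layerSize = 300;

// int arithmetic wraps around: 3_000_000 * 300 * 8 = 7_200_000_000 > Integer.MAX_VALUE.
int overflowed = vocabSize * layerSize * 8;        // -1389934592, a negative capacity

// Promoting to long before multiplying gives the real byte count.
long byteCount = (long) vocabSize * layerSize * 8; // 7_200_000_000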

wko27 commented 9 years ago

Hi, do you mind opening a pull request?

I'd suggest a more proper fix as:

long bufferSize = (long) vocabSize * layerSize * 8;
Preconditions.checkState(bufferSize <= Integer.MAX_VALUE, "Unable to allocate a buffer of size %s; vocab size is %s, layerSize is %s", bufferSize, vocabSize, layerSize);
DoubleBuffer vectors = ByteBuffer.allocateDirect((int) bufferSize).asDoubleBuffer();

jkinkead commented 9 years ago

I ran into this as well. Note that this will still only let you go as big as 16 GB worth of vectors, and you lose the memory mapping you get from calling allocateDirect. It might be better to shard the vectors into 1 or 2 GB direct byte buffers and let the model call into the correct one.
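
A minimal sketch of that sharding idea, assuming a hypothetical ShardedDoubleBuffer helper (not part of the library):

import java.nio.ByteBuffer;
import java.nio.DoubleBuffer;

// Hypothetical helper: splits the flat vector storage into ~1 GB direct buffers
// and routes each long index to the right shard.
class ShardedDoubleBuffer {
    private static final int DOUBLES_PER_SHARD = 1 << 27;  // 2^27 doubles = 1 GiB per shard

    private final DoubleBuffer[] shards;

    ShardedDoubleBuffer(long totalDoubles) {
        int shardCount = (int) ((totalDoubles + DOUBLES_PER_SHARD - 1) / DOUBLES_PER_SHARD);
        shards = new DoubleBuffer[shardCount];
        long remaining = totalDoubles;
        for (int i = 0; i < shardCount; i++) {
            int size = (int) Math.min(remaining, DOUBLES_PER_SHARD);
            shards[i] = ByteBuffer.allocateDirect(size * 8).asDoubleBuffer();
            remaining -= size;
        }
    }

    double get(long index) {
        return shards[(int) (index / DOUBLES_PER_SHARD)].get((int) (index % DOUBLES_PER_SHARD));
    }

    void put(long index, double value) {
        shards[(int) (index / DOUBLES_PER_SHARD)].put((int) (index % DOUBLES_PER_SHARD), value);
    }
}

A lookup for vector v, entry j would then be something like get((long) v * layerSize + j).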

@dirkgr FYI, a side effect of your efficiency fixes is that the max number of doubles becomes 2^28 - 1, or about 250 million. Google's Google News vector file contains 3 million vectors of 300 entries each, i.e. 900 million doubles, and can't be loaded by the new code.

jkinkead commented 9 years ago

I'm going to look into a fix for this.

dirkgr commented 9 years ago

Thanks for looking at it. Let me know if you want me to contribute in some way. The limit is the number of doubles you can put into a DoubleBuffer, right? Because Java can't map more than 2GB of memory at a time?

jkinkead commented 9 years ago

I don't know whether Java can't, but the ByteBuffer API only accepts an int... so obviously you're capped at Integer.MAX_VALUE for what you can build.
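
A small sketch of that cap, with a hypothetical guard (the sizes are illustrative):

// ByteBuffer.allocateDirect takes an int capacity, so a single buffer cannot exceed
// Integer.MAX_VALUE bytes (~2 GB), no matter how much memory is available.
long needed = 7_200_000_000L;  // e.g. 3M vectors * 300 dims * 8 bytes
if (needed > Integer.MAX_VALUE) {
    throw new IllegalStateException("Too large for one ByteBuffer: " + needed + " bytes");
}
ByteBuffer vectors = ByteBuffer.allocateDirect((int) needed);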

jkinkead commented 9 years ago

See PR #29 @wko27

scobrown commented 8 years ago

Seems like this would benefit from using nd4j; if nothing else, you could use their DoubleBuffer, which supports longs for the length: https://github.com/deeplearning4j/nd4j/blob/master/nd4j-buffer/src/main/java/org/nd4j/linalg/api/buffer/BaseDataBuffer.java

If there is interest, I could maybe try it out and submit a pull request. Not sure how you feel about adding that dependency.