medallia / Word2VecJava

Word2Vec Java Port
MIT License
186 stars 81 forks source link

Read from bin file #8

Closed yfpeng closed 9 years ago

yfpeng commented 9 years ago

This function assumes that the bin file created by the C word2vec tool uses little-endian to write floats into the bin file. Each float occupies 4 bytes (32 bits) specified by the IEEE 754 standard. To read the word2vec model from a bin file, you can

Word2VecModel model = Word2VecModel.fromBinFile(file);

Alternatively, there is another function that also allows you to specify the bye order of the bin file.

Word2VecModel model = Word2VecModel.fromBinFile(file, byteOrder);

wko27 commented 9 years ago

Hi there!

Fantastic work!

Do you mind making a few updates though before we merge it? Please feel free to ask questions if you have any :)

src/main/java/com/medallia/word2vec/Word2VecModel.java

The javadoc for the first should say:

"Forwards to {@link #fromBinFile(...)} with the default ByteOrder.LITTLE_ENDIAN".

The javadoc for the second method should say:

"@return {@link Word2VecModel} created from the binary representation output by the open source C version of word2vec using the given byte order"

It would also be nice to give a URL to the C open source project, but don't worry about it for now since Google Code is closing down ...

Please use a ternary statement

in-line the variable?

same here

This should be done in a finally block of a try statement.

int index = firstLine.indexOf(' ');

After this call, it's usually nice to ensure that the index isn't -1. Something like:

Preconditions.checkState(index != -1, "Expected a space in the first line of file '%s': '%s'", file.getAbsolutePath(), firstLine);

src/test/java/com/medallia/word2vec/Word2VecBinReaderTest.java

Excellent unit test! However, a few things:

Let's add a tiny bit of class-level javadoc indicating that it tests converting the binary models into {@link Word2VecModel}s. Also consider adding an @see to the actual Word2VecModel.fromBinFile method.

@Test

This method's javadoc should say something like

"Tests that the Word2VecModels created from a binary and text representations are equivalent"

If you notice in the util package, there's a Common.readResourceToStream. Let's create a Common.readResourceToFile and use it here.

The vocab argument isn't being used, so let's remove that. Also, let's rename testVector to assertVectorsEqual.

Cheers, Andrew

yfpeng commented 9 years ago

Done! Feel free to make more comments. BTW, you can try to use the "Files changed" tab above and comment in-line. I haven't used it before, but I guess it might be easier for me to track your comment location.

wko27 commented 9 years ago

Excellent work! Thanks for the updates, and the suggestion :)

Added a few comments to the files.

Cheers, Andrew

yfpeng commented 9 years ago

I cannot create a big-endian bin file using word2vec on my machine. I think all 80x86 processors from Intel® and their clones are little-endian. Only some Sun's and Motorala's machines, for example, are big-endian. Since Intel processors are quite dominant, shall we postpone the test case?

yfpeng commented 9 years ago

@wko27 Comments merged.