Closed yfpeng closed 9 years ago
Hi there!
Fantastic work!
Do you mind making a few updates though before we merge it? Please feel free to ask questions if you have any :)
src/main/java/com/medallia/word2vec/Word2VecModel.java
The javadoc for the first should say:
"Forwards to {@link #fromBinFile(...)} with the default ByteOrder.LITTLE_ENDIAN".
The javadoc for the second method should say:
"@return {@link Word2VecModel} created from the binary representation output by the open source C version of word2vec using the given byte order"
It would also be nice to give a URL to the C open source project, but don't worry about it for now since Google Code is closing down ...
Please use a ternary statement
in-line the variable?
same here
This should be done in a finally block of a try statement.
int index = firstLine.indexOf(' ');
After this call, it's usually nice to ensure that the index isn't -1. Something like:
Preconditions.checkState(index != -1, "Expected a space in the first line of file '%s': '%s'", file.getAbsolutePath(), firstLine);
src/test/java/com/medallia/word2vec/Word2VecBinReaderTest.java
Excellent unit test! However, a few things:
Let's add a tiny bit of class-level javadoc indicating that it tests converting the binary models into {@link Word2VecModel}s. Also consider adding an @see to the actual Word2VecModel.fromBinFile method.
@Test
This method's javadoc should say something like
"Tests that the Word2VecModels created from a binary and text representations are equivalent"
If you notice in the util package, there's a Common.readResourceToStream. Let's create a Common.readResourceToFile and use it here.
The vocab argument isn't being used, so let's remove that. Also, let's rename testVector to assertVectorsEqual.
Cheers, Andrew
Done! Feel free to make more comments. BTW, you can try to use the "Files changed" tab above and comment in-line. I haven't used it before, but I guess it might be easier for me to track your comment location.
Excellent work! Thanks for the updates, and the suggestion :)
Added a few comments to the files.
Cheers, Andrew
I cannot create a big-endian bin file using word2vec on my machine. I think all 80x86 processors from Intel® and their clones are little-endian. Only some Sun's and Motorala's machines, for example, are big-endian. Since Intel processors are quite dominant, shall we postpone the test case?
@wko27 Comments merged.
This function assumes that the bin file created by the C word2vec tool uses little-endian to write floats into the bin file. Each float occupies 4 bytes (32 bits) specified by the IEEE 754 standard. To read the word2vec model from a bin file, you can
Word2VecModel model = Word2VecModel.fromBinFile(file);
Alternatively, there is another function that also allows you to specify the bye order of the bin file.
Word2VecModel model = Word2VecModel.fromBinFile(file, byteOrder);