Closed hankcs closed 6 years ago
Hi hankcs! Thank you for your feedback.
I implemented this almost two years ago and have forgotten the details. I used Lucene in some places, and doing so required some compromises, so the results cannot be exactly the same. But as you kindly reported, the results were not good.
Thank you for your reply. This implementation is the best one in Java, since the others yield worse accuracy rates.
I will look into your early commits and compare them carefully with the original C code. There must be some difference.
Thanks again for your comment.
I don't think I can work on this issue proactively because I'm busy with my current project and don't have time, but I'm happy to help if you find a difference and ask me about it.
I'll do my best to remember why I implemented things differently, and I will improve them if possible.
Keep in touch!
Hi Mr. Sekiguchi,
After a long time, my friend @tiandiweizun and I have finally found the difference between this version and Google's. The scores differ because the two versions parse the command line differently.
When running with -hs 0, users want to turn HierarchicalSoftmax off, but your code actually activates it regardless of whether 0 or 1 follows -hs. This logic differs from Google's. After fixing it, we found that the two versions reach similar accuracy.
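To illustrate the difference, here is a minimal sketch. It is not the actual code from either project: the class `HsFlagSketch` and its method names are made up for this example, and it assumes the bug amounts to treating the mere presence of -hs as "on". Google's C code instead reads the integer that follows the flag.

```java
public class HsFlagSketch {

    // Returns the index of `flag` in args, or -1 if the flag is absent
    // or has no value after it. (Hypothetical helper for this sketch.)
    static int argPos(String flag, String[] args) {
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals(flag)) {
                return (i == args.length - 1) ? -1 : i;
            }
        }
        return -1;
    }

    // Assumed shape of the bug: the flag's presence alone enables
    // hierarchical softmax, so "-hs 0" and "-hs 1" behave the same.
    static boolean parseHsPresenceOnly(String[] args) {
        for (String arg : args) {
            if (arg.equals("-hs")) {
                return true;
            }
        }
        return false;
    }

    // Value-aware parsing: read the integer after -hs and treat
    // any non-zero value as "on". Falls back to defaultValue if
    // the flag is missing.
    static boolean parseHsByValue(String[] args, boolean defaultValue) {
        int i = argPos("-hs", args);
        if (i < 0) {
            return defaultValue;
        }
        return Integer.parseInt(args[i + 1]) != 0;
    }

    public static void main(String[] args) {
        String[] hsOff = {"-cbow", "0", "-hs", "0"};
        // Presence-only parsing ignores the 0 and turns HS on.
        System.out.println("presence-only: " + parseHsPresenceOnly(hsOff));
        // Value-aware parsing respects the 0 and leaves HS off.
        System.out.println("value-aware:   " + parseHsByValue(hsOff, false));
    }
}
```

With presence-only parsing, a run intended to use only negative sampling would also train with hierarchical softmax, which would explain why the two implementations scored differently under nominally identical parameters.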
I've submitted a pull request; please consider merging it at your convenience.
Thank you.
Hello, kojisekig. Thank you for your nice Java code. This is the closest version to Google's original C version. However, I computed the accuracy rate, and it is 10% lower than that of the original version. I trained on text8 with exactly the same parameters, which are:
Note that I used your com.rondhuit.w2v.Text8Splitter to split text8 into multiple lines. I don't think this affects the result, since MAX_WORDS is 1000 in both implementations.
Then I translated compute-accuracy.c from Google's C code into Java and ran the test with the same parameters:
The result is really surprising. Your Java implementation:
Google's C implementation:
Can you give me any suggestions or ideas about this? I am ready to help you if needed. I think we both want to make this the best Java word2vec!
You can re-run this test after merging my pull request https://github.com/kojisekig/word2vec-lucene/pull/20 .
Thank you!