kojisekig / word2vec-lucene

This tool extracts word vectors from Lucene index.
Apache License 2.0

Accuracy rate seems to be 10% lower than the original version #21

Closed hankcs closed 6 years ago

hankcs commented 8 years ago

Hello, kojisekig. Thank you for your nice Java code. Of the Java ports I have seen, this is the closest to Google's original C version. However, when I computed the accuracy, it was about 10% lower than the original. I trained both on text8 with exactly the same parameters:

com.rondhuit.w2v.demo.TextFileCreateVectors -input text8.txt -output vectors.txt -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 8 -binary 1 -iter 15
./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 8 -binary 1 -iter 15

Note that I used your com.rondhuit.w2v.Text8Splitter to split text8 into multiple lines. I don't think this affects the result, since MAX_WORDS is 1000 in both implementations.

Then I translated compute-accuracy.c from Google's C code into Java and ran the test with the same parameters:

com.rondhuit.w2v.demo.ComputeAccuracy vectors.txt 30000 questions-words.txt
./compute-accuracy vectors.bin 30000 < questions-words.txt
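For readers unfamiliar with the test, here is a minimal sketch of how a single analogy question (a : b :: c : ?) is typically scored: the word vectors are normalized to unit length, the query b - a + c is formed, and the nearest remaining word by dot product is taken as the prediction. The class name, method names, and toy vectors below are hypothetical and exist only for illustration; the real tool also restricts candidates to the most frequent words (30000 in the commands above) and skips questions containing unknown words.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AnalogySketch {

    /** Scales a vector to unit length so that a dot product equals cosine similarity. */
    static double[] normalize(double[] v) {
        double norm = 0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
        return out;
    }

    /** Answers "a is to b as c is to ?" with the word closest to b - a + c. */
    static String predict(Map<String, double[]> vectors, String a, String b, String c) {
        double[] va = vectors.get(a), vb = vectors.get(b), vc = vectors.get(c);
        double[] query = new double[va.length];
        for (int i = 0; i < query.length; i++) query[i] = vb[i] - va[i] + vc[i];

        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : vectors.entrySet()) {
            String w = e.getKey();
            // The three question words are never allowed as the answer.
            if (w.equals(a) || w.equals(b) || w.equals(c)) continue;
            double score = 0;
            double[] cand = e.getValue();
            for (int i = 0; i < query.length; i++) score += query[i] * cand[i];
            if (score > bestScore) { bestScore = score; best = w; }
        }
        return best;
    }

    /** Hand-made 2-d vectors (axis 0: "royalty", axis 1: "gender"); not real embeddings. */
    static Map<String, double[]> toyVectors() {
        Map<String, double[]> m = new LinkedHashMap<>();
        m.put("man", normalize(new double[] {0.1, 1.0}));
        m.put("woman", normalize(new double[] {0.1, -1.0}));
        m.put("king", normalize(new double[] {1.0, 1.0}));
        m.put("queen", normalize(new double[] {1.0, -1.0}));
        m.put("apple", normalize(new double[] {-1.0, 0.0}));
        return m;
    }

    public static void main(String[] args) {
        System.out.println(predict(toyVectors(), "man", "woman", "king"));
    }
}
```

A question counts as correct only when the predicted word equals the expected fourth word, which is what the TOP1 figures in the logs measure.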

The result is really surprising. Your Java implementation:

CAPITAL-COMMON-COUNTRIES:
ACCURACY TOP1: 71.15 %  (360 / 506)
Total accuracy: 71.15 %   Semantic accuracy: 71.15 %   Syntactic accuracy: NaN % 
CAPITAL-WORLD:
ACCURACY TOP1: 46.42 %  (674 / 1452)
Total accuracy: 52.81 %   Semantic accuracy: 52.81 %   Syntactic accuracy: NaN % 
CURRENCY:
ACCURACY TOP1: 4.48 %  (12 / 268)
Total accuracy: 46.99 %   Semantic accuracy: 46.99 %   Syntactic accuracy: NaN % 
CITY-IN-STATE:
ACCURACY TOP1: 41.37 %  (650 / 1571)
Total accuracy: 44.67 %   Semantic accuracy: 44.67 %   Syntactic accuracy: NaN % 
FAMILY:
ACCURACY TOP1: 45.42 %  (139 / 306)
Total accuracy: 44.72 %   Semantic accuracy: 44.72 %   Syntactic accuracy: NaN % 
GRAM1-ADJECTIVE-TO-ADVERB:
ACCURACY TOP1: 10.32 %  (78 / 756)
Total accuracy: 39.37 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 10.32 % 
GRAM2-OPPOSITE:
ACCURACY TOP1: 13.40 %  (41 / 306)
Total accuracy: 37.83 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 11.21 % 
GRAM3-COMPARATIVE:
ACCURACY TOP1: 42.46 %  (535 / 1260)
Total accuracy: 38.74 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 28.17 % 
GRAM4-SUPERLATIVE:
ACCURACY TOP1: 18.38 %  (93 / 506)
Total accuracy: 37.25 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 26.41 % 
GRAM5-PRESENT-PARTICIPLE:
ACCURACY TOP1: 26.31 %  (261 / 992)
Total accuracy: 35.88 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 26.39 % 
GRAM6-NATIONALITY-ADJECTIVE:
ACCURACY TOP1: 75.13 %  (1030 / 1371)
Total accuracy: 41.67 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 39.26 % 
GRAM7-PAST-TENSE:
ACCURACY TOP1: 31.53 %  (420 / 1332)
Total accuracy: 40.40 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 37.68 % 
GRAM8-PLURAL:
ACCURACY TOP1: 61.09 %  (606 / 992)
Total accuracy: 42.17 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 40.77 % 
GRAM9-PLURAL-VERBS:
ACCURACY TOP1: 20.62 %  (134 / 650)
Total accuracy: 41.03 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 39.17 % 
Questions seen / total: 12268 19544   62.77 % 

Google's C implementation:

capital-common-countries:
ACCURACY TOP1: 82.81 %  (419 / 506)
Total accuracy: 82.81 %   Semantic accuracy: 82.81 %   Syntactic accuracy: nan % 
capital-world:
ACCURACY TOP1: 62.26 %  (904 / 1452)
Total accuracy: 67.57 %   Semantic accuracy: 67.57 %   Syntactic accuracy: nan % 
currency:
ACCURACY TOP1: 23.13 %  (62 / 268)
Total accuracy: 62.22 %   Semantic accuracy: 62.22 %   Syntactic accuracy: nan % 
city-in-state:
ACCURACY TOP1: 44.68 %  (702 / 1571)
Total accuracy: 54.96 %   Semantic accuracy: 54.96 %   Syntactic accuracy: nan % 
family:
ACCURACY TOP1: 75.82 %  (232 / 306)
Total accuracy: 56.52 %   Semantic accuracy: 56.52 %   Syntactic accuracy: nan % 
gram1-adjective-to-adverb:
ACCURACY TOP1: 17.20 %  (130 / 756)
Total accuracy: 50.40 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 17.20 % 
gram2-opposite:
ACCURACY TOP1: 21.90 %  (67 / 306)
Total accuracy: 48.71 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 18.55 % 
gram3-comparative:
ACCURACY TOP1: 64.60 %  (814 / 1260)
Total accuracy: 51.83 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 43.54 % 
gram4-superlative:
ACCURACY TOP1: 39.72 %  (201 / 506)
Total accuracy: 50.95 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 42.86 % 
gram5-present-participle:
ACCURACY TOP1: 39.52 %  (392 / 992)
Total accuracy: 49.51 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 41.99 % 
gram6-nationality-adjective:
ACCURACY TOP1: 87.24 %  (1196 / 1371)
Total accuracy: 55.08 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 53.94 % 
gram7-past-tense:
ACCURACY TOP1: 38.21 %  (509 / 1332)
Total accuracy: 52.96 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 50.73 % 
gram8-plural:
ACCURACY TOP1: 67.54 %  (670 / 992)
Total accuracy: 54.21 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 52.95 % 
gram9-plural-verbs:
ACCURACY TOP1: 37.38 %  (243 / 650)
Total accuracy: 53.32 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 51.71 % 
Questions seen / total: 12268 19544   62.77 %
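A note on reading the two logs: the "Total accuracy" column appears to be a running figure over all sections printed so far, not a per-section value. In the Java log, for example, the value after CAPITAL-WORLD is (360 + 674) / (506 + 1452) = 52.81 %. The sketch below, whose class and field names are hypothetical, recomputes the final totals from the per-section counts copied from the logs above: 5033 / 12268 for the Java version and 6541 / 12268 for the C version.

```java
public class RunningAccuracy {

    // Questions seen per section, in log order (identical in both runs).
    static final int[] SEEN =
        {506, 1452, 268, 1571, 306, 756, 306, 1260, 506, 992, 1371, 1332, 992, 650};

    // Correct TOP1 answers per section, copied from the two logs.
    static final int[] JAVA_CORRECT =
        {360, 674, 12, 650, 139, 78, 41, 535, 93, 261, 1030, 420, 606, 134};
    static final int[] C_CORRECT =
        {419, 904, 62, 702, 232, 130, 67, 814, 201, 392, 1196, 509, 670, 243};

    /** Overall accuracy in percent: all correct answers over all questions seen. */
    static double totalAccuracy(int[] correct, int[] seen) {
        long c = 0, s = 0;
        for (int i = 0; i < correct.length; i++) {
            c += correct[i];
            s += seen[i];
        }
        return 100.0 * c / s;
    }

    public static void main(String[] args) {
        System.out.printf("Java total: %.2f %%%n", totalAccuracy(JAVA_CORRECT, SEEN));
        System.out.printf("C total:    %.2f %%%n", totalAccuracy(C_CORRECT, SEEN));
    }
}
```

These totals match the last "Total accuracy" line of each log (41.03 % and 53.32 %), so the overall gap in this run is a little over 12 percentage points.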

Can you give me any suggestions or ideas about this? I am ready to help you if needed. I think we both want to make this the best Java word2vec!

You can re-run this test after merging my pull request https://github.com/kojisekig/word2vec-lucene/pull/20 .

Thank you!

kojisekig commented 8 years ago

Hi hankcs! Thank you for your feedback.

I implemented this almost two years ago and have forgotten the details. I used Lucene in some places, and in doing so I made some compromises, so the results cannot be exactly the same. But as you kindly reported, the results are not good.

hankcs commented 8 years ago

Thank you for your reply. This implementation is the best one in Java, since the others yield worse accuracy rates.

I will look into your early commits and compare them carefully with the original C code. There must be some difference.

kojisekig commented 8 years ago

Thanks for your comment again.

I don't think I can take proactive action on this issue because I'm busy with my current project and don't have time, but I'm happy to help if you find a difference and ask me about it.

I'll do my best to remember why I implemented things differently, and will improve them if possible.

Keep in touch!

hankcs commented 6 years ago

Hi Mr. Sekiguchi,

After a long time, my friend @tiandiweizun and I finally found the difference between this version and Google's. The scores differ because the command-line parsing logic is different.

When passing -hs 0, users want to turn hierarchical softmax off, but your code actually activates it regardless of whether 0 or 1 follows -hs. This differs from Google's logic. After fixing it, we found that the two versions achieve similar accuracy.
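To illustrate the kind of mismatch described here, the sketch below contrasts two ways of reading the flag. It is not the project's actual parser: the class and method names are hypothetical. The first method treats the mere presence of -hs as "on", which reproduces the reported behavior; the second reads the integer that follows the flag and enables hierarchical softmax only when it is non-zero, which is how Google's C tool treats -hs.

```java
public class HsFlagSketch {

    /** Returns the index of the flag in args, or -1 if it is absent. */
    static int argPos(String flag, String[] args) {
        for (int i = 0; i < args.length; i++) {
            if (flag.equals(args[i])) return i;
        }
        return -1;
    }

    /** Presence-based parsing: "-hs 0" and "-hs 1" both enable hierarchical softmax. */
    static boolean hsByPresence(String[] args) {
        return argPos("-hs", args) >= 0;
    }

    /** Value-based parsing: the number after -hs decides; the default is off. */
    static boolean hsByValue(String[] args) {
        int i = argPos("-hs", args);
        if (i < 0 || i + 1 >= args.length) return false;
        return Integer.parseInt(args[i + 1]) != 0;
    }

    public static void main(String[] args) {
        String[] cli = {"-cbow", "1", "-negative", "25", "-hs", "0"};
        System.out.println("presence-based: hs=" + hsByPresence(cli));
        System.out.println("value-based:    hs=" + hsByValue(cli));
    }
}
```

With the command line from the first comment (-negative 25 -hs 0), presence-based parsing would train with both negative sampling and hierarchical softmax switched on, so the two runs were not using the same objective even though the flags looked identical.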

I've submitted a pull request; please consider merging it at your convenience.

Thank you.