medallia / Word2VecJava

Word2Vec Java Port
MIT License

Accuracy rate seems to be 20% lower than the original C version #40

Open hankcs opened 8 years ago

hankcs commented 8 years ago

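Hello, dear medallia staff. Thank you for your nice Java code. It is beautiful and neat, but it does not seem to be precise.

I measured the analogy accuracy, and it is about 20 percentage points lower than with the original C version (33.53% vs. 53.32% total accuracy). I trained both on text8 with the same training parameters apart from the thread count (20 for Java, 8 for C), which are: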

Java

File f = new File("text8");
if (!f.exists())
    throw new IllegalStateException("Please download and unzip the text8 example from http://mattmahoney.net/dc/text8.zip");
List<String> read = Common.readToList(f);
List<List<String>> partitioned = Lists.transform(read, new Function<String, List<String>>() {
    @Override
    public List<String> apply(String input) {
        return Arrays.asList(input.split(" "));
    }
});

Word2VecModel model = Word2VecModel.trainer()
        .setMinVocabFrequency(5)
        .useNumThreads(20)
        .setWindowSize(8)
        .type(NeuralNetworkType.CBOW)
        .setLayerSize(200)
        .useNegativeSamples(25)
        .setDownSamplingRate(1e-4)
        .setNumIterations(15)
        .setListener(new TrainingProgressListener() {
            @Override public void update(Stage stage, double progress) {
                System.out.println(String.format("%s is %.2f%% complete", Format.formatEnum(stage), progress * 100));
            }
        })
        .train(partitioned);

try (final OutputStream os = Files.newOutputStream(Paths.get("vectors.bin"))) {
    model.toBinFile(os);
}

C

./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 8 -binary 1 -iter 15

I evaluated both models with the same evaluation program and the same test file:

./compute-accuracy vectors.bin 30000 < questions-words.txt
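
For readers unfamiliar with the evaluation, here is a rough sketch of how I understand compute-accuracy to score one question of the form "a b c d": it normalizes the vectors, forms b - a + c, and counts the question as correct if the closest remaining word (by dot product) is d. The 30000 argument restricts the search to the most frequent 30000 vocabulary words, which is why only part of the questions are "seen". The class name, method names, and toy 3-dimensional vectors below are made up for illustration and are not taken from either implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AnalogySketch {

    /** Scales a vector to unit length so that dot products equal cosine similarities. */
    static double[] normalize(double[] v) {
        double norm = 0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
        return out;
    }

    /** Answers "a is to b as c is to ?" with the word closest to b - a + c. */
    static String predict(Map<String, double[]> vectors, String a, String b, String c) {
        double[] va = vectors.get(a), vb = vectors.get(b), vc = vectors.get(c);
        double[] query = new double[va.length];
        for (int i = 0; i < query.length; i++) query[i] = vb[i] - va[i] + vc[i];

        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : vectors.entrySet()) {
            String w = e.getKey();
            // The three question words are never allowed as the answer.
            if (w.equals(a) || w.equals(b) || w.equals(c)) continue;
            double[] v = e.getValue();
            double dot = 0;
            for (int i = 0; i < v.length; i++) dot += query[i] * v[i];
            if (dot > bestScore) { bestScore = dot; best = w; }
        }
        return best;
    }

    /** Hypothetical toy vocabulary; real models here use 200 dimensions. */
    static Map<String, double[]> toyVectors() {
        Map<String, double[]> m = new LinkedHashMap<>();
        m.put("paris", normalize(new double[] {1.0, 0.0, 1.0}));
        m.put("france", normalize(new double[] {1.0, 0.0, 0.0}));
        m.put("rome", normalize(new double[] {0.0, 1.0, 1.0}));
        m.put("italy", normalize(new double[] {0.0, 1.0, 0.0}));
        m.put("berlin", normalize(new double[] {1.0, 1.0, 1.0}));
        return m;
    }

    public static void main(String[] args) {
        // Question "paris france rome italy": the prediction should be "italy".
        System.out.println(predict(toyVectors(), "paris", "france", "rome"));
    }
}
```

Because the score depends only on the direction of the stored vectors, a drop this large points at the training side rather than at the evaluation.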

Your Java implementation:

capital-common-countries:
ACCURACY TOP1: 58.30 %  (295 / 506)
Total accuracy: 58.30 %   Semantic accuracy: 58.30 %   Syntactic accuracy: nan % 
capital-world:
ACCURACY TOP1: 36.78 %  (534 / 1452)
Total accuracy: 42.34 %   Semantic accuracy: 42.34 %   Syntactic accuracy: nan % 
currency:
ACCURACY TOP1: 12.69 %  (34 / 268)
Total accuracy: 38.77 %   Semantic accuracy: 38.77 %   Syntactic accuracy: nan % 
city-in-state:
ACCURACY TOP1: 25.21 %  (396 / 1571)
Total accuracy: 33.16 %   Semantic accuracy: 33.16 %   Syntactic accuracy: nan % 
family:
ACCURACY TOP1: 55.23 %  (169 / 306)
Total accuracy: 34.80 %   Semantic accuracy: 34.80 %   Syntactic accuracy: nan % 
gram1-adjective-to-adverb:
ACCURACY TOP1: 8.07 %  (61 / 756)
Total accuracy: 30.64 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 8.07 % 
gram2-opposite:
ACCURACY TOP1: 9.48 %  (29 / 306)
Total accuracy: 29.39 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 8.47 % 
gram3-comparative:
ACCURACY TOP1: 38.25 %  (482 / 1260)
Total accuracy: 31.13 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 24.63 % 
gram4-superlative:
ACCURACY TOP1: 23.91 %  (121 / 506)
Total accuracy: 30.60 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 24.50 % 
gram5-present-participle:
ACCURACY TOP1: 22.08 %  (219 / 992)
Total accuracy: 29.53 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 23.87 % 
gram6-nationality-adjective:
ACCURACY TOP1: 63.17 %  (866 / 1371)
Total accuracy: 34.50 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 34.25 % 
gram7-past-tense:
ACCURACY TOP1: 26.35 %  (351 / 1332)
Total accuracy: 33.47 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 32.64 % 
gram8-plural:
ACCURACY TOP1: 44.25 %  (439 / 992)
Total accuracy: 34.39 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 34.17 % 
gram9-plural-verbs:
ACCURACY TOP1: 18.15 %  (118 / 650)
Total accuracy: 33.53 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 32.90 % 
Questions seen / total: 12268 19544   62.77 % 

Original C implementation:

capital-common-countries:
ACCURACY TOP1: 82.81 %  (419 / 506)
Total accuracy: 82.81 %   Semantic accuracy: 82.81 %   Syntactic accuracy: nan % 
capital-world:
ACCURACY TOP1: 62.26 %  (904 / 1452)
Total accuracy: 67.57 %   Semantic accuracy: 67.57 %   Syntactic accuracy: nan % 
currency:
ACCURACY TOP1: 23.13 %  (62 / 268)
Total accuracy: 62.22 %   Semantic accuracy: 62.22 %   Syntactic accuracy: nan % 
city-in-state:
ACCURACY TOP1: 44.68 %  (702 / 1571)
Total accuracy: 54.96 %   Semantic accuracy: 54.96 %   Syntactic accuracy: nan % 
family:
ACCURACY TOP1: 75.82 %  (232 / 306)
Total accuracy: 56.52 %   Semantic accuracy: 56.52 %   Syntactic accuracy: nan % 
gram1-adjective-to-adverb:
ACCURACY TOP1: 17.20 %  (130 / 756)
Total accuracy: 50.40 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 17.20 % 
gram2-opposite:
ACCURACY TOP1: 21.90 %  (67 / 306)
Total accuracy: 48.71 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 18.55 % 
gram3-comparative:
ACCURACY TOP1: 64.60 %  (814 / 1260)
Total accuracy: 51.83 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 43.54 % 
gram4-superlative:
ACCURACY TOP1: 39.72 %  (201 / 506)
Total accuracy: 50.95 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 42.86 % 
gram5-present-participle:
ACCURACY TOP1: 39.52 %  (392 / 992)
Total accuracy: 49.51 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 41.99 % 
gram6-nationality-adjective:
ACCURACY TOP1: 87.24 %  (1196 / 1371)
Total accuracy: 55.08 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 53.94 % 
gram7-past-tense:
ACCURACY TOP1: 38.21 %  (509 / 1332)
Total accuracy: 52.96 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 50.73 % 
gram8-plural:
ACCURACY TOP1: 67.54 %  (670 / 992)
Total accuracy: 54.21 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 52.95 % 
gram9-plural-verbs:
ACCURACY TOP1: 37.38 %  (243 / 650)
Total accuracy: 53.32 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 51.71 % 
Questions seen / total: 12268 19544   62.77 %
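
As a sanity check on the headline figure, the per-category counts from the two outputs above can be summed to reproduce the final "Total accuracy" values. This is only a small sketch; the class and field names are my own, and the numbers are copied from the ACCURACY TOP1 lines.

```java
import java.util.Locale;

public class AccuracyGap {

    // Questions seen per category (identical in both runs), in output order.
    static final int[] SEEN =
            {506, 1452, 268, 1571, 306, 756, 306, 1260, 506, 992, 1371, 1332, 992, 650};
    // Correct answers per category for the Java port.
    static final int[] JAVA_CORRECT =
            {295, 534, 34, 396, 169, 61, 29, 482, 121, 219, 866, 351, 439, 118};
    // Correct answers per category for the original C version.
    static final int[] C_CORRECT =
            {419, 904, 62, 702, 232, 130, 67, 814, 201, 392, 1196, 509, 670, 243};

    /** Percentage of correct answers over all questions seen. */
    static double totalAccuracy(int[] correct, int[] seen) {
        int correctSum = 0, seenSum = 0;
        for (int i = 0; i < correct.length; i++) {
            correctSum += correct[i];
            seenSum += seen[i];
        }
        return 100.0 * correctSum / seenSum;
    }

    public static void main(String[] args) {
        double java = totalAccuracy(JAVA_CORRECT, SEEN);
        double c = totalAccuracy(C_CORRECT, SEEN);
        System.out.printf(Locale.ROOT,
                "Java: %.2f %%, C: %.2f %%, gap: %.2f points%n", java, c, c - java);
    }
}
```

The sums are 4114 and 6541 correct answers out of 12268 questions, so the gap is a little under 20 percentage points and shows up in every category.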

Do you have any suggestions or ideas about what could cause this? I am happy to help if needed.

Thank you.