Open hankcs opened 8 years ago
Hello, dear Medallia staff. Thank you for your nice Java code. It is beautiful and neat, but it does not seem to be precise.
I computed the accuracy, and it is about 20 percentage points lower than the original C version (33.53 % vs. 53.32 % total accuracy). I trained on text8 with the same parameters apart from the thread count (20 for Java, 8 for C), which are:
Java
File f = new File("text8");
if (!f.exists())
    throw new IllegalStateException("Please download and unzip the text8 example from http://mattmahoney.net/dc/text8.zip");
List<String> read = Common.readToList(f);
List<List<String>> partitioned = Lists.transform(read, new Function<String, List<String>>() {
    @Override
    public List<String> apply(String input) {
        return Arrays.asList(input.split(" "));
    }
});

Word2VecModel model = Word2VecModel.trainer()
        .setMinVocabFrequency(5)
        .useNumThreads(20)
        .setWindowSize(8)
        .type(NeuralNetworkType.CBOW)
        .setLayerSize(200)
        .useNegativeSamples(25)
        .setDownSamplingRate(1e-4)
        .setNumIterations(15)
        .setListener(new TrainingProgressListener() {
            @Override
            public void update(Stage stage, double progress) {
                System.out.println(String.format("%s is %.2f%% complete", Format.formatEnum(stage), progress * 100));
            }
        })
        .train(partitioned);

try (final OutputStream os = Files.newOutputStream(Paths.get("vectors.bin"))) {
    model.toBinFile(os);
}
C
./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 8 -binary 1 -iter 15
I used the same evaluation program and test file for both models:
./compute-accuracy vectors.bin 30000 < questions-words.txt
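For context, each question in questions-words.txt has four words `a b c d`, and a question counts as correct when the word nearest to vec(b) - vec(a) + vec(c) is `d`. Below is a minimal Java sketch of that check. The class name `AnalogyCheck`, the helper `toyVectors`, and the tiny three-dimensional vocabulary are made up for illustration; as far as I understand, compute-accuracy also length-normalizes every vector first, which this sketch skips by ranking on cosine similarity instead.

```java
import java.util.HashMap;
import java.util.Map;

public class AnalogyCheck {

    /** Cosine similarity between two vectors of equal length. */
    static double cosine(double[] x, double[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            nx += x[i] * x[i];
            ny += y[i] * y[i];
        }
        return dot / (Math.sqrt(nx) * Math.sqrt(ny));
    }

    /** Returns the word closest to vec(b) - vec(a) + vec(c), skipping a, b and c themselves. */
    static String predict(Map<String, double[]> vectors, String a, String b, String c) {
        double[] va = vectors.get(a), vb = vectors.get(b), vc = vectors.get(c);
        double[] target = new double[va.length];
        for (int i = 0; i < target.length; i++) {
            target[i] = vb[i] - va[i] + vc[i];
        }
        String best = null;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : vectors.entrySet()) {
            String w = e.getKey();
            if (w.equals(a) || w.equals(b) || w.equals(c)) {
                continue; // the three question words are never valid answers
            }
            double sim = cosine(target, e.getValue());
            if (sim > bestSim) {
                bestSim = sim;
                best = w;
            }
        }
        return best;
    }

    /** Hypothetical toy vocabulary, not taken from any trained model. */
    static Map<String, double[]> toyVectors() {
        Map<String, double[]> v = new HashMap<String, double[]>();
        v.put("man", new double[] {1, 0, 0});
        v.put("woman", new double[] {0, 1, 0});
        v.put("king", new double[] {1, 0, 1});
        v.put("queen", new double[] {0, 1, 1});
        v.put("apple", new double[] {0.5, 0.5, -1});
        return v;
    }

    public static void main(String[] args) {
        // Question "man king woman queen": correct if the prediction equals "queen".
        String predicted = predict(toyVectors(), "man", "king", "woman");
        System.out.println(predicted + " -> " + ("queen".equals(predicted) ? "correct" : "wrong"));
    }
}
```

Because both models are scored by the same check on the same questions, any difference in the numbers should come from the vectors themselves rather than from the evaluation.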
Your Java implementation:
capital-common-countries:
ACCURACY TOP1: 58.30 % (295 / 506)
Total accuracy: 58.30 % Semantic accuracy: 58.30 % Syntactic accuracy: nan %
capital-world:
ACCURACY TOP1: 36.78 % (534 / 1452)
Total accuracy: 42.34 % Semantic accuracy: 42.34 % Syntactic accuracy: nan %
currency:
ACCURACY TOP1: 12.69 % (34 / 268)
Total accuracy: 38.77 % Semantic accuracy: 38.77 % Syntactic accuracy: nan %
city-in-state:
ACCURACY TOP1: 25.21 % (396 / 1571)
Total accuracy: 33.16 % Semantic accuracy: 33.16 % Syntactic accuracy: nan %
family:
ACCURACY TOP1: 55.23 % (169 / 306)
Total accuracy: 34.80 % Semantic accuracy: 34.80 % Syntactic accuracy: nan %
gram1-adjective-to-adverb:
ACCURACY TOP1: 8.07 % (61 / 756)
Total accuracy: 30.64 % Semantic accuracy: 34.80 % Syntactic accuracy: 8.07 %
gram2-opposite:
ACCURACY TOP1: 9.48 % (29 / 306)
Total accuracy: 29.39 % Semantic accuracy: 34.80 % Syntactic accuracy: 8.47 %
gram3-comparative:
ACCURACY TOP1: 38.25 % (482 / 1260)
Total accuracy: 31.13 % Semantic accuracy: 34.80 % Syntactic accuracy: 24.63 %
gram4-superlative:
ACCURACY TOP1: 23.91 % (121 / 506)
Total accuracy: 30.60 % Semantic accuracy: 34.80 % Syntactic accuracy: 24.50 %
gram5-present-participle:
ACCURACY TOP1: 22.08 % (219 / 992)
Total accuracy: 29.53 % Semantic accuracy: 34.80 % Syntactic accuracy: 23.87 %
gram6-nationality-adjective:
ACCURACY TOP1: 63.17 % (866 / 1371)
Total accuracy: 34.50 % Semantic accuracy: 34.80 % Syntactic accuracy: 34.25 %
gram7-past-tense:
ACCURACY TOP1: 26.35 % (351 / 1332)
Total accuracy: 33.47 % Semantic accuracy: 34.80 % Syntactic accuracy: 32.64 %
gram8-plural:
ACCURACY TOP1: 44.25 % (439 / 992)
Total accuracy: 34.39 % Semantic accuracy: 34.80 % Syntactic accuracy: 34.17 %
gram9-plural-verbs:
ACCURACY TOP1: 18.15 % (118 / 650)
Total accuracy: 33.53 % Semantic accuracy: 34.80 % Syntactic accuracy: 32.90 %
Questions seen / total: 12268 19544 62.77 %
Original C implementation:
capital-common-countries:
ACCURACY TOP1: 82.81 % (419 / 506)
Total accuracy: 82.81 % Semantic accuracy: 82.81 % Syntactic accuracy: nan %
capital-world:
ACCURACY TOP1: 62.26 % (904 / 1452)
Total accuracy: 67.57 % Semantic accuracy: 67.57 % Syntactic accuracy: nan %
currency:
ACCURACY TOP1: 23.13 % (62 / 268)
Total accuracy: 62.22 % Semantic accuracy: 62.22 % Syntactic accuracy: nan %
city-in-state:
ACCURACY TOP1: 44.68 % (702 / 1571)
Total accuracy: 54.96 % Semantic accuracy: 54.96 % Syntactic accuracy: nan %
family:
ACCURACY TOP1: 75.82 % (232 / 306)
Total accuracy: 56.52 % Semantic accuracy: 56.52 % Syntactic accuracy: nan %
gram1-adjective-to-adverb:
ACCURACY TOP1: 17.20 % (130 / 756)
Total accuracy: 50.40 % Semantic accuracy: 56.52 % Syntactic accuracy: 17.20 %
gram2-opposite:
ACCURACY TOP1: 21.90 % (67 / 306)
Total accuracy: 48.71 % Semantic accuracy: 56.52 % Syntactic accuracy: 18.55 %
gram3-comparative:
ACCURACY TOP1: 64.60 % (814 / 1260)
Total accuracy: 51.83 % Semantic accuracy: 56.52 % Syntactic accuracy: 43.54 %
gram4-superlative:
ACCURACY TOP1: 39.72 % (201 / 506)
Total accuracy: 50.95 % Semantic accuracy: 56.52 % Syntactic accuracy: 42.86 %
gram5-present-participle:
ACCURACY TOP1: 39.52 % (392 / 992)
Total accuracy: 49.51 % Semantic accuracy: 56.52 % Syntactic accuracy: 41.99 %
gram6-nationality-adjective:
ACCURACY TOP1: 87.24 % (1196 / 1371)
Total accuracy: 55.08 % Semantic accuracy: 56.52 % Syntactic accuracy: 53.94 %
gram7-past-tense:
ACCURACY TOP1: 38.21 % (509 / 1332)
Total accuracy: 52.96 % Semantic accuracy: 56.52 % Syntactic accuracy: 50.73 %
gram8-plural:
ACCURACY TOP1: 67.54 % (670 / 992)
Total accuracy: 54.21 % Semantic accuracy: 56.52 % Syntactic accuracy: 52.95 %
gram9-plural-verbs:
ACCURACY TOP1: 37.38 % (243 / 650)
Total accuracy: 53.32 % Semantic accuracy: 56.52 % Syntactic accuracy: 51.71 %
Questions seen / total: 12268 19544 62.77 %
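The overall gap can be recomputed from the per-category counts in the two outputs above. The sketch below only sums the "(correct / seen)" pairs reported by compute-accuracy; the class name `AccuracyGap` and its helper methods are hypothetical names chosen for this example.

```java
import java.util.Locale;

public class AccuracyGap {

    // Correct answers per category, copied from the "ACCURACY TOP1" lines above,
    // in the order capital-common-countries ... gram9-plural-verbs.
    static final int[] JAVA_CORRECT = {295, 534, 34, 396, 169, 61, 29, 482, 121, 219, 866, 351, 439, 118};
    static final int[] C_CORRECT = {419, 904, 62, 702, 232, 130, 67, 814, 201, 392, 1196, 509, 670, 243};

    // "Questions seen" as reported on the last line of both outputs.
    static final int QUESTIONS_SEEN = 12268;

    static int sum(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    static double percent(int correct, int total) {
        return 100.0 * correct / total;
    }

    public static void main(String[] args) {
        double javaAcc = percent(sum(JAVA_CORRECT), QUESTIONS_SEEN);
        double cAcc = percent(sum(C_CORRECT), QUESTIONS_SEEN);
        System.out.println(String.format(Locale.ROOT, "Java: %.2f %%", javaAcc));
        System.out.println(String.format(Locale.ROOT, "C:    %.2f %%", cAcc));
        System.out.println(String.format(Locale.ROOT, "Gap:  %.2f percentage points", cAcc - javaAcc));
    }
}
```

The sums are 4114 and 6541 correct answers out of 12268 questions, i.e. 33.53 % and 53.32 %, which matches the final "Total accuracy" values and gives a gap of about 19.78 percentage points.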
Can you give me any suggestions or ideas about this? I am ready to help you if needed.
Thank you.