mila-iqia / blocks-examples

Examples and scripts using Blocks
MIT License

What is the target BLEU score for machine translation #90

Open zhangtemplar opened 8 years ago

zhangtemplar commented 8 years ago

Hi,

I am running the machine translation example; however, the best BLEU score it reports is 7.71 after more than 400,000 training iterations. The paper "Neural Machine Translation by Jointly Learning to Align and Translate" reports 17.82 or even 26.75.

Any idea on that?

papar22 commented 8 years ago

The results reported in the mentioned paper are BLEU scores of the trained models computed on the test set for the English-to-French language pair. As you can see in prepare_data.py here, the language pair is Cs-En. Obviously, a different language pair leads to different results. If you want to replicate the results in the paper, you have to ensure all of your parameters and settings are the same.

zhangtemplar commented 8 years ago

@papar22 yes, but I also checked the website hosting the data (http://www.statmt.org/wmt15/translation-task.html), which has BLEU statistics for all language pairs over all datasets. Cs-to-En still scores over 25.

orhanf commented 8 years ago

@zhangtemplar, here are the reasons why you don't see state-of-the-art BLEU scores by running the example:

  1. In this example, we do not use the entire cs-en corpus but only a small chunk of it. The entire corpus has 12M parallel sentences, but the provided prepare-data script only downloads a subset of it (news-commentary-v10), which has only 150K parallel sentences.
  2. State-of-the-art systems use a lot of additional methods/tricks, such as large vocabularies, ensembles, unk-replacement, language models, rescoring, etc., which are not implemented in this example.

Finally, download the entire data, play with the hyper-parameters, and give it some time :)

tnq177 commented 8 years ago

I tested with En-Fr, using the same data as in the paper, and got 26.57 (the paper reported 26.75 after 5 days) after 5 days 8 hours on a GPU (7-80k iterations/day). Just FYI.

zhangtemplar commented 8 years ago

@orhanf thanks for your explanation. But I still wonder why the difference is so huge, even accounting for less data and fewer tricks.

However, I realized they may use a different metric for BLEU. I found the code reports five values, namely BLEU-1, BLEU-2, BLEU-3, BLEU-4, and finally BLEU, which seems to be a geometric mean of the previous four. For BLEU I can get something close to 30.
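For context on how those five numbers relate, here is a minimal single-sentence sketch (the actual evaluation scripts work at corpus level and may add smoothing, so numbers will differ): the combined BLEU multiplies a brevity penalty by the geometric mean of the clipped 1-4-gram precisions, which is why it is usually much lower than BLEU-1 alone.

```python
import math
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision (this is the BLEU-n number) for one
    token-list pair: overlapping n-grams, capped by the reference count."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

def bleu(candidate, reference, max_n=4):
    """Combined BLEU: brevity penalty times the geometric mean of the
    clipped 1..max_n-gram precisions."""
    precisions = [ngram_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # any zero precision makes the geometric mean zero
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

So comparing a reported BLEU-1 against another system's combined BLEU is apples to oranges; the combined score is the one papers usually quote.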

@tnq177 for 26.57, are you referring to BLEU or BLEU-1?

tnq177 commented 8 years ago

@zhangtemplar BLEU

zhangtemplar commented 8 years ago

@tnq177 interesting. I may need to try en-fr instead.

yanghoonkim commented 8 years ago

@tnq177 Did you just use the code provided by TensorFlow? Can you tell me the details of the model (TensorFlow version, stack size, layer size, sampled loss (sample size), vocab size, etc.)? Thanks

tnq177 commented 8 years ago

@ad26kt no, I used this blocks-examples machine-translation code. I didn't make any changes to the configuration, except using the En-Fr data as detailed in Bahdanau's paper.

yanghoonkim commented 8 years ago

@tnq177 oh, I thought this was the TensorFlow GitHub. Thanks for replying.

yanghoonkim commented 8 years ago

@tnq177 Can you tell me which part of prepare_data.py I should fix to use the En-Fr data as detailed in Bahdanau's paper? I tried several ways, but it caused an error like:

File "prepare_data.py", line 135, in create_vocabularies
    if n.endswith(args.source)][0]]) + '.tok'
IndexError: list index out of range

I'm in a bit of a hurry, so your help would be really appreciated.
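For what it's worth, the IndexError in the traceback above comes from the `[...][0]` pattern in create_vocabularies: it filters the downloaded filenames by the source-language suffix and assumes at least one match. A minimal reproduction of the failure mode (the filenames and variable names here are hypothetical, not the script's exact ones):

```python
# prepare_data.py selects the tokenized training file with a pattern like
#   [n for n in tr_files if n.endswith(args.source)][0] + '.tok'
# If no downloaded filename ends with the requested language code, the
# filtered list is empty and indexing it raises the observed
# "IndexError: list index out of range".
tr_files = ["news-commentary-v10.cs-en.cs", "news-commentary-v10.cs-en.en"]
source = "fr"  # asking for French while only cs-en files are present

matches = [n for n in tr_files if n.endswith(source)]
try:
    tokenized = matches[0] + ".tok"
except IndexError:
    tokenized = None  # the same failure the traceback shows
```

In other words, the fix is to make the script see correctly named En-Fr files (or point it at them), not to change the indexing itself.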

tnq177 commented 8 years ago

@ad26kt I just used that script as a reference. You can get the data here: http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/ (noted in Bahdanau's paper) and preprocess it as the authors describe in the paper (only tokenization, I believe). Then, to create the vocabulary files, you can use GroundHog's preprocess.py (PREPROCESS_URL in prepare_data.py); the command should be something like python preprocess.py -d vocab_file_name.pkl -v number_of_unique_tokens_used train_file. You can use the function shuffle_parallel in prepare_data.py to shuffle the training data files. Finally, write the correct configuration function in configurations.py with the correct paths to the train/dev/test/vocab files. Basically, just follow the steps in prepare_data.py.
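For the shuffling step, the key point is that both files must be permuted with the same order so sentence pairs stay aligned. A rough sketch of that idea (not the exact shuffle_parallel implementation from prepare_data.py):

```python
import random

def shuffle_parallel(src_path, trg_path, seed=1234):
    """Shuffle two parallel corpora in place with one shared permutation,
    so that line i of each output file is still a translation pair."""
    with open(src_path) as f:
        src = f.readlines()
    with open(trg_path) as f:
        trg = f.readlines()
    assert len(src) == len(trg), "parallel files must have the same length"

    order = list(range(len(src)))
    random.Random(seed).shuffle(order)  # fixed seed -> reproducible shuffle

    with open(src_path, "w") as f:
        f.writelines(src[i] for i in order)
    with open(trg_path, "w") as f:
        f.writelines(trg[i] for i in order)
```

Shuffling each file independently would silently destroy the sentence alignment, which is why a single permutation is applied to both sides.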

yanghoonkim commented 8 years ago

@tnq177 Thanks a lot!