jungokasai / deep-shallow


How to get your BLEU score? #3

Closed stas00 closed 4 years ago

stas00 commented 4 years ago

Could you please help me replicate the reported scores? When I follow your instructions, I don't get the same scores.

What I did:

git clone https://github.com/jungokasai/deep-shallow/
cd deep-shallow

pip install gdown

mkdir -p data

# get model
gdown 'https://drive.google.com/uc?id=1x_G2cjvM1nW5hjAB8-vWxRqtQTlmIaQU'
gdown 'https://drive.google.com/uc?id=1mNufoynJ9-Zy1kJh2TA_lHm2squji0i9'
tar -xvzf trans_ende-dist_12-1_0.2.tar.gz
tar -xvzf wmt16.en-de.deep-shallow.dist.tar.gz

# Reported BLEU Score: 28.3

python generate.py wmt16.en-de.deep-shallow.dist/data-bin/ --path trans_ende-dist_12-1_0.2/checkpoint_best.pt --beam 5 --remove-bpe  --lenpen 1.0 --max-sentences 10
# [...]
# Generate test with beam=5: BLEU4 = 27.37, 58.7/33.2/21.0/13.7 (BP=1.000, ratio=1.002, syslen=63193, reflen=63078)

python generate.py wmt16.en-de.deep-shallow.dist/data-bin/ --path trans_ende-dist_12-1_0.2/checkpoint_top5_average.pt --beam 5 --remove-bpe  --lenpen 1.0 --max-sentences 10
# [...]
# Generate test with beam=5: BLEU4 = 27.66, 59.0/33.5/21.2/14.0 (BP=1.000, ratio=1.004, syslen=63331, reflen=63078)

python generate.py wmt16.en-de.deep-shallow.dist/data-bin/ --path trans_ende-dist_12-1_0.2/checkpoint_last.pt --beam 5 --remove-bpe  --lenpen 1.0 --max-sentences 10
# [...]
# Generate test with beam=5: BLEU4 = 27.67, 59.2/33.6/21.2/13.9 (BP=1.000, ratio=1.000, syslen=63052, reflen=63078)

So it appears that checkpoint_last.pt gets the best score and not checkpoint_best.pt, but it's still below the advertised score.

What am I doing wrong?

Thank you!

jungokasai commented 4 years ago

Yes, I chose checkpoint_top5_average.pt, as explained in the paper. For BLEU calculation, we applied compound splitting, following prior work including Vaswani et al. (2017); it typically yields a 0.5+ BLEU improvement. The compound-split function is available here.

stas00 commented 4 years ago

Thank you very much for sharing these details and the links, @jungokasai!

hoangcuong2011 commented 3 years ago

Hi @jungokasai,

I noticed that on the German side, the Europarl corpus in your preprocessed data (wmt16.en-de.deep-shallow.dist.tar.gz) is different from the preprocessed data at https://google.github.io/seq2seq/nmt/ and from the original WMT shared task data (this link). You can easily see this by opening your file next to the original training-parallel-europarl-v7.tgz.

Do you know why this happens? Thanks a lot.

Best,

jungokasai commented 3 years ago

Sorry if I'm misunderstanding you, but it should be different: the German side of that data is the result of sequence-level knowledge distillation from a transformer large model, so the targets are the teacher's translations rather than the original references.