nxphi47 / tree_transformer

Submission to ICLR

Reproducing results for IWSLT En-De #6

Open neerajgangwar opened 1 year ago

neerajgangwar commented 1 year ago

Hi,

I am trying to reproduce the results on IWSLT En-De. I followed the instructions in the README file but was not able to achieve the BLEU score reported in the paper. To run the code, I made some fixes:

Changes: https://github.com/neerajgangwar/tree_transformer/tree/fixes
Diff: https://github.com/nxphi47/tree_transformer/pull/5

It would be great if you could help me reproduce the results mentioned in the paper.

Thank you!

nxphi47 commented 1 year ago

Hi,

Thank you for your interest in the paper. There are a few possible reasons.

  1. Many dependencies have changed significantly since then, such as fairseq and the BLEU calculation (which is not sacrebleu but a BLEU with special tokenization and post-processing developed by Google's tensor2tensor back then, to stay consistent with Vaswani et al., 2017). The biggest discrepancy likely comes from the different BLEU implementations.
  2. Many dependencies are also gone or deprecated.
  3. The experiments may not have been set up correctly; pay attention to the batch size, gradient accumulation, and the number of GPUs. You should mimic the 128-GPU setup (8*16) with a batch size of at least 2048 tokens per GPU; in general, the effective batch size should be as high as possible. Training longer and applying checkpoint averaging will also help.
  4. This repo is also no longer maintained (my fault, I am sorry for that), and I have lost many of the details.
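On point 1, the tokenization mismatch can be illustrated with a rough, pure-Python sketch of the tensor2tensor-style BLEU tokenizer (the real `bleu_hook.py` uses Unicode property classes; this simplified regex is only illustrative):

```python
import re

# Rough sketch of tensor2tensor's BLEU tokenizer: every non-alphanumeric
# character becomes its own token. SacreBLEU's default '13a' tokenizer
# behaves differently (e.g. it keeps word-internal apostrophes attached),
# so BLEU scores computed under the two schemes are not directly comparable.
def t2t_tokenize(text):
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    return text.split()

print(t2t_tokenize("It doesn't work."))
# ['It', 'doesn', "'", 't', 'work', '.']
```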
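On point 3, a quick sanity check for the effective batch size, assuming fairseq semantics where `--max-tokens` is per GPU and `--update-freq` accumulates gradients (the concrete numbers are just one possible way to match the setup above):

```python
# Effective tokens per optimizer step = tokens/GPU * GPUs * accumulation steps.
# The first line assumes the 128-GPU (8*16), 2048-tokens-per-GPU setup from
# the comment above; the second shows one single-GPU configuration matching it.
def effective_tokens(max_tokens, num_gpus, update_freq):
    return max_tokens * num_gpus * update_freq

paper_scale = effective_tokens(2048, 128, 1)   # 262144 tokens per update
single_gpu = effective_tokens(4096, 1, 64)     # same effective batch size
assert single_gpu == paper_scale
```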

For a correct experiment, I expect performance at least 1 BLEU higher than the Transformer baseline when both are scored with the same BLEU implementation. I suggest you simply reimplement the model part of this codebase in your own training and evaluation pipeline with the latest SacreBLEU implementation. If you observe loss divergence or significantly lower performance, there is likely a bug in the code or an incorrect setup.

Hope this helps. Sorry for the inconvenience.

neerajgangwar commented 1 year ago

Hi @nxphi47,

Thank you for your response. I am running the code in this repo using the instructions provided in the README file. The only modifications I have made are fixes for places where the code breaks, but I could not reproduce the results for IWSLT En-De. Let me recheck the settings and make sure the points you mentioned are taken care of. I will get back to you with the results.

Thanks again for your suggestions. Appreciate the quick response!

neerajgangwar commented 1 year ago

Hi @nxphi47,

I exported the model provided in this repository and ran it with fairseq v0.12.3. I used the model dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubenc_allcross_hier as mentioned in the README file. The code is present here.

For training, I used a batch size of 4096 tokens and ran the training for 61000 steps, with the other parameters as mentioned in the README file and kept identical between the Transformer and the Tree Transformer. I also tried five different seeds to ensure random initialization was not an issue. Evaluating the best checkpoint resulted in a BLEU score of $27.892 \pm 0.060$ with the Transformer and $27.502 \pm 0.104$ with the Tree Transformer. I also tried averaging the last 10 checkpoints, which resulted in a BLEU score of $28.528 \pm 0.064$ with the Transformer and $28.088 \pm 0.123$ with the Tree Transformer.
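For reference, the checkpoint averaging used here is just an element-wise mean over the saved parameters. A toy sketch with plain dicts of floats standing in for torch state dicts (fairseq ships the real implementation as `scripts/average_checkpoints.py`):

```python
# Average each parameter across the last N checkpoints. Plain floats stand
# in for tensors here; with torch tensors, you would sum and divide the
# same way for each key of the state dict.
def average_checkpoints(state_dicts):
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}

ckpts = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}]
print(average_checkpoints(ckpts))  # {'w': 2.0, 'b': 1.0}
```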

I am using the config mentioned in the README file, and I am not sure if I am missing any other configuration you used for the results in the paper. Any suggestions or input will be appreciated.

Thank you!