salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Reproducing Translation results #35

Closed urialon closed 2 years ago

urialon commented 2 years ago

Hi, Thank you for releasing the model and this repository!

I am trying to reproduce the Java->C# translation results from the paper using CodeT5-base. I ran fine-tuning according to the instructions, and at epoch 15 I got the following best dev results:

[15] Best bleu+em: 150.38 (bleu: 82.18, em: 68.20)

The model early-stopped itself and evaluated on the test set, and these are the results on the test set:

[best-bleu] bleu-4: 83.83, em: 63.7000, codebleu: 0.0000

However, the results reported in the paper are bleu: 84.03 and EM: 65.90.

The BLEU score is sufficiently close to the reported result, but the EM score is 2.2 points below the paper's. Do you have an idea whether my settings differ from those used for the paper, or is this just training randomness?
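For context, the EM (exact match) score is just the percentage of generated programs that are identical to the reference, so each mismatched example moves it by 0.1 on a 1,000-example test set. A minimal sketch of the metric (not the repository's exact evaluator, which may normalize differently):

```python
def exact_match(predictions, references):
    """Percentage of predictions identical to the reference
    (after stripping surrounding whitespace)."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

# One of two hypotheses matches its reference -> 50.0
print(exact_match(["int x = 1;", "return y;"],
                  ["int x = 1;", "return z;"]))
```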

These are the settings from my logs:

03/10/2022 15:22:47 - INFO - __main__ -   Namespace(task='translate', sub_task='java-cs', lang='c_sharp', eval_task='', model_type='codet5', add_lang_ids=False, data_num=-1, start_epoch=0, num_train_epochs=100, patience=5, cache_path='saved_models/translate/java-cs/codet5_base_all_lr5_bs25_src320_trg256_pat5_e100/cache_data', summary_dir='tensorboard', data_dir='/projects/tir4/users/urialon/CodeT5/data', res_dir='saved_models/translate/java-cs/codet5_base_all_lr5_bs25_src320_trg256_pat5_e100/prediction', res_fn='results/translate_codet5_base.txt', add_task_prefix=False, save_last_checkpoints=True, always_save_model=True, do_eval_bleu=True, model_name_or_path='Salesforce/codet5-base', output_dir='saved_models/translate/java-cs/codet5_base_all_lr5_bs25_src320_trg256_pat5_e100', load_model_path=None, train_filename=None, dev_filename=None, test_filename=None, config_name='', tokenizer_name='Salesforce/codet5-base', max_source_length=320, max_target_length=256, do_train=True, do_eval=True, do_test=True, do_lower_case=False, no_cuda=False, train_batch_size=25, eval_batch_size=25, gradient_accumulation_steps=1, learning_rate=5e-05, beam_size=10, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, save_steps=-1, log_steps=-1, max_steps=-1, eval_steps=-1, train_steps=-1, warmup_steps=1000, local_rank=-1, seed=1234)
03/10/2022 15:22:47 - WARNING - configs -   Process rank: -1, device: cuda, n_gpu: 1, distributed training: False, cpu count: 32
03/10/2022 15:22:52 - INFO - models -   Finish loading model [223M] from Salesforce/codet5-base
03/10/2022 15:23:14 - INFO - utils -   Read 10300 examples, avg src len: 13, avg trg len: 15, max src len: 136, max trg len: 118
03/10/2022 15:23:14 - INFO - utils -   [TOKENIZE] avg src len: 45, avg trg len: 56, max src len: 391, max trg len: 404
03/10/2022 15:23:14 - INFO - utils -   Load cache data from saved_models/translate/java-cs/codet5_base_all_lr5_bs25_src320_trg256_pat5_e100/cache_data/train_all.pt
/home/ualon/.conda/envs/3090/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
03/10/2022 15:23:14 - INFO - __main__ -   ***** Running training *****
03/10/2022 15:23:14 - INFO - __main__ -     Num examples = 10300
03/10/2022 15:23:14 - INFO - __main__ -     Batch size = 25
03/10/2022 15:23:14 - INFO - __main__ -     Batch num = 412
03/10/2022 15:23:14 - INFO - __main__ -     Num epoch = 100
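As a sanity check, the batch count in the log follows directly from the dataset size and batch size reported just above it:

```python
import math

# Values taken from the training log above.
num_examples = 10_300  # "Num examples = 10300"
batch_size = 25        # "Batch size = 25"
epochs = 100           # "Num epoch = 100"

batches_per_epoch = math.ceil(num_examples / batch_size)
print(batches_per_epoch)           # 412, matching "Batch num = 412"
print(batches_per_epoch * epochs)  # 41200 optimizer steps if no early stop
```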

Thanks! Uri

yuewang-cuhk commented 2 years ago

Hi Uri,

Thanks for your interest in our work. This seems to be due to training randomness. I reran the experiment and found that the model early-stopped at epoch 28, reproducing a similar result of bleu-4: 84.30, em: 65.50.

Besides, for this code translation task, one empirical finding is that more overfitted checkpoints often work better than the checkpoint selected by dev BLEU score. I would also suggest further tuning some hyper-parameters, such as the early-stopping patience.
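The patience logic being discussed can be sketched as follows (a simplified illustration, not the repository's exact code): training stops once the best dev score has failed to improve for `patience` consecutive evaluations, so a larger patience lets the model overfit longer before stopping.

```python
def early_stop_epoch(dev_scores, patience=5):
    """Return the 1-indexed epoch at which training would stop, or None.

    Stops after `patience` consecutive epochs with no new best dev score,
    mirroring the --patience flag in the logs above.
    """
    best, bad_epochs = float("-inf"), 0
    for epoch, score in enumerate(dev_scores, start=1):
        if score > best:
            best, bad_epochs = score, 0  # new best resets the counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None  # never triggered within the given epochs

# Dev score plateaus after epoch 3; with patience=5, stop at epoch 8.
print(early_stop_epoch([1, 2, 3, 3, 3, 3, 3, 3], patience=5))  # 8
```

Raising `patience` (or simply using `always_save_model=True` and evaluating later checkpoints on the test set) would surface the more overfitted checkpoints mentioned above.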

urialon commented 1 year ago

Thank you!
