Closed · urialon closed this issue 2 years ago
Hi Uri,
Thanks for your interest in our work. It seems this is due to some training randomness. I've rerun the experiment; the model early-stopped at epoch 28 and reproduced a similar result: BLEU-4 84.30, EM 65.50.
Besides, for this code translation task, one empirical finding is that more-overfitted checkpoints often work better than checkpoints selected by dev BLEU score. I would also suggest further tuning some hyper-parameters, such as the early-stopping patience.
Thank you!
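For readers unfamiliar with the patience knob mentioned above, here is a minimal sketch of dev-score early stopping. The function and callback names are hypothetical for illustration; this is not the CodeT5 training script's actual API.

```python
def train_with_patience(model, train_one_epoch, evaluate_dev_bleu,
                        max_epochs=50, patience=5):
    """Train until dev BLEU fails to improve for `patience` epochs.

    `train_one_epoch` and `evaluate_dev_bleu` are assumed callbacks
    (hypothetical names), not functions from the CodeT5 repository.
    """
    best_bleu = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        bleu = evaluate_dev_bleu(model)
        if bleu > best_bleu:
            best_bleu, best_epoch = bleu, epoch
            epochs_without_improvement = 0  # improvement: reset the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no dev-BLEU gain for `patience` epochs: stop
    return best_epoch, best_bleu
```

A larger `patience` lets training run further past the dev-BLEU peak, which, per the observation above, can yield more-overfitted checkpoints that score better on this task's test set.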
Hi,
Thank you for releasing the model and this repository!
I am trying to reproduce the Java->C# translation results from the paper using CodeT5-base. I ran it according to the instructions, and at the 15th epoch I got these dev results:
The model early-stopped itself and evaluated on the test set, and these are the results on the test set:
However, the results reported in the paper are BLEU 84.03 and EM 65.90.
The BLEU result is sufficiently close to the reported one, but EM is 2.2% below the paper's number. Do you know whether the released settings differ from those used in the paper, or is this just training randomness?
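For context on the EM gap discussed here, this is a small sketch of how corpus-level exact match is commonly computed for code translation: the percentage of hypotheses that are identical to their reference after whitespace normalization. This is an assumption about the metric's usual definition, not the exact evaluation code used in the CodeT5 repository.

```python
def exact_match(hypotheses, references):
    """Percentage of hypothesis strings identical to their reference
    after collapsing runs of whitespace (an assumed normalization)."""
    assert len(hypotheses) == len(references)
    hits = sum(
        " ".join(h.split()) == " ".join(r.split())
        for h, r in zip(hypotheses, references)
    )
    return 100.0 * hits / len(references)
```

Because EM is all-or-nothing per example, it is far more sensitive to checkpoint choice and training randomness than BLEU, which averages n-gram overlap; a 2.2% EM swing can coexist with a nearly unchanged BLEU score.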
These are the settings from my logs:
Thanks!
Uri