Need help to understand unseen language token for translation task

wasiahmad / PLBART

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

https://arxiv.org/abs/2103.06333

MIT License

Need help to understand unseen language token for translation task #24

KanikaKalra closed this issue 3 years ago

KanikaKalra commented 3 years ago

Hi, I am looking at the translation task. C# is not used for pre-training PLBART, yet we are able to fine-tune the model on this unseen language. I want to understand where the language token for C# is added in the data samples created for fine-tuning PLBART.

The translation.py file in the source directory has an init method that adds language tokens to the dictionary from the langs variable, which contains only java, python, and en_XX, as defined in PLBART/scripts/code_to_code/translation/run.sh.

Can you please help with this?
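To make the question concrete, here is a minimal sketch of a dictionary that is extended only from a comma-separated langs string. It is not the code from translation.py: the function name `build_dictionary`, the base symbols, and the `[lang]` spelling of the language tokens are all assumptions made for illustration.

```python
def build_dictionary(base_symbols, langs):
    """Return (symbols, index) after appending one token per listed language.

    base_symbols: symbols already in the vocabulary (illustrative values).
    langs: comma-separated language names, e.g. "java,python,en_XX".
    """
    symbols = list(base_symbols)
    index = {symbol: i for i, symbol in enumerate(symbols)}
    for lang in langs.split(","):
        token = "[{}]".format(lang)  # assumed token spelling
        if token not in index:
            index[token] = len(symbols)
            symbols.append(token)
    return symbols, index


symbols, index = build_dictionary(
    ["<s>", "<pad>", "</s>", "<unk>"], "java,python,en_XX"
)
# Only the three listed languages receive a token; C# has no entry.
print(symbols[-3:])
print("[c_sharp]" in index)
```

With langs limited to java, python, and en_XX, no C# token exists in such a dictionary, which is the situation the question describes.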

wasiahmad commented 3 years ago

Hi,

We did not use language tokens for any language other than Java, Python, and English. Since PLBART is fine-tuned separately for each task and language setting, it does not matter whether a language token is used.

However, we recently ran some experiments on multilingual fine-tuning (REF), where we introduced language tokens for Ruby, Go, PHP, and JavaScript. You can check this script to see how new embeddings can be inserted into the pre-trained checkpoint.
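The general idea of inserting new embeddings can be sketched as follows. This is not the script referred to above: it assumes the embedding matrix is a plain list of rows (in a real checkpoint it would be a tensor), that new rows are drawn from a small Gaussian, and that tokens are spelled `[lang]`. The name `add_language_tokens` and its parameters are hypothetical.

```python
import random


def add_language_tokens(vocab, embeddings, new_langs, std=0.02, seed=0):
    """Append one embedding row per new language token.

    vocab: dict mapping token -> row index.
    embeddings: list of rows, each a list of floats of equal length.
    Returns new copies; the inputs are left unmodified.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    vocab = dict(vocab)
    embeddings = [row[:] for row in embeddings]
    for lang in new_langs:
        token = "[{}]".format(lang)  # assumed token spelling
        if token in vocab:
            continue  # already present, keep its pre-trained row
        vocab[token] = len(embeddings)
        # Assumed initialisation: small random values for the new row.
        embeddings.append([rng.gauss(0.0, std) for _ in range(dim)])
    return vocab, embeddings


old_vocab = {"<s>": 0, "[java]": 1}
old_rows = [[0.0, 0.0], [1.0, 1.0]]
new_vocab, new_rows = add_language_tokens(old_vocab, old_rows, ["ruby", "go"])
print(len(new_rows), new_vocab["[ruby]"], new_vocab["[go]"])
```

The pre-trained rows keep their indices and values, so only the rows for the new tokens have to be learned during fine-tuning.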

KanikaKalra commented 3 years ago

Thanks for your reply.

Could you please verify my understanding? When fine-tuning PLBART for the Java to C# translation task, the input and output format is not the same as the one in Table 3 of the paper: language tokens are neither appended to the source sequence nor prepended to the target sequence. E.g., Input: _public _void.....[eos], without the language token.

Would it be possible for you to share a sample input and output for the Java to C# fine-tuning task?

wasiahmad commented 3 years ago

Your understanding is correct. We have already shared the datasets and fine-tuned checkpoints. Here is an example from the test set.

# c_sharp
▁public ▁override ▁void ▁Serialize ( IL ittle Endian Output ▁out 1 ){ out 1. Write Short ( field _1_ v center ); }

# java
▁public ▁void ▁serialize ( L ittle Endian Output ▁out ) ▁{ out . write Short ( field _1_ v center ); }
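The two formats discussed in this thread can be contrasted with a small sketch. It follows the description given above (language token appended to the source and prepended to the target, or omitted entirely). The helper `format_pair` and the literal spellings `[eos]` and `[java]` are assumptions for illustration, not the repository's preprocessing code.

```python
EOS = "[eos]"  # assumed spelling of the end-of-sequence marker


def format_pair(src_tokens, tgt_tokens, src_lang=None, tgt_lang=None):
    """Build (source, target) token lists, with optional language tokens.

    With language names given, the source ends with its language token and
    the target starts with its language token. With None, neither is added.
    """
    source = list(src_tokens) + [EOS]
    if src_lang is not None:
        source.append("[{}]".format(src_lang))
    target = list(tgt_tokens) + [EOS]
    if tgt_lang is not None:
        target.insert(0, "[{}]".format(tgt_lang))
    return source, target


java = ["▁public", "▁void", "▁serialize"]
csharp = ["▁public", "▁override", "▁void", "▁Serialize"]

# Java to C# fine-tuning as confirmed above: no language tokens.
plain_src, plain_tgt = format_pair(java, csharp)
# Format with language tokens, shown for two languages that have tokens.
tagged_src, tagged_tgt = format_pair(java, csharp, "java", "python")
print(plain_src)
print(tagged_src)
print(tagged_tgt)
```

In the untagged case the sequences contain only the subword pieces and the end-of-sequence marker, matching the example input in the question.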