Closed · KanikaKalra closed this issue 3 years ago
Hi,
Other than Java, Python, and English, we did not use language tokens. Since PLBART is fine-tuned individually on each task and language setting, whether the language token is used does not matter.
However, we recently ran some experiments on multilingual fine-tuning (REF), where we introduced language tokens for Ruby, Go, PHP, and JavaScript. You can check this script to see how to insert new embeddings into the pre-trained checkpoint.
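The general idea behind that script can be sketched as follows: to support new language tokens (e.g. for Ruby, Go, PHP, JavaScript), extra rows are appended to the pretrained embedding matrix. This is a minimal NumPy sketch under assumptions, not the actual script: the function name `extend_embeddings`, the mean-plus-noise initialization, and the matrix sizes are all illustrative choices, and the real script may initialize new rows differently.

```python
import numpy as np

def extend_embeddings(weight, num_new_tokens, seed=0):
    """Append rows for new tokens to a pretrained embedding matrix.

    New rows are initialized to the mean of the existing embeddings
    plus small noise -- one common heuristic; the actual PLBART script
    may use a different initialization.
    """
    rng = np.random.default_rng(seed)
    mean = weight.mean(axis=0, keepdims=True)            # shape (1, dim)
    noise = rng.normal(scale=0.02,
                       size=(num_new_tokens, weight.shape[1]))
    new_rows = mean + noise                              # (num_new, dim)
    return np.concatenate([weight, new_rows], axis=0)

# Hypothetical pretrained matrix; the dimensions here are only for
# illustration, not the real checkpoint's vocabulary size.
pretrained = np.zeros((50005, 768))
extended = extend_embeddings(pretrained, num_new_tokens=4)
print(extended.shape)  # (50009, 768)
```

After extending the matrix, the new rows line up with the newly added dictionary entries, so the fine-tuning data can reference the new tokens.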
Thanks for your reply.
Requesting you to please verify my understanding: when fine-tuning PLBART for the Java-to-C# translation task, the input and output format is not the same as in Table 3 of the paper, i.e., the language token is neither appended to the source nor prepended to the target sequence. E.g., input: `_public _void.....[eos]`, without the language token.
Would it be possible for you to share a sample input/output pair for the Java-to-C# fine-tuning task?
Your understanding is correct. We have already shared the datasets and fine-tuned checkpoints. Here is an example from the test set:
```
# c_sharp
▁public ▁override ▁void ▁Serialize ( IL ittle Endian Output ▁out 1 ){ out 1. Write Short ( field _1_ v center ); }
# java
▁public ▁void ▁serialize ( L ittle Endian Output ▁out ) ▁{ out . write Short ( field _1_ v center ); }
```
Hi, I am looking at the translation task. C# was not used for pre-training PLBART, yet we are able to fine-tune it for the unseen language C#. I want to understand where the language token for C# is added in the data samples created for fine-tuning PLBART.
The `translation.py` file in the source directory has an `__init__` method that adds a language token to the dictionary for each entry in the `langs` variable, which contains only `java,python,en_XX`, as defined in `PLBART/scripts/code_to_code/translation/run.sh`.
Can you please help with this?
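The logic described above can be sketched in a few lines. This is a simplified stand-in for fairseq's dictionary, not PLBART's actual code: the `Dictionary` class, the `__lang__` token format, and the variable names are illustrative assumptions. It only shows that when `langs` is `java,python,en_XX`, no token for C# ever enters the dictionary, which is consistent with fine-tuning on C# without a language token.

```python
# Minimal stand-in for a fairseq-style dictionary (illustrative only).
class Dictionary:
    def __init__(self):
        self.symbols = []
        self.indices = {}

    def add_symbol(self, sym):
        # Assign the next free index to a new symbol; idempotent.
        if sym not in self.indices:
            self.indices[sym] = len(self.symbols)
            self.symbols.append(sym)
        return self.indices[sym]

langs = "java,python,en_XX"  # value from run.sh; note C# is NOT listed
d = Dictionary()
for lang in langs.split(","):
    # The exact surface form of the token is an assumption here.
    d.add_symbol(f"__{lang}__")

print(d.symbols)  # ['__java__', '__python__', '__en_XX__']
```

Because no C# token is ever registered, the fine-tuning samples for Java-to-C# simply omit language tokens on both source and target, matching the example shared earlier in this thread.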