google-research / multilingual-t5

Apache License 2.0

How mt5 checkpoint is selected on TydiQA dataset #69

Closed dorost1234 closed 3 years ago

dorost1234 commented 3 years ago

Hi. This dataset only has a train/validation split, so I assume the reported numbers are on the validation set. In that case, could you clarify how you select the best checkpoint? On which dataset is validation done? Thank you.

lintingxue commented 3 years ago

Hi. Unlike the other tasks, TyDi QA only has a validation set and doesn't have a test set. We report the numbers on the validation set, and the best checkpoint is chosen based on the average F1 across languages.
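For illustration, the selection rule described above (pick the checkpoint with the highest F1 averaged over languages) could be sketched as follows. The checkpoint steps, language codes, and scores here are made-up placeholders, not actual mT5 results:

```python
# Hypothetical per-language dev F1 scores, keyed by checkpoint step.
dev_f1 = {
    1000: {"en": 71.2, "ar": 67.8, "bn": 55.1},
    2000: {"en": 73.5, "ar": 69.0, "bn": 58.4},
    3000: {"en": 72.9, "ar": 70.2, "bn": 57.0},
}

def best_checkpoint(scores):
    """Return the checkpoint step with the highest average F1 across languages."""
    return max(scores, key=lambda step: sum(scores[step].values()) / len(scores[step]))

print(best_checkpoint(dev_f1))  # -> 2000 (highest average F1 in this toy example)
```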

The TyDi QA dev set is defined in tasks.py: https://github.com/google-research/multilingual-t5/blob/a694d1af8d52de5c46b649cb1b9e3b0ba8405a88/multilingual_t5/tasks.py#L283.

We use the standard TyDi QA validation set from XTREME: https://github.com/google-research/xtreme.

The dataset is also available in TFDS: https://www.tensorflow.org/datasets/catalog/tydi_qa


dorost1234 commented 3 years ago

Hi. So you mean you both tune and report the results on the validation set? I am asking in the context of zero-shot performance: this is then not zero-shot, and it is also poor machine learning practice to tune the model on the same set you report results on. Could you please clarify? Thanks.

For other benchmarks like XNLI, did you also tune and report on the validation set?