Closed dorost1234 closed 3 years ago
Hi, unlike other tasks, TyDi QA only has a validation set and doesn't have a test set. We report numbers on the validation set, and the best checkpoint is chosen based on the average F1 across languages.
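A minimal sketch of the selection rule described above: pick the checkpoint whose F1, averaged over languages, is highest. All function names, steps, and scores here are illustrative, not from the mT5 codebase.

```python
# Hypothetical sketch of checkpoint selection by average F1 across languages.
# `f1_by_checkpoint` maps checkpoint step -> {language code: F1 score}.

def select_best_checkpoint(f1_by_checkpoint):
    """Return the checkpoint step with the highest mean F1 over languages."""
    def mean_f1(scores):
        return sum(scores.values()) / len(scores)
    return max(f1_by_checkpoint, key=lambda step: mean_f1(f1_by_checkpoint[step]))

# Example with made-up scores:
scores = {
    1000: {"en": 70.0, "ar": 60.0, "sw": 55.0},
    2000: {"en": 72.0, "ar": 63.0, "sw": 58.0},
    3000: {"en": 71.0, "ar": 62.0, "sw": 57.0},
}
print(select_best_checkpoint(scores))  # → 2000
```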
The TyDi QA dev set is defined in tasks.py: https://github.com/google-research/multilingual-t5/blob/a694d1af8d52de5c46b649cb1b9e3b0ba8405a88/multilingual_t5/tasks.py#L283.
We use the standard TyDi QA validation set from XTREME: https://github.com/google-research/xtreme.
The dataset is also available in TFDS: https://www.tensorflow.org/datasets/catalog/tydi_qa
On Sat, Apr 3, 2021 at 9:05 AM dorost1234 @.***> wrote:
Hi, this dataset only has train/validation sets. I assume the reported numbers are on the validation set; in that case, could you clarify how you select the best checkpoint, and on which dataset validation is done? Thank you.
— https://github.com/google-research/multilingual-t5/issues/69
Hi, so you mean you evaluate and report results on the validation set? I am asking about the zero-shot setting: that would then not be zero-shot, and it is also poor machine-learning practice to tune a model on the same set you report results on. Could you please clarify? Thanks.
For other benchmarks like XNLI, did you also tune and report on the validation set?