[LLM] Add exception for using different language in eval dataset

mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes

https://mlcommons.org/en/groups/training

Apache License 2.0

[LLM] Add exception for using different language in eval dataset #517

Closed sgpyc closed 1 year ago

github-actions[bot] commented 1 year ago

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

peladodigital commented 1 year ago

Could you add more details on how this would work?

sgpyc commented 1 year ago

This is to allow models intended for languages other than English. The change would enable potential submitters to use non-English versions of C4 as the eval dataset in open submissions. There is no initial checkpoint or eval target for a non-English eval dataset, so potential submitters need to come up with their own and specify them in their submission, similar to what Google did with a 480B model for BERT in the open division.

fujitsu-notsu commented 1 year ago

The current rule about datasets (Section 6) allows submitters to choose any public dataset for the open division, as follows:

OPEN: Any public dataset may be used for training the model, however the evaluation data must be drawn from the benchmark dataset in a manner consistent with the reference.

However, the term 'benchmark dataset' seems ambiguous. Does it mean the dataset defined in Section 3 as the reference, or the one chosen by the submitter for the open division? If the former, a rule fix may be needed.

sgpyc commented 1 year ago

My understanding of the quoted rule is that the open division still uses the same eval dataset as the closed division, hence this PR.