sgpyc closed this 1 year ago
Could you add more details on how this would work?
This is to allow models intended for languages other than English. This change will enable potential submitters to use non-English versions of C4 as the eval dataset in open submissions. There is no initial checkpoint or eval target for a non-English eval dataset, so potential submitters need to come up with their own and specify them in their submission, similar to what Google did with a 480B model for BERT in the open division.
The current rule about datasets (Section 6) allows submitters to choose any public dataset for the open division, as follows:
OPEN: Any public dataset may be used for training the model, however the evaluation data must be drawn from the benchmark dataset in a manner consistent with the reference.
However, the term 'benchmark dataset' seems ambiguous. Does it mean the dataset defined in Section 3 as the reference, or one chosen by the submitter for the open division? If the former, a rule fix may be needed.
My understanding of the quoted rule is that the open division still uses the same eval dataset as the closed division, hence this PR.