mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes
https://mlcommons.org/en/groups/training

Clarify BERT Eval Boundary #362

Open bitfort opened 4 years ago

bitfort commented 4 years ago

Proposal:

We seek to clarify how batch boundaries should be handled for BERT evaluation.

(1) The rules state: "BERT Starting at 3M samples, then every 500K samples"

(2) We believe the following section of the rules also applies: 9.5. Equivalence exceptions "If data set size is not evenly divisible by batch size, one of several techniques may be used. The last batch in an epoch may be composed of the remaining samples in the epoch, may be padded, or may be a mixed batch composed of samples from the end of one epoch and the start of the next. If the mixed batch technique is used, quality for the ending epoch must be evaluated after the mixed batch. If the padding technique is used, the first batch may be padded instead of the last batch."

(3) We believe that the BERT reference evaluates every 499,992 samples (which is every 20833 batches with a batch size of 24). Thus, we believe the reference is "rounding down" the eval interval to a whole number of batches, which is not in line with our understanding of the batch-boundary rules (see the first sketch below this list).

(4) We believe an appropriate way to handle BERT evaluation is to fill the last batch of the eval window using the "mixed batch" approach (i.e. adding extra examples to fill up the last batch); see the second sketch below. There may be other appropriate ways of handling this.
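
A minimal arithmetic sketch of point (3), assuming the reference's global batch size of 24 and the 500K-sample interval from the rules (the names here are ours, not from the reference code):

```python
BATCH_SIZE = 24        # reference global batch size
INTERVAL = 500_000     # nominal eval interval from the rules

# Rounding the interval down to a whole number of batches gives 20,833 batches,
# i.e. 20,833 * 24 = 499,992 samples, so each eval boundary lands 8 samples
# earlier than the nominal 500K boundary.
batches_rounded_down = INTERVAL // BATCH_SIZE             # 20833
samples_rounded_down = batches_rounded_down * BATCH_SIZE  # 499992

print(batches_rounded_down, samples_rounded_down, INTERVAL - samples_rounded_down)
# -> 20833 499992 8
```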
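And a sketch of the "mixed batch" handling proposed in point (4), under the same assumptions: the window is covered by one extra batch whose tail is filled with samples from the start of the next window (or with padding), and quality is evaluated after that mixed batch.

```python
import math

BATCH_SIZE = 24
INTERVAL = 500_000

# Covering the full 500K-sample window takes 20,834 batches; the last one is
# "mixed": its first 8 samples complete this window and the remaining 16 come
# from the start of the next window (or are padding, if padding is used).
batches_in_window = math.ceil(INTERVAL / BATCH_SIZE)                        # 20834
samples_from_this_window = INTERVAL - (batches_in_window - 1) * BATCH_SIZE  # 8
samples_from_next_window = batches_in_window * BATCH_SIZE - INTERVAL        # 16

print(batches_in_window, samples_from_this_window, samples_from_next_window)
# -> 20834 8 16
```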

bitfort commented 4 years ago

SWG:

We believe the reference is not compliant with the rules, but since it is the reference, we will accept both the rule-compliant behavior and the reference behavior this round. Specifically, for this round we clarify that a submission is acceptable as long as each evaluation point falls within one batch size of the nominal 500K-sample boundary. We will revisit this as part of our general rework of BERT eval next round.
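
As a hedged illustration of the interim clarification (the helper name and the 8192 global batch size are our own assumptions, not from the rules): an eval point is treated as acceptable this round if it falls within one global batch size of the nominal boundary.

```python
def within_one_batch(samples_seen: int, nominal_boundary: int,
                     global_batch_size: int) -> bool:
    """Interim check: eval point within one batch size of the nominal boundary."""
    return abs(samples_seen - nominal_boundary) <= global_batch_size

# Example: a submission with an assumed global batch size of 8192 that
# evaluates at the batch boundary nearest each nominal point (3M, 3.5M, ...).
gbs = 8192
for i in range(4):
    nominal = 3_000_000 + i * 500_000
    actual = round(nominal / gbs) * gbs   # nearest whole number of batches
    print(nominal, actual, within_one_batch(actual, nominal, gbs))
```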