mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes
https://mlcommons.org/en/groups/training
Apache License 2.0

Bert Evaluation before 3M #351

Open bitfort opened 4 years ago

bitfort commented 4 years ago

We chose to start evaluating BERT at 3M samples under the belief that every submission would converge after 3M samples. We now have evidence this is not true. We want to discuss lifting this constraint (in light of the wrong assumption we made when writing the rules) and evaluating every 500K samples instead.
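The difference between the two schedules can be sketched as follows. This is only an illustrative sketch: the helper name and the total sample budget are hypothetical, not taken from the MLPerf rules; the only numbers from the discussion are the 3M-sample first-evaluation point and the 500K-sample interval.

```python
def eval_points(first_eval, interval, max_samples):
    """Return the sample counts at which evaluation would run,
    starting at first_eval and repeating every interval samples."""
    points = []
    n = first_eval
    while n <= max_samples:
        points.append(n)
        n += interval
    return points

# Current rule as described above: no evaluation before 3M samples.
# (5M is an assumed illustrative budget, not a number from the rules.)
current = eval_points(3_000_000, 500_000, 5_000_000)

# Proposed relaxation: evaluate every 500K samples from the start,
# so runs that converge before 3M samples can stop earlier.
proposed = eval_points(500_000, 500_000, 5_000_000)
```

Under these illustrative numbers, the current rule yields 5 evaluation points and the proposed one yields 10, which is the cost/benefit trade-off the thread is debating.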

jonathan-cohen-nvidia commented 4 years ago

We brought this up two weeks ago, and Google specifically objected to evaluating more frequently. The 3M number was proposed by a Google engineer.

I think any change at this point is bad: some companies may have already started their submission runs, since there are only 10 days left.

Suggest we defer this to 0.8.

bitfort commented 4 years ago

SWG:

We had a long discussion about BERT evaluation today. The resolution from this discussion:

We want to make sure we have the following takeaways from this conversation: