Closed sgpyc closed 1 year ago
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅
Hi @sgpyc , can you confirm the exact reference code you used to generate these so others can reproduce?
Also, does this repro on TPUv3? That would be best for others to reproduce.
Hi @sgpyc , here's some data for discussion.
Codebase | HW | Average Convergence (M samples) | Reproducible? | Hparam set
---|---|---|---|---
Reference, GA4 | A100 | 2.68 | Yes | Habana v2.0
Reference, GA2 | TPUv3-8 | 2.68 | Yes | RCP
Reference, GA1 | TPUv3-8 | 2.67 | No - 17/28 runs fail to converge | Habana v2.0
New RCP Proposal | TPUv4-32 | 2.41 | No - code and HW access | Habana v2.0
As you can see, this new RCP does not appear reproducible by others at the moment: we're not sure which version of the code these runs used, and we don't have access to TPUv4 at this scale, so we couldn't run it even if we had the code. The fact that it converges significantly faster than the old RCPs in various scenarios amplifies the concern. Is there a way you could try on TPUv3? That hardware would be more accessible to others for reproducibility.
Hi @sgpyc , are the changes in https://github.com/mlcommons/training/pull/507 included in your latest runs? We believe input pipeline changes could be the difference, and that PR mentions some changes to the input pipeline.
@sgpyc the v2.1 folder has been created and is currently in the master branch.
Moved to the v2.1 folder. Please consider merging.
I did try the HP set on TPUv3-32 with TF2.10 for 2 runs. They converged in 5768 and 5600 steps. BERT at low batch sizes seems to have a larger variance in convergence, but it's quite clear this HP set does converge.
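For comparison against the table above, the two step counts can be converted into millions of training samples, the unit the RCP table appears to use. This is a minimal sketch, assuming a global batch size of 448 (from the "BS448" configuration mentioned in this PR); it is not the MLCommons RCP checker, and the helper name is hypothetical.

```python
GLOBAL_BATCH_SIZE = 448  # assumed from the "BS448" configuration in this PR

def steps_to_msamples(steps: int, batch_size: int = GLOBAL_BATCH_SIZE) -> float:
    """Convert optimizer steps into millions of samples consumed."""
    return steps * batch_size / 1e6

# Step counts reported for the two TPUv3-32 runs above.
runs = [5768, 5600]
per_run = [steps_to_msamples(s) for s in runs]
average = sum(per_run) / len(per_run)
print([round(m, 2) for m in per_run], round(average, 2))
# → [2.58, 2.51] 2.55
```

Under that batch-size assumption, the two runs land around 2.5M samples, in the same range as the proposed 2.41M RCP rather than the ~2.68M of the older RCPs.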
DO NOT MERGE into the v2.0 folder.
BS448 RCP data based on Habana's HP set from the training v2.0 round. Intended to be checked in to the v2.1 folder.