mlcommons / logging

MLPerf™ logging library
https://mlcommons.org/en/groups/best-practices-benchmark-infra
Apache License 2.0

[Bert] update bs448 RCP #267

Closed · sgpyc closed this 1 year ago

sgpyc commented 2 years ago

DO NOT MERGE into the v2.0 folder.

BS448 RCP data, based on Habana's hyperparameter (HP) set from the training v2.0 round. The intent is to check this in to the v2.1 folder.
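For readers unfamiliar with the format: RCP (Reference Convergence Point) data in this repo lives in JSON files under the rcp_checker directory (e.g. rcps_bert.json). A minimal sketch of the shape a BS448 entry would take, written as a Python dict, with placeholder hyperparameter and convergence values rather than the actual Habana numbers:

```python
# Illustrative shape of an rcps_bert.json entry only; all values below
# are placeholders, not the data proposed in this PR.
bert_ref_448 = {
    "Benchmark": "bert",
    "Creator": "Habana",                       # origin of the HP set, per this PR
    "Platform": "TPU-v4-32",                   # platform the RCP runs were done on
    "BS": 448,                                 # global batch size
    "Hyperparams": {
        "opt_base_learning_rate": 4e-4,        # placeholder
        "gradient_accumulation_steps": 1,      # placeholder
    },
    # For BERT, "epochs" are recorded as training samples consumed.
    "Epochs to converge": [2400000, 2450000],  # placeholder run results
}
```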

github-actions[bot] commented 2 years ago

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

johntran-nv commented 2 years ago

Hi @sgpyc , can you confirm the exact reference code you used to generate these so others can reproduce?

Also, does this repro on TPUv3? That would be the most accessible platform for others to reproduce on.

johntran-nv commented 2 years ago

Hi @sgpyc , here's some data for discussion.

| Codebase | HW | Average Convergence (M) | Reproducible? | Hparam set |
|---|---|---|---|---|
| Reference, GA4 | A100 | 2.68 | Yes | Habana v2.0 |
| Reference, GA2 | TPUv3-8 | 2.68 | Yes | RCP |
| Reference, GA1 | TPUv3-8 | 2.67 | No - 17/28 fail to converge | Habana v2.0 |
| New RCP Proposal | TPUv4-32 | 2.41 | No - code and hw access | Habana v2.0 |

As you can see, I think this new RCP is not reproducible by others at the moment. We're not sure which version of the code these runs used, and we don't have access to TPUv4 at this scale, so we couldn't run it even if we had the code. The fact that it converges significantly faster than the old RCPs amplifies the issue. Is there a way you could try on TPUv3? That would be more accessible for others to reproduce.
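For context on why the faster mean matters: the rcp_checker flags submissions that converge faster than the reference runs plausibly allow, so adopting a noticeably faster RCP moves the compliance boundary for everyone. A minimal sketch of the idea using the table's numbers; the real rcp_checker.py logic also prunes outliers and interpolates across batch sizes, so this is illustrative only:

```python
import statistics

def rcp_floor(ref_samples_m, nstdevs=1.0):
    """Toy version of the RCP bound: a submission whose mean
    convergence falls below this floor gets flagged."""
    return statistics.mean(ref_samples_m) - nstdevs * statistics.stdev(ref_samples_m)

# Convergence means (millions of samples) from the table above.
reproduced = [2.68, 2.68, 2.67]  # the runs others could reproduce
print(rcp_floor(reproduced))     # ~2.67M floor implied by the reproduced runs
# The proposed RCP mean of 2.41M sits well below that floor, so checking it
# in would admit convergence speeds nobody else has yet reproduced.
```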

johntran-nv commented 2 years ago

Hi @sgpyc , are the changes in https://github.com/mlcommons/training/pull/507 included in your latest runs? We suspect input pipeline changes could explain the convergence difference, and that PR touches the input pipeline.

pgmpablo157321 commented 2 years ago

@sgpyc the v2.1 folder has been created and is now in the master branch.

sgpyc commented 2 years ago

Moved to the v2.1 folder. Please consider merging.

I did try the HP set on TPUv3-32 with TF 2.10 for 2 runs; they converged in 5768 and 5600 steps. BERT at low batch sizes seems to have larger variance in convergence, but it's quite clear this HP set does converge.
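For comparison with the samples-to-converge column in the table above, and assuming these are optimizer steps at a global batch size of 448, the two runs translate to roughly:

```python
# Convert the reported step counts to samples consumed, assuming each
# step processes one global batch of 448 sequences.
for steps in (5768, 5600):
    print(f"{steps} steps x 448 = {steps * 448:,} ({steps * 448 / 1e6:.2f}M samples)")
# -> 2,584,064 (2.58M) and 2,508,800 (2.51M): between the 2.41M TPUv4-32
#    proposal and the ~2.68M of the reproduced runs.
```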