Closed sgpyc closed 1 year ago
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅
Hi @sgpyc , can you confirm the exact reference code you used to generate these so others can reproduce?
Also, does this repro on TPUv3? That would be best for others to reproduce.
Hi @sgpyc , here's some data for discussion.
Codebase | HW | Average Convergence (M samples) | Reproducible? | Hparam set
---|---|---|---|---
Reference, GA4 | A100 | 2.68 | Yes | Habana v2.0
Reference, GA2 | TPUv3-8 | 2.68 | Yes | RCP
Reference, GA1 | TPUv3-8 | 2.67 | No - 17/28 runs fail to converge | Habana v2.0
New RCP Proposal | TPUv4-32 | 2.41 | No - code and HW access | Habana v2.0
As you can see, this new RCP does not appear reproducible by others at the moment: we're not sure which version of the code these runs used, and we don't have access to TPUv4 at this scale, so we couldn't run it even if we had the code. The fact that it converges significantly faster than the old RCPs in various scenarios amplifies the concern. Is there a way you could try on TPUv3? That hardware would be more accessible to others for reproducibility.
Hi @sgpyc , are the changes in https://github.com/mlcommons/training/pull/507 included in your latest runs? We believe input pipeline changes could be the difference, and that PR mentions some changes to the input pipeline.
@sgpyc the v2.1 folder has been created and is currently in the master branch.
Moved to the v2.1 folder. Please consider merging.
I did try the HP set on TPUv3-32 with TF2.10 for 2 runs. They converged in 5768 and 5600 steps. BERT at low batch sizes seems to have a larger variance in convergence, but it's quite clear this HP set does converge.
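For comparison against the table above, the two step counts can be converted into millions of training samples, the unit the RCP table appears to use. This is a minimal sketch, assuming a global batch size of 448 (from the "BS448" configuration mentioned in this PR); it is not the MLCommons RCP checker, and the helper name is hypothetical.

```python
GLOBAL_BATCH_SIZE = 448  # assumed from the "BS448" configuration in this PR

def steps_to_msamples(steps: int, batch_size: int = GLOBAL_BATCH_SIZE) -> float:
    """Convert optimizer steps into millions of samples consumed."""
    return steps * batch_size / 1e6

# Step counts reported for the two TPUv3-32 runs above.
runs = [5768, 5600]
per_run = [steps_to_msamples(s) for s in runs]
average = sum(per_run) / len(per_run)
print([round(m, 2) for m in per_run], round(average, 2))
# → [2.58, 2.51] 2.55
```

Under that batch-size assumption, the two runs land around 2.5M samples, in the same range as the proposed 2.41M RCP rather than the ~2.68M of the older RCPs.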
DO NOT MERGE into the v2.0 folder.
BS448 RCP data based on Habana's HP set from the training v2.0 round. Intended to be checked in to the v2.1 folder.