[BERT] Trying out fixes in init ckpt loading and input pipeline for large scale runs

github-actions[bot] commented 3 years ago

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

nvcforster commented 3 years ago

@sgpyc The changes appear to be orthogonal to the differences observed between running the reference at true scale and running it with GA. Since there is still an 8% difference between those code paths, the original issue does not appear to be resolved. Only minimal changes that address existing open functional issues should be made this late in the v1.1 schedule.

sgpyc commented 3 years ago

Consider this PR a WIP for finding the cause(s) of the convergence difference between at-scale and GA runs. The two potential places are 1) whether to load optimizer slots when loading the init checkpoint (around L170), and 2) a slight change in the input loader.
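To make point 1) concrete, here is a minimal sketch of the choice between loading and skipping optimizer slots at init. It is not the code in this PR: the function name `select_init_vars`, the variable names, and the slot suffixes are all hypothetical, and real slot names depend on the optimizer and framework.

```python
# Sketch only: decide which checkpoint variables to restore at init.
# SLOT_SUFFIXES is a hypothetical naming convention for optimizer slots;
# the actual names depend on the optimizer and framework in use.
SLOT_SUFFIXES = ("/adam_m", "/adam_v")


def select_init_vars(ckpt_var_names, load_optimizer_slots,
                     slot_suffixes=SLOT_SUFFIXES):
    """Return the checkpoint variable names to restore.

    With load_optimizer_slots=False, variables whose names end in one of
    slot_suffixes are skipped, so the optimizer starts from fresh state.
    """
    if load_optimizer_slots:
        return list(ckpt_var_names)
    # str.endswith accepts a tuple of suffixes.
    return [name for name in ckpt_var_names
            if not name.endswith(slot_suffixes)]


# Hypothetical checkpoint contents, for illustration only.
example_vars = [
    "bert/encoder/layer_0/kernel",
    "bert/encoder/layer_0/kernel/adam_m",
    "bert/encoder/layer_0/kernel/adam_v",
]
weights_only = select_init_vars(example_vars, load_optimizer_slots=False)
everything = select_init_vars(example_vars, load_optimizer_slots=True)
```

Running both settings from the same init checkpoint is one way to isolate whether the slots contribute to the convergence difference.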

After further tests with BS=6912, these two changes move the convergence of at-scale (768 partitions) and GA (128 partitions & GA=6) closer together, but they are still not the same. GA=6 has the largest convergence gap among GA={3, 6, 9, 18, 27}. High GA numbers, e.g. 128 partitions & GA=27 for BS=6912, still show the same convergence behavior as the RCPs and submissions at scale.
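For reference, the configurations above can be compared by their per-partition micro-batch size. This is a rough sketch that assumes global batch size = partitions × GA steps × per-partition micro-batch; that relation and the helper name `micro_batch` are assumptions for illustration, not taken from the reference code.

```python
# Sketch only: per-partition micro-batch size for a fixed global batch size.
# Assumes global_batch == partitions * ga_steps * micro_batch, which is an
# assumption about the setup rather than something read from the reference.
GLOBAL_BATCH = 6912


def micro_batch(partitions, ga_steps, global_batch=GLOBAL_BATCH):
    """Return samples per partition per accumulation step."""
    size, remainder = divmod(global_batch, partitions * ga_steps)
    if remainder:
        raise ValueError(
            f"{global_batch} is not divisible by {partitions} * {ga_steps}")
    return size


# At-scale run: 768 partitions, no accumulation (GA=1).
at_scale = micro_batch(partitions=768, ga_steps=1)

# GA runs: 128 partitions with each GA value mentioned above.
ga_configs = {ga: micro_batch(partitions=128, ga_steps=ga)
              for ga in (3, 6, 9, 18, 27)}

print(at_scale)    # 9
print(ga_configs)  # {3: 18, 6: 9, 9: 6, 18: 3, 27: 2}
```

Under this assumption, 128 partitions with GA=6 uses the same per-partition micro-batch (9) as the 768-partition run, so the two differ only in how the 768 micro-batches are spread across partitions and accumulation steps.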

I suggest not merging this PR and, for now, using 128 partitions with GA to test RCPs for large-scale, large-batch-size runs.

johntran-nv commented 2 years ago

Should we either merge or drop this PR?

mlcommons / training

[BERT] Trying out fixes in init ckpt loading and input pipeline for large scale runs #507