Open sgpyc opened 3 years ago
@sgpyc The changes appear to be orthogonal to the differences observed with running the reference at true scale vs with GA. Considering there is still an 8% difference between those code paths, it does not seem that the original issue is resolved. Only minimal changes that address existing open functional issues should be made this late in the v1.1 schedule.
Consider this PR a WIP for finding the cause(s) of the convergence difference between at-scale runs and GA runs. The two potential places are 1) whether to load optimizer slots when loading the init checkpoint (around L170), and 2) a slight change in the input loader.
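To make point 1) concrete, here is a minimal sketch of what "load or skip optimizer slots" could look like when choosing which checkpoint variables to restore. This is not the code in this PR: `select_init_vars`, `SLOT_SUFFIXES`, and the `/adam_m` and `/adam_v` suffixes are assumptions for illustration, and the real slot names depend on the optimizer the reference uses.

```python
# Hypothetical sketch: decide which init-checkpoint variables to restore.
# The slot suffixes below are an assumed naming convention, not taken from this PR.
SLOT_SUFFIXES = ("/adam_m", "/adam_v")


def select_init_vars(checkpoint_var_names, load_optimizer_slots):
    """Return the checkpoint variable names that should be restored.

    Model weights are always kept. Optimizer slot variables are kept only
    when load_optimizer_slots is True.
    """
    selected = []
    for name in checkpoint_var_names:
        is_slot = name.endswith(SLOT_SUFFIXES)
        if is_slot and not load_optimizer_slots:
            continue  # skip optimizer state, start the slots from scratch
        selected.append(name)
    return selected


# Example variable names are made up for illustration.
names = [
    "bert/embeddings/word_embeddings",
    "bert/embeddings/word_embeddings/adam_m",
    "bert/embeddings/word_embeddings/adam_v",
]
weights_only = select_init_vars(names, load_optimizer_slots=False)
weights_and_slots = select_init_vars(names, load_optimizer_slots=True)
```

If the at-scale and GA code paths disagree on this flag, they start from different optimizer state even when the weights are identical, which is why it is one of the two suspects.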
After further tests, these two changes bring the convergence of the at-scale run (768 partitions) and the GA run (128 partitions, GA=6) closer together at BS=6912, but they still do not match. Among GA={3, 6, 9, 18, 27}, GA=6 shows the biggest convergence gap. High GA values, e.g. 128 partitions with GA=27 at BS=6912, still show the same convergence behavior as the RCPs and the submissions at scale.
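For reference, the per-partition batch sizes implied by these configurations can be worked out as below. This is a sketch that assumes the global batch splits evenly as partitions × GA steps × per-partition micro-batch; `per_partition_batch` is a hypothetical helper, not a function from the reference.

```python
def per_partition_batch(global_batch, partitions, ga_steps=1):
    """Micro-batch per partition per accumulation step.

    Assumes global_batch == partitions * ga_steps * micro_batch.
    """
    denom = partitions * ga_steps
    if global_batch % denom != 0:
        raise ValueError("global batch does not divide evenly")
    return global_batch // denom


BS = 6912

# At scale: 768 partitions, no gradient accumulation.
at_scale = per_partition_batch(BS, partitions=768)  # 9

# GA runs: 128 partitions with the GA values tested above.
ga_runs = {ga: per_partition_batch(BS, partitions=128, ga_steps=ga)
           for ga in (3, 6, 9, 18, 27)}
# {3: 18, 6: 9, 9: 6, 18: 3, 27: 2}
```

Under that assumption, GA=6 on 128 partitions uses the same per-partition micro-batch (9) as the 768-partition run, so it is the most direct comparison to the at-scale configuration.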
I suggest not merging this and, for now, using 128 partitions with GA to test RCP for large-scale, large-batch-size runs.
Should we either merge or drop this PR?