Closed itayhubara closed 10 months ago
Please give specs on what GPU this was run on and how (e.g., is it the reference code, and fp32?).
This was done on Gaudi2 with bf16, the PyTorch dataloader, and code similar to (but not identical to) the reference code. Please note that Nvidia made 57 runs with the reference code and achieved similar statistics.
[1720 1740 1800 1760 1820 1720 2180 3780 2020 1740 3960 1820 2640 1960 1980 2480 1820 1740 1600 1900 2120 1740 2400 1540 1620 1940 2480 1840 3200 1760 2060 1600 1760 1980 1840 2700 1940 1660 2340 1860 1900 3280 2720 2860 1920 1280 2480 2640 2060 1820 1980 1900 3760 1720 2220 2660 2420] Average: 2143.50. Mean after removing the best/worst 10%: 2054.46.
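For anyone wanting to reproduce these statistics, here is a minimal sketch. I am assuming "removing the best/worst 10%" means dropping floor(0.1 * n) runs from each end of the sorted list; with 57 runs that drops 5 per end, which reproduces the quoted numbers to within rounding.

```python
# Epochs-to-converge for the 57 runs quoted above.
epochs = [1720, 1740, 1800, 1760, 1820, 1720, 2180, 3780, 2020, 1740,
          3960, 1820, 2640, 1960, 1980, 2480, 1820, 1740, 1600, 1900,
          2120, 1740, 2400, 1540, 1620, 1940, 2480, 1840, 3200, 1760,
          2060, 1600, 1760, 1980, 1840, 2700, 1940, 1660, 2340, 1860,
          1900, 3280, 2720, 2860, 1920, 1280, 2480, 2640, 2060, 1820,
          1980, 1900, 3760, 1720, 2220, 2660, 2420]

def trimmed_mean(values, frac=0.1):
    """Mean after dropping the lowest and highest `frac` of the samples.

    Assumption: floor(frac * n) samples are removed from each end,
    i.e. floor(0.1 * 57) = 5 runs per end here.
    """
    k = int(len(values) * frac)
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)

print(f"Average: {sum(epochs) / len(epochs):.2f}")   # ~2143.5
print(f"Trimmed: {trimmed_mean(epochs):.2f}")        # ~2054.5
```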
Since RCPs are required to be generated by running the reference code with fp32, we have 3 options:
Please note that both the Habana and Nvidia results are better than the old RCP, which averaged 2300 epochs (2252 epochs after removing the best/worst 10%), meaning the Habana HPs are indeed better.
@itayhubara Is this RCP update meant for training v3.1? In that case, could you update your branch and move the changes into the training-3.1.0 folder?
To avoid setting a bad precedent, we should avoid merging any convergence points that are not derived from running the reference.
@itayhubara Can Habana create RCPs by running the reference code in FP32 and open a new PR?
Old BS56 RCP: mean 386400.0 (2300 epochs); mean after removing best/worst 10%: 378472.0 (2252.8125 epochs)
New BS56 RCP: mean 376320.0 (2240 epochs); mean after removing best/worst 10%: 342090.0 (2036.25 epochs)
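To put a number on the old-vs-new comparison above, the relative improvement of the new trimmed mean over the old one works out as follows (plain arithmetic on the quoted values, no new data):

```python
# Old vs. new BS56 RCP trimmed means, in epochs (values quoted above).
old_trimmed = 2252.8125
new_trimmed = 2036.25

# Fractional reduction in epochs-to-converge.
improvement = (old_trimmed - new_trimmed) / old_trimmed
print(f"New RCP converges {improvement:.1%} faster")  # ~9.6%
```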