mlcommons / logging

MLPerf™ logging library
https://mlcommons.org/en/groups/best-practices-benchmark-infra
Apache License 2.0

updating unet3d rcp for bs 56 using habana hp #329

Closed itayhubara closed 10 months ago

itayhubara commented 10 months ago

Old BS56 RCP: mean 386400.0 (2300 epochs); mean after removing best/worst 10%: 378472 (2252.8125 epochs)

New BS56 RCP: mean 376320.0 (2240 epochs); mean after removing best/worst 10%: 342090.0 (2036.25 epochs)
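
The raw figures and the epoch counts in parentheses appear to differ by a constant factor. A minimal Python sketch of that conversion, assuming every epoch covers the same number of raw units (the factor 168 is inferred from 386400 / 2300 and is not stated in this thread; `units_per_epoch` and `to_epochs` are illustrative names):

```python
# Raw RCP means and the epoch counts quoted in the PR description.
OLD_MEAN_RAW, OLD_MEAN_EPOCHS = 386400.0, 2300
NEW_MEAN_RAW = 376320.0
OLD_TRIMMED_RAW = 378472.0
NEW_TRIMMED_RAW = 342090.0

# Assumption: a fixed number of raw units per epoch, inferred from the
# old RCP mean rather than taken from any documented constant.
units_per_epoch = OLD_MEAN_RAW / OLD_MEAN_EPOCHS


def to_epochs(raw: float) -> float:
    """Convert a raw convergence figure to epochs under the assumption above."""
    return raw / units_per_epoch


new_mean_epochs = to_epochs(NEW_MEAN_RAW)
old_trimmed_epochs = to_epochs(OLD_TRIMMED_RAW)
new_trimmed_epochs = to_epochs(NEW_TRIMMED_RAW)

print(f"units per epoch:       {units_per_epoch:g}")
print(f"new mean (epochs):     {new_mean_epochs:g}")
print(f"old trimmed (epochs):  {old_trimmed_epochs:.4f}")
print(f"new trimmed (epochs):  {new_trimmed_epochs:g}")
```

Under this assumption the new figures reproduce the quoted 2240 and 2036.25 epochs exactly, while the old trimmed figure lands within rounding of the quoted 2252.8125, which suggests the raw value 378472 was truncated.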

github-actions[bot] commented 10 months ago

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

erichan1 commented 10 months ago

Please give specs on what GPU this was run on and how (e.g., is it the reference code, and is it fp32?)

itayhubara commented 10 months ago

This was done on Gaudi2, with bf16, the PyTorch dataloader, and code similar to (but not identical to) the reference code. Please note that Nvidia made 57 runs with the reference code and achieved similar statistics.

[1720 1740 1800 1760 1820 1720 2180 3780 2020 1740 3960 1820 2640 1960 1980 2480 1820 1740 1600 1900 2120 1740 2400 1540 1620 1940 2480 1840 3200 1760 2060 1600 1760 1980 1840 2700 1940 1660 2340 1860 1900 3280 2720 2860 1920 1280 2480 2640 2060 1820 1980 1900 3760 1720 2220 2660 2420]

Average: 2143.50. Mean after removing the best/worst 10%: 2054.46.
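
The two statistics quoted for the 57 runs can be reproduced with a short Python sketch. The trimming rule is an assumption: dropping `floor(0.10 * n)` runs from each end (5 of 57) matches the quoted figure to within rounding, but the thread does not say which rule was used, and `trimmed_mean` is an illustrative helper rather than a function from the logging library.

```python
import math
from statistics import mean

# Epochs to converge for the 57 runs listed in the comment above.
runs = [
    1720, 1740, 1800, 1760, 1820, 1720, 2180, 3780, 2020, 1740,
    3960, 1820, 2640, 1960, 1980, 2480, 1820, 1740, 1600, 1900,
    2120, 1740, 2400, 1540, 1620, 1940, 2480, 1840, 3200, 1760,
    2060, 1600, 1760, 1980, 1840, 2700, 1940, 1660, 2340, 1860,
    1900, 3280, 2720, 2860, 1920, 1280, 2480, 2640, 2060, 1820,
    1980, 1900, 3760, 1720, 2220, 2660, 2420,
]


def trimmed_mean(values, fraction=0.10):
    """Mean after dropping floor(fraction * n) values from each end.

    Assumption: this trimming rule is a guess that happens to match
    the figure quoted in the thread.
    """
    k = math.floor(len(values) * fraction)
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)


average = mean(runs)
trimmed = trimmed_mean(runs)

print(f"runs:         {len(runs)}")
print(f"average:      {average:.4f}")
print(f"trimmed mean: {trimmed:.4f}")
```

Both results agree with the quoted 2143.50 and 2054.46 to within 0.01, so the quoted values look truncated rather than rounded.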

Since an RCP requires running the reference code in fp32, we have 3 options:

  1. Finish the 57 runs - if Nvidia can do that it would be great
  2. Accept the current PR based on the information above.
  3. Reject the PR and keep the old RCP

Please note that both the Habana results and the Nvidia results are better than the old RCP, which averaged 2300 epochs, or 2252 epochs after removing the best/worst 10% (meaning the Habana HPs are indeed better).
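
As a rough illustration of how much better, the sketch below computes the relative reduction in epochs to converge against the old RCP, using only the figures quoted in this thread. The dictionary keys and the `reduction_pct` helper are illustrative names, and the comparison mixes a bf16 non-reference setup with reference-code runs, so it is indicative only.

```python
# Epochs to converge, as quoted in the thread.
OLD_RCP = {"mean": 2300.0, "trimmed": 2252.8125}
CANDIDATES = {
    "Habana (Gaudi2, bf16)": {"mean": 2240.0, "trimmed": 2036.25},
    "Nvidia (57 runs)": {"mean": 2143.50, "trimmed": 2054.46},
}


def reduction_pct(old: float, new: float) -> float:
    """Relative reduction of `new` versus `old`, in percent."""
    return 100.0 * (old - new) / old


reductions = {
    name: {stat: reduction_pct(OLD_RCP[stat], value) for stat, value in stats.items()}
    for name, stats in CANDIDATES.items()
}

for name, stats in reductions.items():
    print(f"{name}: mean -{stats['mean']:.1f}%, trimmed -{stats['trimmed']:.1f}%")
```

A positive percentage means fewer epochs than the old RCP; all four values come out positive for the numbers above.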

pgmpablo157321 commented 10 months ago

@itayhubara Is this RCP update meant for training v3.1? In that case could you update your branch and move the changes into the training-3.1.0 folder?

nv-rborkar commented 10 months ago

To avoid setting a bad precedent, we should avoid merging any convergence points that are not derived from running the reference code.

nv-rborkar commented 9 months ago

@itayhubara Can Habana create RCPs by running the reference code in FP32 and open a new PR?