mlcommons / logging

MLPerf™ logging library
https://mlcommons.org/en/groups/best-practices-benchmark-infra
Apache License 2.0

Correctly implement lower convex envelope in RCP pruning logic #368

Open nv-rborkar opened 2 months ago

nv-rborkar commented 2 months ago

As the comment at https://github.com/mlcommons/logging/blob/master/mlperf_logging/rcp_checker/rcp_checker.py#L246 says, the loop does not correctly implement the "lower convex envelope" specified in the original RCP specification. There is an off-by-one error: if a point X needs to be pruned because it is greater than the interpolation of X-1 and X+1, then the point to its left (X-1) needs to be retested against the interpolation of points X-2 and X+1. The increment at line 256 should be in an else clause. Because of this bug, bad RCPs are not pruned, which leads to submissions being either unfairly rejected or unfairly "scaled" when their global batch size is near the bad RCP point.
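
The re-test described above can be sketched as follows. This is a minimal illustration and not the code in `rcp_checker.py`: the names `interpolate` and `prune_lower_convex_envelope`, the `(batch_size, epochs)` tuple layout, and the use of plain linear interpolation are all assumptions made for the example.

```python
def interpolate(x, x0, y0, x1, y1):
    """Linear interpolation of the segment (x0, y0)-(x1, y1) at x (assumed form)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def prune_lower_convex_envelope(points):
    """Return the points that lie on the lower convex envelope.

    `points` is a hypothetical list of (batch_size, epochs) tuples with
    distinct batch sizes; the real RCP records are structured differently.
    """
    pts = sorted(points)
    i = 1
    while i < len(pts) - 1:
        (x0, y0), (x, y), (x1, y1) = pts[i - 1], pts[i], pts[i + 1]
        if y > interpolate(x, x0, y0, x1, y1):
            # Point X lies above the segment joining its neighbours: prune it.
            del pts[i]
            # Step back so the old X-1 is re-tested against X-2 and X+1.
            if i > 1:
                i -= 1
        else:
            # Advance only when nothing was pruned.
            i += 1
    return pts


# Made-up data: (2048, 12) passes its first test, but must be pruned
# once (4096, 18) has been removed.
rcps = [(1024, 10), (2048, 12), (4096, 18), (8192, 14)]
print(prune_lower_convex_envelope(rcps))  # [(1024, 10), (8192, 14)]
```

With an unconditional increment, the same input would keep `(2048, 12)`, because that point is only ever compared against `(1024, 10)` and `(4096, 18)` and is never revisited after `(4096, 18)` is pruned.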