Open bitfort opened 3 years ago
SWG:
"Touching the data before clock start" is generally prohibited - though some metadata (such as a number of examples, possible shapes in the data, etc.) can be used in the compilation process.
Generally, "JIT" ("just-in-time") compilation is on the clock.
We believe that the benchmarks are smaller than the actual workloads run on some of these systems. Therefore, we don't want the overhead to dominate the measurement in a way that is unrealistic.
You can exceed the current 20-minute compile time limit; the excess is simply added to your score.
Basic tension: encourage optimization for user experience, but ensure that benchmark measurement isn't dominated by non-representative overheads.
A lot of what we currently call "compile time" is graph building and bring-up of large systems -- which isn't actually "compiling" in the traditional sense.
We also want to enable diverse architectures in MLPerf, and we want new submitters with new systems to be able to submit.
If you can pre-compile and then cache (such as to disk) and then load (such as from disk) before the run, then this pre-compilation would not be limited in time.
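As a rough illustration of the pre-compile, cache, and load flow described above, here is a minimal Python sketch. Everything in it is an assumption for illustration only: `expensive_compile`, `precompile`, `load_compiled`, the pickle-based cache, and the config keys are hypothetical and are not part of any MLPerf reference code. The cache key is derived only from metadata (shapes, number of examples), not from the training data itself.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def expensive_compile(config):
    # Hypothetical stand-in for graph building / compilation.
    # It only looks at metadata, never at the training examples.
    return {"plan": sorted(config.items())}


def cache_path(cache_dir, config):
    # Key the cache entry on the metadata so a changed config recompiles.
    key = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()
    return Path(cache_dir) / f"{key}.pkl"


def precompile(cache_dir, config):
    # Offline step: compile and write the artifact to disk.
    path = cache_path(cache_dir, config)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(expensive_compile(config), f)
    return path


def load_compiled(cache_dir, config):
    # Step done before the run: read the cached artifact back from disk.
    with open(cache_path(cache_dir, config), "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    cache_dir = tempfile.mkdtemp()
    config = {"batch_size": 32, "num_examples": 1000}
    precompile(cache_dir, config)             # offline, not time-limited
    print(load_compiled(cache_dir, config))   # loaded before the run
```

In a real system the cached artifact would be whatever the framework's compiler emits; pickle is used here only to keep the sketch self-contained.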
SWG:
Historically, we arrived at the 20-minute compile time limit based on: (a) actual data from submitters at the time, (b) the desire to encourage SW/HW makers to improve this user experience through optimization, and (c) avoiding gaming of this time to do things most users wouldn't do in practice.
Historically, we time benchmarks using a "clock start" and a "clock stop" event. The delta between these two events is your MLPerf score. We have been cautious to avoid "clock pausing" or "discounting time" because we want to capture time-to-train including all parts of training (JIT, compiling, forward passes, backward passes, evaluation).
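The start/stop timing model can be sketched in a few lines of Python. This is an illustrative assumption, not the MLPerf logging implementation: the `BenchmarkClock` class and its methods are hypothetical names. The point it tries to show is that the clock offers exactly one start and one stop, with no pause or discount operation, so everything between the two events counts toward the score.

```python
import time


class BenchmarkClock:
    """Hypothetical single-shot clock: one start, one stop, no pausing."""

    def __init__(self, now=time.monotonic):
        self._now = now      # injectable time source, useful for testing
        self._start = None
        self._stop = None

    def start(self):
        if self._start is not None:
            raise RuntimeError("clock already started; restarting is not allowed")
        self._start = self._now()

    def stop(self):
        if self._start is None:
            raise RuntimeError("clock was never started")
        if self._stop is not None:
            raise RuntimeError("clock already stopped")
        self._stop = self._now()

    @property
    def score(self):
        # The score is simply the delta between the two events.
        if self._start is None or self._stop is None:
            raise RuntimeError("score is only defined after start and stop")
        return self._stop - self._start


if __name__ == "__main__":
    clock = BenchmarkClock()
    clock.start()
    time.sleep(0.01)  # stands in for JIT, forward/backward passes, evaluation
    clock.stop()
    print(f"score: {clock.score:.3f} s")
```

Deliberately leaving out a `pause` method mirrors the caution above: any work done after clock start, including JIT compilation, lands in the measured interval.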