nv-jinhosuh opened this issue 9 months ago
@ashwin @nvzhihanj @arjunsuresh @psyhtest @ckstanton @pgmpablo157321 FYI
Thank you @nv-jinhosuh for bringing up this discussion. It is clear why we need equal issue mode. Other than the longer runtime -- which is not terrible, since the accuracy run takes the same time anyway -- we haven't seen any issue with it. But someone not familiar with equal issue mode can be surprised to see strange min_query_count values in the mlperf_log. I believe all we need is good documentation and a rule for equal issue mode so that everyone is aware of it.
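To illustrate where those "strange" values come from, here is a simplified sketch of the rounding as I understand it (not the exact LoadGen code; `performance_sample_count` here stands for the number of samples loaded into the performance sample set):

```python
import math

def effective_min_queries(configured_min_query_count: int,
                          performance_sample_count: int) -> int:
    """Simplified illustration: under equal issue mode, the issued query
    count is rounded up to a whole number of passes over the loaded
    sample set, so every sample is issued the same number of times."""
    passes = math.ceil(configured_min_query_count / performance_sample_count)
    return passes * performance_sample_count

# e.g. a configured minimum of 24576 queries over a 13368-sample set
# becomes 2 full passes = 26736 issued queries.
print(effective_min_queries(24576, 13368))  # 26736
```

So the min_query_count recorded in the log can legitimately differ from what was configured, which is exactly the kind of thing the documentation should spell out.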
We were using max_duration for SingleStream runs, and equal issue mode means this can no longer be done. This is not a major issue, as we cannot use max_duration for the Offline scenario anyway.
I would like to add another related point here for discussion - can submitters override the min_query_count parameter? It would be good to document which parameters submitters can override and which need to be controlled solely by loadgen.
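For example, the documentation could be backed by a simple guard in the harness. This is a hypothetical sketch only -- `SUBMITTER_OVERRIDABLE`, `LOADGEN_CONTROLLED`, and `apply_user_overrides` are made-up names, and the actual split between overridable and fixed parameters is precisely what needs to be agreed on:

```python
# Hypothetical split; the real lists are what this discussion should decide.
SUBMITTER_OVERRIDABLE = {"target_qps", "target_latency", "max_duration"}
LOADGEN_CONTROLLED = {"min_query_count", "min_duration", "sample_concatenate_permutation"}

def apply_user_overrides(settings: dict, user_conf: dict) -> dict:
    """Apply only the overrides a submitter is allowed to make."""
    merged = dict(settings)
    for key, value in user_conf.items():
        if key not in SUBMITTER_OVERRIDABLE:
            raise ValueError(f"'{key}' must be controlled by loadgen, not user.conf")
        merged[key] = value
    return merged
```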
There was a discussion on how to make Early Stopping more user friendly in https://github.com/mlcommons/inference/issues/1095. That issue was closed without being turned into actual policy and implementation, though. And in order to get there, we need opinions from a statistics expert like @ckstanton.
Today, we use Equal Issue mode for a couple of different reasons:
In short, I think equal issue mode should be enabled for all scenarios if the benchmark handles non-uniform workload samples; it keeps the metrics from suffering high variance across random seeds without requiring the test to run extensively long. We would need extensive discussions on this matter, especially in connection with Early Stopping. We may want to revisit the above issue https://github.com/mlcommons/inference/issues/1095 and discuss how to make ES more user friendly as well.
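As a toy illustration of the variance point (a sketch with made-up per-sample costs, not LoadGen's actual scheduling): with non-uniform samples, drawing queries at random makes the measured mean cost depend on the seed, while issuing whole shuffled passes over the sample set pins it to the true mean of the set.

```python
import random
import statistics

# Toy non-uniform workload: per-sample "cost" (e.g. sequence length).
costs = [1, 2, 3, 5, 8, 13, 21, 34]
num_queries = 24  # 3 full passes over the 8-sample set

def random_issue(seed: int) -> float:
    rng = random.Random(seed)
    return statistics.mean(rng.choice(costs) for _ in range(num_queries))

def equal_issue(seed: int) -> float:
    rng = random.Random(seed)
    issued = []
    while len(issued) < num_queries:
        perm = costs[:]
        rng.shuffle(perm)      # concatenate shuffled passes over the set
        issued.extend(perm)
    return statistics.mean(issued[:num_queries])

random_means = [random_issue(s) for s in range(100)]
equal_means = [equal_issue(s) for s in range(100)]
print(statistics.pstdev(random_means))  # noticeably > 0: seed-dependent
print(statistics.pstdev(equal_means))   # 0.0: every run sees each sample 3x
```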
There is a concern that Equal Issue mode forces users to run the test for a very long time. We also want to attack this problem, but in a way that keeps the metrics legitimately capturing the behavior of the networks on the input datasets; this may involve separate discussions, such as reducing the input sample set size.
FWIW There's also a concern about Early Stopping on Token Latency: https://github.com/mlcommons/inference/pull/1596#issuecomment-1920237469