mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks
https://mlcommons.org/en/groups/inference
Apache License 2.0

Clarification regarding how the accuracy.txt file is generated #1861

Open arjunsuresh opened 2 months ago

arjunsuresh commented 2 months ago

The submission generation rules for inference say that the accuracy.txt file should be generated from the accuracy scripts. My interpretation is that one should run the reference accuracy scripts standalone, using the logs from the accuracy run, to obtain this accuracy.txt file, rather than dumping the accuracy.txt file from within the implementation code. Is this the correct interpretation?

```
accuracy.txt # stdout of reference accuracy scripts
```
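
For example, for the image classification benchmark I would expect something like the following to produce the file (a minimal sketch; the script path, log path and flags are illustrative and differ per benchmark):

```python
import subprocess

# Run the reference accuracy script standalone on the logs from the accuracy
# run and capture its stdout as accuracy.txt. Paths here are illustrative.
cmd = [
    "python3",
    "vision/classification_and_detection/tools/accuracy-imagenet.py",
    "--mlperf-accuracy-file", "mlperf_log_accuracy.json",
    "--imagenet-val-file", "val_map.txt",
]
with open("accuracy.txt", "w") as out:
    subprocess.run(cmd, stdout=out, check=True)
```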
arjunsuresh commented 2 months ago

@psyhtest @ashwin @attafosu Can you please confirm?

attafosu commented 1 month ago

@arjunsuresh Yes, that's correct.

psyhtest commented 1 month ago

I can think of a situation where an implementer refactors/integrates a reference script into their own script. For example, the reference script may hardcode /usr/bin/python3, while they may want to use /usr/local/bin/python3.8. In this case, we can probably request that no material changes be made during such refactoring/integration, but not that the reference script must always be run standalone?
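
A minimal sketch of such an integration (the script path and flags are assumptions; the point is that the reference script itself runs unmodified):

```python
import subprocess

# The reference script may hardcode /usr/bin/python3 in its shebang. Instead
# of editing the script (a material change), the harness launches the
# unmodified script under the interpreter the submitter actually needs.
INTERPRETER = "/usr/local/bin/python3.8"  # submitter's interpreter choice
SCRIPT = "tools/accuracy-imagenet.py"     # unmodified reference script (assumed path)

with open("accuracy.txt", "w") as out:
    subprocess.run(
        [INTERPRETER, SCRIPT,
         "--mlperf-accuracy-file", "mlperf_log_accuracy.json",
         "--imagenet-val-file", "val_map.txt"],
        stdout=out,
        check=True,
    )
```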

arjunsuresh commented 1 month ago

Thank you @attafosu @psyhtest

@psyhtest yes, running the reference accuracy script standalone is fine, I believe. But this is not always straightforward, as it often requires the original dataset, and so we do have some submissions where accuracy.txt is generated from the benchmark run itself without calling the reference script. We didn't see any accuracy issues when running the standalone script for those submissions, but I believe this should not be allowed.

psyhtest commented 1 month ago

@arjunsuresh

But you admit that in some cases it may not be straightforward:

yes, running the reference accuracy script standalone is fine, I believe. But this is not always straightforward

So why would we disallow it in such cases?

arjunsuresh commented 1 month ago

@psyhtest I'm not saying we should disallow running the reference accuracy script in a custom way, say, from within another Python file. But I don't think it is right to allow generating the accuracy.txt file by mimicking the actions of the reference script, because that becomes hard for other people to verify.
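
Something like this sketch (paths assumed) would still be the reference script doing the work, just invoked from another Python file, so it remains verifiable:

```python
import runpy
import sys
from contextlib import redirect_stdout

# Run the unmodified reference accuracy script from inside another Python
# file: only the invocation differs, the reference code executes as-is.
sys.argv = [
    "accuracy-imagenet.py",
    "--mlperf-accuracy-file", "mlperf_log_accuracy.json",
    "--imagenet-val-file", "val_map.txt",
]
with open("accuracy.txt", "w") as out, redirect_stdout(out):
    runpy.run_path("tools/accuracy-imagenet.py", run_name="__main__")
```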

We face this issue specifically when automating DLRMv2 submissions, where generating the accuracy.txt file requires the Criteo day 23 dataset, which cannot be downloaded non-interactively. But if we are allowed to generate the accuracy.txt file from within the benchmark implementation, we possibly do not need this dataset file at all.

mrmhodak commented 3 days ago

@arjunsuresh to work on this