mlcommons / ck

Collective Mind (CM) is a small, modular, cross-platform and decentralized workflow automation framework with a human-friendly interface and reusable automation recipes. It makes it easier to build, run, benchmark and optimize AI, ML and other applications and systems across diverse and continuously changing models, data, software and hardware.
https://access.cKnowledge.org/challenges
Apache License 2.0

[Suggestion] MLPerf reproducibility/repeatability methodology from ACM/IEEE/NeurIPS? #1080

Open gfursin opened 7 months ago

gfursin commented 7 months ago

Following many recent discussions at MLCommons about improving the repeatability and reproducibility of MLPerf inference benchmarks, we suggest looking at similar initiatives at computer systems conferences (artifact evaluation and reproducibility initiatives) and possibly adopting their methodology and badges:

Our repeatability study for MLPerf inference v3.1 highlights repeatability issues similar to those we have already seen at compiler, systems and ML conferences:

A potential solution is to improve the repeatability of MLPerf submissions (full reproducibility is probably too costly, if not impossible, at this stage) by introducing MLPerf reproducibility badges similar to the ACM reproducibility badges:

We can evaluate results after the submission deadline and before the publication deadline, and assign badges to all results in the officially published final table. This may motivate everyone to improve the quality of their submissions and earn all such badges in the future, instead of the community discovering such issues only after the MLPerf results are published.
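For illustration only, here is a minimal sketch of how such a post-submission repeatability check could assign a badge to a single result. This is not part of any MLPerf or CM tooling; the record fields, badge names and the 5% tolerance are all hypothetical assumptions.

```python
# Hypothetical sketch: assign a repeatability badge to an MLPerf result
# by comparing the submitted score against an independent re-run.
# Field names, badge labels and the tolerance are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Result:
    submitter: str
    system: str
    model: str
    scenario: str
    score: float  # e.g. samples/second reported in the submission

def assign_badge(submitted: Result, replicated_score: float,
                 tolerance: float = 0.05) -> str:
    """Return a badge label based on how closely an independent
    re-run reproduces the submitted score."""
    if replicated_score <= 0:
        return "no-badge (re-run failed)"
    deviation = abs(replicated_score - submitted.score) / submitted.score
    if deviation <= tolerance:
        return "results-replicated"      # analogous to ACM "Results Replicated"
    return "artifact-functional-only"    # it ran, but scores did not match

# Example: a re-run within 2% of the submitted score earns the badge.
r = Result("org-x", "8xGPU-server", "resnet50", "Offline", score=41000.0)
print(assign_badge(r, replicated_score=40200.0))   # -> results-replicated
```

The evaluation itself could run in the window between the submission and publication deadlines, so the badge column appears in the final published table.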

gfursin commented 6 months ago

We have developed a prototype infrastructure to track MLPerf configurations and assign ACM badges:
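As a rough illustration of what "tracking MLPerf configurations" could mean in practice (this is not the actual prototype; the record layout below is an assumption), each submission could be stored as a small machine-readable record that a badge evaluator later re-runs and annotates:

```python
# Hypothetical sketch of a configuration record for badge tracking.
# Field names are illustrative assumptions, not the prototype's actual schema.
import json

config_record = {
    "submission_id": "inference-v3.1-example-001",
    "benchmark": "mlperf-inference",
    "model": "bert-99",
    "framework": "onnxruntime",
    "dataset": "squad-v1.1",
    "hardware": {"accelerator": "example-gpu", "count": 1},
    "software": {"os": "ubuntu-22.04", "driver": "example-driver"},
    "reported_score": 1234.5,
    "badges": [],   # filled in after post-submission evaluation
}

with open("config_record.json", "w") as f:
    json.dump(config_record, f, indent=2)
```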