mlcommons / ck

Collective Mind (CM) is a small, modular, cross-platform and decentralized workflow automation framework with a human-friendly interface and reusable automation recipes. It makes it easier to build, run, benchmark and optimize AI, ML and other applications and systems across diverse and continuously changing models, data, software and hardware.
https://access.cKnowledge.org/challenges
Apache License 2.0

Automation and reproducibility for MLPerf Inference v3.1 #1052

Closed: arjunsuresh closed this issue 5 months ago

arjunsuresh commented 7 months ago

The MLCommons taskforce on automation and reproducibility is helping the community, vendors and submitters check whether MLPerf inference v3.1 submissions can be re-run, fix any issues encountered, and add their implementations to the MLCommons CM automation so that all MLPerf benchmark implementations can be run in a unified way.

Note that CM is a collaborative project to run all MLPerf inference benchmarks on any platform with any software/hardware stack using a unified interface. The MLCommons CM interface and automations are being developed based on feedback from MLPerf users and submitters. If you encounter any issues or have suggestions and feature requests, please report them via GitHub issues, via our Discord channel, or by providing a patch to the CM automations. Thank you, and we look forward to collaborating with you!

Status per model (implementations covered: Reference on Intel, Nvidia, AMD and ARM; Nvidia CUDA; Intel QAIC; DeepSparse on Intel, ARM and AMD; Google TPU):

| Model | Status / notes |
| --- | --- |
| ResNet50 | ✅ via CM (see note 2) |
| RetinaNet | ✅ via CM (see note 2) |
| Bert | |
| 3d-Unet | |
| RNNT | |
| DLRMv2 | |
| GPT-J | TBD |
| Stable Diffusion | |
| Llama2 | Added to CM; looking for volunteers to test it |

1. The original Docker container fails because of an incompatibility with the latest pip packages: see the GitHub issue. We collaborated with Intel to integrate their patch into the CM automation and re-run their submissions; this is mostly done.
2. ❌ It was not possible to rerun and reproduce the performance numbers due to missing configuration files: see the GitHub issue. After discussing this issue with the submitters, we helped them generate the missing configuration files using the MLCommons CM automation for QAIC and matched the QAIC performance numbers from the v3.1 submission. It should be possible to use CM for QAIC MLPerf v4.0 inference submissions.

MLCommons CM interface

You should be able to run MLPerf inference benchmarks via the unified CM interface and a portable workflow that can run natively or inside an automatically generated Docker container:

pip install cmind
cm pull repo mlcommons@ck
cmr "run common mlperf inference" --implementation=nvidia --model=bert-99

Prepare an official submission for the Edge category:

cmr "run common mlperf inference _submission _full" --implementation=nvidia --model=bert-99

Prepare an official submission for the Datacenter category:

cmr "run common mlperf inference _submission _full" --implementation=nvidia \
--model=bert-99 --category=datacenter --division=closed
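
Before launching the full submission run, it can help to first do a shorter performance-finding pass. A minimal sketch, assuming the _find-performance variation of the CM MLPerf automation:

cmr "run common mlperf inference _find-performance" --implementation=nvidia \
    --model=bert-99 --category=datacenter --division=closed
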
gfursin commented 7 months ago

We got feedback from submitters to create a GUI that can generate CM commands. I opened a ticket: https://github.com/mlcommons/ck/issues/1070

gfursin commented 7 months ago

[20240130] We received lots of great feedback and improved both the generic CM automation recipes and the CM workflows for MLPerf inference.

gfursin commented 7 months ago

We started discussing a proposal for MLPerf reproducibility badges similar to those used at ACM/IEEE/NeurIPS conferences: https://github.com/mlcommons/ck/issues/1080 - feedback is welcome!

gfursin commented 6 months ago

Following feedback from MLPerf submitters, we have developed a prototype GUI that generates the command line to run MLPerf inference benchmarks for all main implementations (reference, Intel, Nvidia, Qualcomm, MIL and DeepSparse) and to automate submissions. You can check it here. The long-term goal is to aggregate and encode all MLPerf submission rules and notes for all models, categories and divisions in this GUI.

We have also developed a prototype of a reproducibility infrastructure to keep track of successful MLPerf inference benchmark configurations across different MLPerf versions, hardware, implementations, models and backends, based on the ACM/IEEE/cTuning reproducibility methodology and badging. You can see the latest results here; we will continue adding more tests based on your suggestions, including GPT-J, LLAMA2 and Stable Diffusion.

Our goal is to test as many v4.0 submissions as possible and add them to the above GUI to make it easier for the community to rerun experiments after the publication date. If some configurations are not working, we plan to help submitters fix issues.

gfursin commented 6 months ago

We improved the CM automation for Intel, Nvidia and Qualcomm and added them to the GUI: https://access.cknowledge.org/playground/?action=howtorun&bench_uid=39877bb63fb54725. We can re-run most of these submissions now.

gfursin commented 5 months ago

We now have a relatively stable common CM interface to rerun the above submissions and reproduce the key results. I am closing this ticket; we will open a similar one for inference v4.0 after publication. Huge thanks to our colleagues from Intel, Qualcomm and Nvidia for their help and suggestions!