Suggestions to improve CM-MLPerf inference after SCC'23

mlcommons / ck

Collective Mind (CM) is a small, modular, cross-platform and decentralized workflow automation framework with a human-friendly interface and reusable automation recipes to make it easier to build, run, benchmark and optimize AI, ML and other applications and systems across diverse and continuously changing models, data, software and hardware

https://access.cKnowledge.org/challenges

Apache License 2.0


Suggestions to improve CM-MLPerf inference after SCC'23 #1006

Closed gfursin closed 6 months ago

gfursin commented 9 months ago

While finalizing the CM-MLPerf BERT inference benchmark tutorial for SCC'23, here are a few missing things that we can do later if/when we have time and resources:

[ ] Check that CM can register large BERT models from local drive / NFS rather than downloading them
[ ] Run reference implementation (CPU & GPU) via Docker
- [ ] add support for any model
[ ] Run DeepSparse implementation via Docker
- [ ] add support for any local model

Save more info to the SCC'23 dashboard: https://wandb.ai/cmind/cm-mlperf-scc23-bert-offline/workspace

[x] Save all Python deps, including Python itself (what is the command?)
[x] Save input flags
- [x] config_misc (anything by user in dict)
- [x] precision?
- [x] model_variation (particularly Hugging Face and DeepSparse)
- [x] batch_size (particularly during cm run experiment)
[x] Collect all deps versions:
- [x] Python
- [x] ONNX, PyTorch, TensorFlow versions
- [x] DeepSparse engine version
- [x] CUDA driver/CUDA toolkit/cuDNN/TensorRT
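
On the "what is the command" question above: from a shell, `python -m pip freeze` is the usual way to list installed Python packages. As a minimal sketch of how the same information could be collected programmatically, the snippet below uses only the standard library. The function names and the package list are illustrative assumptions, not part of CM, and the distribution names may differ from what is installed on a given system.

```python
import json
import platform
from importlib import metadata

# Assumed distribution names; adjust to the packages actually in use.
PACKAGES = ("onnxruntime", "torch", "tensorflow", "deepsparse")


def collect_versions(packages=PACKAGES):
    """Return the Python version plus each package's version (None if not installed)."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def all_installed():
    """Return every installed distribution as sorted 'name==version' strings."""
    return sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata["Name"]
    )


if __name__ == "__main__":
    print(json.dumps(collect_versions(), indent=2))
```

CUDA driver, cuDNN and TensorRT versions are not covered here, since they are not Python distributions and would need to be queried separately.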

gfursin commented 9 months ago

More feedback from SCC'23 teams about CM-MLPerf automation:

[x] Provide a tutorial on how to debug CM scripts/workflows - maybe we can make a simple video tutorial?
[x] Provide a list of all main flags for the CM script to run the MLPerf inference benchmark
[ ] Describe all compatible flags and variations for the CM script to run the MLPerf inference benchmark
[ ] Provide a description of the model used (and/or a link to the original paper) and how to optimize it (quantize/prune/etc)
[ ] Improve logging in CM (particularly when CM is in interactive mode): https://github.com/mlcommons/ck/issues/1017
[ ] Visualize all dependencies in a CM-MLPerf script: https://github.com/mlcommons/ck/issues/1018

gfursin commented 9 months ago

Another idea is to stop CM just before running the command and show how we prepared the command line and which cache entries are used (with inference sources, model, dataset, loadgen, etc.). This is what the --debug flag already does, but we could mention it explicitly.
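
To illustrate the "stop just before running" idea in general terms, here is a minimal sketch of that pattern. It is not CM's implementation: `run_prepared`, its `debug` parameter and the cache-entry paths are hypothetical names chosen for this example.

```python
import shlex
import subprocess
import sys


def run_prepared(cmd, cache_entries, debug=False):
    """Print the prepared command and cache entries; execute only if debug is False."""
    print("Prepared command:", shlex.join(cmd))
    for name, path in cache_entries.items():
        print(f"  cache entry {name}: {path}")
    if debug:
        # Stop here so the user can inspect or re-run the command by hand.
        return None
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Hypothetical cache entries, for illustration only.
    entries = {"model": "/path/to/model", "dataset": "/path/to/dataset"}
    run_prepared([sys.executable, "-c", "print('benchmark')"], entries, debug=True)
```

Printing the exact command line before execution lets a user copy it and re-run it manually, which is often the quickest way to isolate a failure.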

Also, we should simplify the tutorial and only describe the CM command to run the full MLPerf inference benchmark out-of-the-box with all the flags that can customize execution - we plan to prepare a higher-level CM-MLPerf launcher to do that.

gfursin commented 9 months ago

Yet another suggestion is to make it clear that CM is not a "reproducibility" tool but a workflow automation tool that attempts to adapt benchmarks/applications to continuously changing software and hardware. When combined with Docker, we can ensure that a benchmark runs (even though this doesn't guarantee reproducibility of performance numbers, since we may not have control to pin threads to specific cores, control NUMA, set the frequency of different cores, etc). But CM users want to run MLPerf with new software/hardware even if it fails, because we can then collaboratively improve CM scripts and make them more portable and reproducible as a community effort - we should highlight it in the tutorial!
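
As an example of the kind of low-level control mentioned above, here is a minimal sketch of pinning the current process to specific cores from Python. It assumes Linux, where `os.sched_setaffinity` is available; `pin_to_cores` is a hypothetical helper, not a CM function. Inside a container the set of allowed cores may be restricted, in which case the helper returns None rather than failing.

```python
import os


def pin_to_cores(cores):
    """Pin the current process to the given cores; return the new set, or None if not possible."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # not available on this platform (e.g. macOS, Windows)
    allowed = os.sched_getaffinity(0)
    wanted = set(cores) & allowed
    if not wanted:
        return None  # none of the requested cores are available to this process
    try:
        os.sched_setaffinity(0, wanted)
    except OSError:
        return None
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    print("Pinned to:", pin_to_cores([0, 1]))
```

NUMA placement and core frequency are not handled by this call and usually need separate system tools and privileges.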

gfursin commented 6 months ago

We added support for most features from here in CM v2+. I am closing this ticket and we will open a new one with a few pending features if needed.