mlcommons / cm4mlops

A collection of portable, reusable and cross-platform automation recipes (CM scripts) to make it easier to build and benchmark AI systems across diverse models, data sets, software and hardware
http://docs.mlcommons.org/cm4mlops/
Apache License 2.0

Inaccurate system information generated by CM submission tree script #364

Open rysc3 opened 1 week ago

rysc3 commented 1 week ago

There are some inaccuracies in the information that the script generates, some more important than others. For example, it sets our OS to Ubuntu, since that is the operating system inside the container, even though we are running Rocky Linux outside of the container (I figure this is not a big deal). More importantly, when I generate results using the given default cm run offline script for both base and main for SCC24:

https://docs.mlcommons.org/cm4mlperf-inference/

it ends up saying we are using 3x H100 NVL. Our system has 4x H100 NVL, and all of them are accessible. nvidia-smi yields the correct result inside the container, and at multiple steps during runtime we can see it iterate over the CUDA devices and list indices 0..3.
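For reference, this is roughly the kind of check I ran inside the container to confirm all four devices are visible (assuming PyTorch, which the reference implementation uses, is available):

```python
import torch

# Inside the container this reports 4 devices, matching nvidia-smi.
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```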

Furthermore, I'm not sure if this is the expected behavior or an error, but by default, without making any new configurations, shouldn't it be running on only a single GPU and record that as the result accordingly? I've run it manually, monitored it, and verified that it is indeed only ever using the same GPU (index 0), so I would think it should then report only a single H100 being utilized.
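By monitoring, I mean a polling loop along these lines (assuming the nvidia-ml-py package is installed; adjust as needed); on our node only index 0 ever shows non-zero utilization:

```python
import time
import pynvml  # provided by the nvidia-ml-py package

# Poll per-GPU utilization once a second; in our runs only
# index 0 ever reported non-zero utilization.
pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
while True:
    utils = [pynvml.nvmlDeviceGetUtilizationRates(
                 pynvml.nvmlDeviceGetHandleByIndex(i)).gpu
             for i in range(count)]
    print(utils)
    time.sleep(1)
```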

Either way, I figure this should be reporting 1x or 4x. The cm run scripts I'm referencing are here:

https://docs.mlcommons.org/inference/benchmarks/text_to_image/reproducibility/scc24/

You can also see the 3x H100 reported in my submissions on the leaderboard: https://docs.mlcommons.org/cm4mlperf-inference/

arjunsuresh commented 1 week ago

Hi @rysc3, yes, that's a bug and it should be fixed here.

Running on a single GPU - this is happening with the reference implementation, right? That's actually a problem with the reference implementation, and there will be points if you can make it run using all the GPUs and submit a PR to the inference repository.
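One possible direction, as a very rough sketch only (make_model and queries are hypothetical stand-ins here, not the reference implementation's actual API): replicate the model once per visible GPU and dispatch queries round-robin across the replicas.

```python
import torch

# Hypothetical sketch: one model replica per visible GPU,
# queries dispatched round-robin across the replicas.
def build_replicas(make_model):
    return [make_model().to(f"cuda:{i}")
            for i in range(torch.cuda.device_count())]

def run_queries(replicas, queries):
    results = []
    for n, q in enumerate(queries):
        dev = n % len(replicas)
        with torch.no_grad():
            results.append(replicas[dev](q.to(f"cuda:{dev}")))
    return results
```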

For the Nvidia implementation, all GPUs are expected to be used.

rysc3 commented 1 week ago

Yes, this is when using the reference implementation.