mlcommons / ck

Collective Mind (CM) is a small, modular, cross-platform and decentralized workflow automation framework with a human-friendly interface and reusable automation recipes to make it easier to build, run, benchmark and optimize AI, ML and other applications and systems across diverse and continuously changing models, data, software and hardware
https://access.cKnowledge.org
Apache License 2.0

CUDA version 12.4 not supported for this cm command #1243

Open EtienneMassart opened 2 months ago

EtienneMassart commented 2 months ago

Running this command from the cm playground gives an error message:

    cm run script --tags=run-mlperf,inference,_performance-only,_short \
        --division=open \
        --category=edge \
        --device=cuda \
        --model=gptj-99 \
        --precision=float32 \
        --implementation=nvidia \
        --backend=tensorrt \
        --scenario=Offline \
        --execution_mode=test \
        --power=no \
        --adr.python.version_min=3.8 \
        --clean \
        --compliance=no \
        --quiet \
        --time

It requires installing libnccl2=2.18.3, which only supports CUDA 11.0 and 12.0-2. I tried changing the version installed by ~/CM/repos/mlcommons@ck/cm-mlops/script/install-nccl-libs/ but ran into another error later:

    Building CXX object caffe2/CMakeFiles/op_registration_test.dir/__/aten/src/ATen/core/op_registration/op_registration_test.cpp.o
    ninja: build stopped: subcommand failed.

I don't know whether this is related to the change I made, and I haven't found a fix for it.
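The mismatch reported above (CUDA 12.4 installed, while the pinned libnccl2=2.18.3 is said to support only CUDA 11.0 and 12.0-2) can be checked before starting a long build. The following is a minimal sketch, not part of CM: the function names are hypothetical, the supported set is taken from this report with "12.0-2" read as 12.0 through 12.2, and the sample string only imitates the release line that `nvcc --version` prints.

```python
import re

# Assumption: the CUDA releases this issue says libnccl2=2.18.3 supports,
# reading "12.0-2" as 12.0 through 12.2. Not taken from NVIDIA's documentation.
SUPPORTED_CUDA = {(11, 0), (12, 0), (12, 1), (12, 2)}


def parse_cuda_version(nvcc_output):
    """Return (major, minor) found in `nvcc --version` text, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_supported(version, supported=SUPPORTED_CUDA):
    """True if the (major, minor) pair is in the supported set."""
    return version in supported


# Illustrative input only; on a real machine this text would come from
# running `nvcc --version`.
sample = "Cuda compilation tools, release 12.4, V12.4.131"
version = parse_cuda_version(sample)
print(version, "supported" if is_supported(version) else "not supported")
```

With the sample line, the check reports CUDA 12.4 as not supported, which matches the situation described in this issue.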

arjunsuresh commented 1 month ago

Hi @EtienneMassart, have you managed to solve this? CUDA 12.4 is now supported in CM. Please follow this documentation for MLPerf inference, which supports the Nvidia and Reference implementations.