Running this command from the cm playground gives an error message:
cm run script --tags=run-mlperf,inference,_performance-only,_short \
--division=open \
--category=edge \
--device=cuda \
--model=gptj-99 \
--precision=float32 \
--implementation=nvidia \
--backend=tensorrt \
--scenario=Offline \
--execution_mode=test \
--power=no \
--adr.python.version_min=3.8 \
--clean \
--compliance=no \
--quiet \
--time
It requires installing libnccl2=2.18.3, which only supports CUDA 11.0 and 12.0-12.2. I tried changing the version installed by ~/CM/repos/mlcommons@ck/cm-mlops/script/install-nccl-libs/ but ran into another error later:
Building CXX object caffe2/CMakeFiles/op_registration_test.dir/__/aten/src/ATen/core/op_registration/op_registration_test.cpp.o
ninja: build stopped: subcommand failed.
I don't know whether this is related to the change I made, and I haven't found a fix for it.
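A quick way to confirm the mismatch before starting a long build is to compare the local CUDA release against the range quoted above for libnccl2=2.18.3. The following is only a minimal sketch: the function names are hypothetical, and the supported set is taken from this report (reading "12.0-2" as 12.0 through 12.2), not verified against NVIDIA's package index.

```python
import re

# Assumption: CUDA releases supported by libnccl2=2.18.3, as stated in the
# report above (11.0 and 12.0 through 12.2). Not verified independently.
SUPPORTED_CUDA = {(11, 0), (12, 0), (12, 1), (12, 2)}


def parse_cuda_version(nvcc_output):
    """Extract (major, minor) from `nvcc --version` style text.

    The text is expected to contain a fragment such as 'release 12.4'.
    """
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no CUDA release found in input")
    return int(match.group(1)), int(match.group(2))


def nccl_2_18_3_supports(nvcc_output):
    """Return True if the reported CUDA release is in the assumed supported set."""
    return parse_cuda_version(nvcc_output) in SUPPORTED_CUDA
```

To use it, pass in the text printed by `nvcc --version` (for example captured with `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`). A result of False suggests that the pinned NCCL package and the installed toolkit do not match.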
Hi @EtienneMassart, have you managed to solve this? CUDA 12.4 is now supported in CM. Please follow this documentation for MLPerf inference, which supports the Nvidia and reference implementations.