Open Agalakdak opened 3 months ago
I will donate 10$
Which test? Which errors?
I encountered so many errors that I don't know where to start.
For example, now I followed this document https://docs.mlcommons.org/inference/install/
Then I decided to go straight here https://docs.mlcommons.org/inference/benchmarks/medical_imaging/3d-unet/
And when I entered
cm run script --tags=install,python-venv --name=mlperf
I got the message CM error: automation script not found!
I found this error today; up to this point the instructions worked. The log is below.
user@user:~$ mkdir test
user@user:~$ cd test/
user@user:~/test$ python3 -m venv cm
user@user:~/test$ source cm/bin/activate
(cm) user@user:~/test$ pip install cm4mlops
Collecting cm4mlops
  Using cached cm4mlops-0.2-py3-none-any.whl
Collecting cmind
  Using cached cmind-2.3.5.tar.gz (63 kB)
  Preparing metadata (setup.py) ... done
Collecting pyyaml
  Using cached PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (751 kB)
Collecting requests
  Using cached requests-2.32.3-py3-none-any.whl (64 kB)
Collecting giturlparse
  Using cached giturlparse-0.12.0-py2.py3-none-any.whl (15 kB)
Collecting setuptools>=60
  Using cached setuptools-74.0.0-py3-none-any.whl (1.3 MB)
Collecting wheel
  Using cached wheel-0.44.0-py3-none-any.whl (67 kB)
Collecting charset-normalizer<4,>=2
  Using cached charset_normalizer-3.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (142 kB)
Collecting idna<4,>=2.5
  Using cached idna-3.8-py3-none-any.whl (66 kB)
Collecting certifi>=2017.4.17
  Downloading certifi-2024.8.30-py3-none-any.whl (167 kB)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 167.3/167.3 KB 1.3 MB/s eta 0:00:00
Collecting urllib3<3,>=1.21.1
  Using cached urllib3-2.2.2-py3-none-any.whl (121 kB)
Using legacy 'setup.py install' for cmind, since package 'wheel' is not installed.
Installing collected packages: wheel, urllib3, setuptools, pyyaml, idna, giturlparse, charset-normalizer, certifi, requests, cmind, cm4mlops
  Attempting uninstall: setuptools
    Found existing installation: setuptools 59.6.0
    Uninstalling setuptools-59.6.0:
      Successfully uninstalled setuptools-59.6.0
  Running setup.py install for cmind ... done
Successfully installed certifi-2024.8.30 charset-normalizer-3.3.2 cm4mlops-0.2 cmind-2.3.5 giturlparse-0.12.0 idna-3.8 pyyaml-6.0.2 requests-2.32.3 setuptools-74.0.0 urllib3-2.2.2 wheel-0.44.0
(cm) user@user:~/test$ cm run script --tags=install,python-venv --name=mlperf
CM error: automation script not found!
(cm) user@user:~/test$
I ran this script
cm run --tags=run-mlperf,inference,_find-performance,_full,_r4.0 --model=3d-unet-99 --implementation=intel --framework=pytorch --category=edge --scenario=Offline --execution_mode=test --device=cpu --quiet --test_query_count=50
There is no libffi7 package on my Ubuntu 23.04. Log below:
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y libffi7
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to find package libffi7
CM error: Portable CM script failed (name = get-generic-sys-util, return code = 256)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:
https://github.com/mlcommons/cm4mlops/issues
The CM concept is to collaboratively fix such issues inside portable CM scripts to make existing tools and native scripts more portable, interoperable and deterministic. Thank you!
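The apt failure above says that no configured repository offers a package named libffi7. Before re-running the workflow, it may help to check which package names the release actually provides. The following is only a minimal sketch: the helper names `pkg_available` and `report_packages` and the `APT_CACHE` override variable are hypothetical, and `libffi8` in the example is just a candidate name to probe, not a confirmed replacement for this workflow.

```shell
# Succeed if apt-cache has a record for this exact package name.
# APT_CACHE is a hypothetical override, useful for testing without apt.
pkg_available() {
  "${APT_CACHE:-apt-cache}" show "$1" >/dev/null 2>&1
}

# Print one availability line per package name given as an argument.
report_packages() {
  for pkg in "$@"; do
    if pkg_available "$pkg"; then
      echo "$pkg: available"
    else
      echo "$pkg: not found"
    fi
  done
}

# Example (run on the affected machine):
#   report_packages libffi7 libffi8
```

If the name the CM script asks for is reported as not found, that is useful detail to include when reporting the issue.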
Hi @Agalakdak, the docs page uses the docker option as the default to avoid such OS-dependent issues. Is there a reason you don't want to use docker?
Hi @arjunsuresh! Sorry for the slow reply; I tried different ways to solve it. I used the command
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.0 --model=3d-unet-99 --implementation=intel --framework=pytorch --category=edge --scenario=Offline --execution_mode=test --device=cpu --docker --quiet --test_query_count=50
From here https://docs.mlcommons.org/inference/benchmarks/medical_imaging/3d-unet/
And at the last step I got an error
129.8 /home/cmuser/CM/repos/local/cache/b6acf79e843b4c1e/miniconda3/bin/conda install -y -c intel mkl-include
130.2 Collecting package metadata (current_repodata.json): ...working... failed
132.3
132.3 UnavailableInvalidChannel: HTTP 403 FORBIDDEN for channel intel https://conda.anaconda.org/intel
132.3
132.3 The channel is not accessible or is invalid.
132.3
132.3 You will need to adjust your conda configuration to proceed.
132.3 Use conda config --show channels to view your configuration's current state,
132.3 and use conda config --show-sources to view config file locations.
132.3
132.3
132.5 Detected version: 3.10.12
132.5 Detected version: 3.10.12
132.5 Detected version: 22.0.2
132.5
132.5 Extra PIP CMD:
132.5
132.5 Detected version: 3.0.0
132.5 Detected version: 24.7.1
132.5 Detected version: 3.8.0
132.5
132.5 CM error: Portable CM script failed (name = install-generic-conda-package, return code = 256)
And some more logs in the file error_with_docker.txt
I don't understand what to do with this. What information can I provide you to solve the problem?
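The log above shows conda receiving HTTP 403 from the `intel` channel at https://conda.anaconda.org/intel. One way to check whether the channel is reachable from a given machine, independently of the CM workflow, is to request its package index directly. This is a sketch under assumptions: the helper name `channel_status` is hypothetical, and it assumes the channel follows the usual `<channel>/<subdir>/repodata.json` layout with `linux-64` as the platform subdirectory.

```shell
# Print the HTTP status code returned for a conda channel's package index.
# Usage: channel_status <channel> [subdir]
channel_status() {
  channel="$1"
  subdir="${2:-linux-64}"   # assumed platform subdirectory
  curl -s -o /dev/null -w '%{http_code}' \
    "https://conda.anaconda.org/${channel}/${subdir}/repodata.json"
}

# Example:
#   channel_status intel
#   conda config --show channels
```

If the status is the same 403 outside the container, the problem is likely on the channel side rather than in the local conda configuration.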
@arjunsuresh, I tried to run another benchmark. But there was an error there too. Please help me figure it out.
Command:
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.0 \
   --model=retinanet \
   --implementation=intel \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --docker --quiet \
   --test_query_count=100
Error log:
1762.7 environment: line 1: 51417 Killed ${CM_PYTHON_BIN_WITH_PATH} "$@"
1762.8
1762.8 CM error: Portable CM script failed (name = get-dataset-openimages, return code = 256)
1762.8
1762.8
1762.8 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1762.8 Note that it is often a portability issue of a third-party tool or a native script
1762.8 wrapped and unified by this CM script (automation recipe). Please re-run
1762.8 this script with --repro flag and report this issue with the original
1762.8 command line, cm-repro directory and full log here:
1762.8
1762.8 https://github.com/mlcommons/cm4mlops/issues
1762.8
1762.8 The CM concept is to collaboratively fix such issues inside portable CM scripts
1762.8 to make existing tools and native scripts more portable, interoperable
1762.8 and deterministic. Thank you!
1762.8
1762.8
1762.8 Using MLCommons Inference source from '/home/cmuser/CM/repos/local/cache/f93090d427b8435f/inference'
1762.8
1 warning found (use docker --debug to expand):
Full error log
error_with_docker2.txt
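The "Killed" line in this log means the Python process received SIGKILL. A common cause is the kernel's out-of-memory killer, although the log alone does not confirm that. A minimal sketch for checking the host's kernel log follows; the helper name `filter_oom_kills` is hypothetical, and the search pattern is an assumption about typical kernel wording rather than an exhaustive match.

```shell
# Read kernel log text on stdin and keep only lines that look like
# out-of-memory kills.
filter_oom_kills() {
  grep -i -E 'out of memory|oom-kill|killed process'
}

# Example on the host (reading the kernel log may require root):
#   sudo dmesg | filter_oom_kills
#   free -h
```

If a matching line mentions the PID from the log (51417 here), the dataset step probably needs more memory than the machine or container has available.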
Hi @Agalakdak, we do have a problem with the Intel implementation, as reported here. We'll work with Intel to fix it. Even then, the Intel implementation is expected to work only on the latest Intel server/workstation CPUs; we'll update the documentation to say so.
Hi @arjunsuresh, thanks for the prompt reply. I'll check that code on a system with two Intel Xeon Gold 6346 processors a little later.
@arjunsuresh Can I clarify for the future? Are there any problems with the "Quadro RTX 5000" and "Nvidia A40" video cards?
@Agalakdak Nvidia doesn't officially support them for MLPerf inference. But typically we have had good success running Nvidia code on such GPUs without much difficulty. Do you have a plan of what all you are trying to benchmark?
@arjunsuresh Yes, sure. First, I'd like to just run one of the inference benchmarks and compare them with the "reference indicators". If the launch is successful, I'll try to run a benchmark for "training" the network on several video cards using a docker container. And then use these results to find bottlenecks in the system (if there are any)
Today I'll try to run as many benchmarks as possible. And then I'll write about the results. If you need any additional information about the system, let me know.
@Agalakdak If you want to run as many benchmarks as possible, the best option to start with is the Nvidia implementation. Even if issues come up, they are usually quickly resolvable. If you just want to get a result, the reference implementation is good for smaller models like resnet50 and bert-99, as it runs on almost any CPU.
And if you are referring to MLPerf training benchmarks - that's very different from inference, even though many of the models in inference come from MLPerf training. Currently there is no automated way to run training benchmarks, and the only option is to follow the submitter READMEs in the results repository: https://github.com/mlcommons/training_results_v4.0
@arjunsuresh Oh, thanks a lot for the help, but I'm afraid I have another question. I successfully ran "Text to Image using Stable Diffusion":
cm run script --tags=run-mlperf,inference,_r4.1 \
   --model=sdxl \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet
My GPU did do some work. But after all this I ended up inside a container and can't find any results, either in the container or in the logs.
@Agalakdak that's only the first step. You need to do the following command from the documentation page inside that docker container.
Hello @arjunsuresh, unfortunately a new day brings new problems. On one of the systems I started SD (the test has been running for several hours, and I don't know whether I need to stop it forcibly or whether it will finish by itself), but on the other one it doesn't start. I ran the same command:
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=sdxl \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --test_query_count=50
What can I do with this error?
/usr/include/x86_64-linux-gnu/bits/mathcalls.h(110): error: identifier "_Float32" is undefined
Error limit reached.
100 errors detected in the compilation of "print_cuda_devices.cu".
Compilation terminated.
CM error: Portable CM script failed (name = get-cuda-devices, return code = 256)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:
https://github.com/mlcommons/cm4mlops/issues
The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!
Full log with error
server_err_sd_1.log
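Every failure in this thread ends with the same "Portable CM script failed (name = ..., return code = ...)" line, so a long log can be reduced to the failing script names with a small filter. This is a sketch only: the helper name `failed_cm_scripts` is hypothetical, and the pattern is based solely on the message format visible in the logs above.

```shell
# Read a CM log on stdin and print "<script name> <return code>"
# for every "Portable CM script failed" line.
failed_cm_scripts() {
  sed -n 's/.*Portable CM script failed (name = \([^,]*\), return code = \([0-9]*\)).*/\1 \2/p'
}

# Example:
#   failed_cm_scripts < server_err_sd_1.log
```

Quoting the script name it prints, together with the lines just before the failure, makes an issue report easier to act on than the full log alone.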
@Agalakdak the problem is due to CUDA compilation not working on the host machine. This step is not actually required, though we have never had such an issue before. Let me share the option to skip it.
@anandhu-eng are you able to share this option?
@arjunsuresh Hi, I encountered a similar problem when I wanted to run ResNet50. I launched:
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=resnet50 \
   --implementation=reference \
   --framework=onnxruntime \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --test_query_count=1000
It worked fine and I got inside the container. In the container I ran:
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios --model=resnet50 --implementation=reference --framework=onnxruntime --category=edge --execution_mode=valid --device=cuda --quiet
And I got (I assume) a similar error.
INFO:root: ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-cuda-devices/run.sh from tmp-run.sh
rm: cannot remove 'a.out': No such file or directory
Checking compiler version ...
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0
Compiling program ...
Running program ...
/home/cmuser
INFO:root:========================================================
INFO:root:Print file tmp-run.out:
INFO:root:
INFO:root:Error: problem obtaining number of CUDA devices: 100
INFO:root:
CM error: Portable CM script failed (name = get-cuda-devices, return code = 256)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:
https://github.com/mlcommons/cm4mlops/issues
The CM concept is to collaboratively fix such issues inside portable CM scripts to make existing tools and native scripts more portable, interoperable and deterministic. Thank you!
Log with error error_resnet50_docker_log.txt
Hi @Agalakdak We also sometimes face the below error while using Nvidia GPUs inside a container
INFO:root:Error: problem obtaining number of CUDA devices: 100
A quick fix for this is to exit the container. Use docker ps -a to get the container ID, say id. Then do docker start id && docker attach id and we should be back where we were, but with working Nvidia GPUs.
We have also removed the requirement to have NVCC in the host system - please do cm pull repo and you should be able to run sdxl.
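The restart workaround described above can be wrapped in two small helpers. This is a sketch, not part of CM: the function names `count_gpus` and `reattach_container` are hypothetical. It also assumes that the 100 in the error message is the CUDA runtime error code, in which case it corresponds to `cudaErrorNoDevice` (no CUDA-capable device detected), so a GPU count taken inside the container is a quick way to see whether the restart helped.

```shell
# Read `nvidia-smi -L` output on stdin and print how many GPUs it lists.
count_gpus() {
  grep -c '^GPU '
}

# Restart a stopped container and attach to it again.
# Usage: reattach_container <container-id>
reattach_container() {
  docker start "$1" && docker attach "$1"
}

# Example, on the host after exiting the container:
#   docker ps -a                 # find the container ID
#   reattach_container <id>
# Then, back inside the container:
#   nvidia-smi -L | count_gpus
```

A count of 0 inside the container after reattaching would suggest that the GPUs are still not visible and that the problem lies elsewhere.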
Hello. I have been trying to run at least 1 test for a long time and I constantly get errors. Please record a video or give me a link so that I can understand what a normal launch without pain should look like.