mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks
https://mlcommons.org/en/groups/inference
Apache License 2.0

Make a video #1840

Open Agalakdak opened 3 months ago

Agalakdak commented 3 months ago

Hello. I have been trying to run at least 1 test for a long time and I constantly get errors. Please record a video or give me a link so that I can understand what a normal launch without pain should look like.

Agalakdak commented 3 months ago

I will donate $10.

psyhtest commented 3 months ago

Which test? Which errors?

Agalakdak commented 3 months ago

I have encountered so many errors that I don't know where to start. For example, I first followed this document: https://docs.mlcommons.org/inference/install/. Then I went straight to https://docs.mlcommons.org/inference/benchmarks/medical_imaging/3d-unet/, and when I entered

```
cm run script --tags=install,python-venv --name=mlperf
```

I got the message `CM error: automation script not found!`

I found this error today; up to this point the instructions worked. The log is below:

```
user@user:~$ mkdir test
user@user:~$ cd test/
user@user:~/test$ python3 -m venv cm
user@user:~/test$ source cm/bin/activate
(cm) user@user:~/test$ pip install cm4mlops
Collecting cm4mlops
  Using cached cm4mlops-0.2-py3-none-any.whl
Collecting cmind
  Using cached cmind-2.3.5.tar.gz (63 kB)
  Preparing metadata (setup.py) ... done
Collecting pyyaml
  Using cached PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (751 kB)
Collecting requests
  Using cached requests-2.32.3-py3-none-any.whl (64 kB)
Collecting giturlparse
  Using cached giturlparse-0.12.0-py2.py3-none-any.whl (15 kB)
Collecting setuptools>=60
  Using cached setuptools-74.0.0-py3-none-any.whl (1.3 MB)
Collecting wheel
  Using cached wheel-0.44.0-py3-none-any.whl (67 kB)
Collecting charset-normalizer<4,>=2
  Using cached charset_normalizer-3.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (142 kB)
Collecting idna<4,>=2.5
  Using cached idna-3.8-py3-none-any.whl (66 kB)
Collecting certifi>=2017.4.17
  Downloading certifi-2024.8.30-py3-none-any.whl (167 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 167.3/167.3 KB 1.3 MB/s eta 0:00:00
Collecting urllib3<3,>=1.21.1
  Using cached urllib3-2.2.2-py3-none-any.whl (121 kB)
Using legacy 'setup.py install' for cmind, since package 'wheel' is not installed.
Installing collected packages: wheel, urllib3, setuptools, pyyaml, idna, giturlparse, charset-normalizer, certifi, requests, cmind, cm4mlops
  Attempting uninstall: setuptools
    Found existing installation: setuptools 59.6.0
    Uninstalling setuptools-59.6.0:
      Successfully uninstalled setuptools-59.6.0
  Running setup.py install for cmind ... done
Successfully installed certifi-2024.8.30 charset-normalizer-3.3.2 cm4mlops-0.2 cmind-2.3.5 giturlparse-0.12.0 idna-3.8 pyyaml-6.0.2 requests-2.32.3 setuptools-74.0.0 urllib3-2.2.2 wheel-0.44.0
(cm) user@user:~/test$ cm run script --tags=install,python-venv --name=mlperf

CM error: automation script not found!
(cm) user@user:~/test$
```
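A likely cause here (an assumption; the thread does not confirm it) is that no CM script repository was registered after `pip install cm4mlops`, so `cm` has no automations to search. A minimal fix sketch, using the same `cm pull repo` step suggested later in this thread:

```
# Assumption: "automation script not found" means no CM script repo is registered.
# Register the mlcommons@cm4mlops script repository, then retry the original command.
cm pull repo mlcommons@cm4mlops
cm run script --tags=install,python-venv --name=mlperf
```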

Agalakdak commented 3 months ago

I ran this script:

```
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.0 \
   --model=3d-unet-99 \
   --implementation=intel \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --quiet \
   --test_query_count=50
```

There is no libffi7 package on my Ubuntu 23.04. Log below:

```
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y libffi7
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to find package libffi7
```

```
CM error: Portable CM script failed (name = get-generic-sys-util, return code = 256)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!
```
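For context: recent Ubuntu releases ship libffi8, and libffi7 exists only in older archives such as 20.04 (focal), which is why apt cannot find it on 23.04. One unofficial workaround sketch, assuming an x86_64 machine and that the focal pool still hosts this exact .deb filename (verify before use):

```
# Unofficial workaround sketch: install libffi7 from the Ubuntu 20.04 (focal) pool.
# The exact filename is an assumption; check the directory listing first.
wget http://archive.ubuntu.com/ubuntu/pool/main/libf/libffi/libffi7_3.3-4_amd64.deb
sudo dpkg -i libffi7_3.3-4_amd64.deb
```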

arjunsuresh commented 2 months ago

Hi @Agalakdak The docs page uses the docker option as the default to avoid such OS-dependent issues. Is there a reason you don't want to use docker?

Agalakdak commented 2 months ago

Hi @arjunsuresh! Sorry for the late reply; I was trying different ways to solve it. I used the command

```
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.0 \
   --model=3d-unet-99 \
   --implementation=intel \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --docker --quiet \
   --test_query_count=50
```

From here https://docs.mlcommons.org/inference/benchmarks/medical_imaging/3d-unet/

And at the last step I got an error:

```
129.8 /home/cmuser/CM/repos/local/cache/b6acf79e843b4c1e/miniconda3/bin/conda install -y -c intel mkl-include
130.2 Collecting package metadata (current_repodata.json): ...working... failed
132.3
132.3 UnavailableInvalidChannel: HTTP 403 FORBIDDEN for channel intel https://conda.anaconda.org/intel
132.3
132.3 The channel is not accessible or is invalid.
132.3
132.3 You will need to adjust your conda configuration to proceed.
132.3 Use conda config --show channels to view your configuration's current state,
132.3 and use conda config --show-sources to view config file locations.
132.3
132.5 Detected version: 3.10.12
132.5 Detected version: 3.10.12
132.5 Detected version: 22.0.2
132.5
132.5 Extra PIP CMD:
132.5
132.5 Detected version: 3.0.0
132.5 Detected version: 24.7.1
132.5 Detected version: 3.8.0
132.5
132.5 CM error: Portable CM script failed (name = install-generic-conda-package, return code = 256)
```

And some more logs in the file error_with_docker.txt

I don't understand what to do with this. What information can I provide you to solve the problem?
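Some background on the 403 (my understanding; not stated in this thread): Intel retired its anaconda.org channel, so `conda install -c intel` now fails regardless of the local setup. A workaround sketch inside the container, assuming Intel's replacement conda repository URL below is still current:

```
# Workaround sketch: swap the retired anaconda.org/intel channel for Intel's
# hosted repository (the URL is an assumption; confirm in Intel's documentation).
conda config --remove channels intel 2>/dev/null || true
conda install -y -c https://software.repos.intel.com/python/conda/ mkl-include
```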

Agalakdak commented 2 months ago

@arjunsuresh, I tried to run another benchmark. But there was an error there too. Please help me figure it out.

Command:

```
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.0 \
   --model=retinanet \
   --implementation=intel \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --docker --quiet \
   --test_query_count=100
```

Error log:

```
1762.7 environment: line 1: 51417 Killed    ${CM_PYTHON_BIN_WITH_PATH} "$@"
1762.8
1762.8 CM error: Portable CM script failed (name = get-dataset-openimages, return code = 256)
1762.8
1762.8 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1762.8 Note that it is often a portability issue of a third-party tool or a native script
1762.8 wrapped and unified by this CM script (automation recipe). Please re-run
1762.8 this script with --repro flag and report this issue with the original
1762.8 command line, cm-repro directory and full log here:
1762.8
1762.8 https://github.com/mlcommons/cm4mlops/issues
1762.8
1762.8 The CM concept is to collaboratively fix such issues inside portable CM scripts
1762.8 to make existing tools and native scripts more portable, interoperable
1762.8 and deterministic. Thank you!
1762.8
1762.8 Using MLCommons Inference source from '/home/cmuser/CM/repos/local/cache/f93090d427b8435f/inference'
1762.8

1 warning found (use docker --debug to expand):
```

Full error log
error_with_docker2.txt
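One observation on this log (a guess, not a confirmed diagnosis): `51417 Killed` during get-dataset-openimages is the usual signature of the kernel OOM killer ending a memory-hungry dataset preprocessing step. A quick check on the host:

```
# Check whether the kernel OOM killer terminated the process (run on the host):
sudo dmesg -T | grep -iE 'killed process|out of memory' | tail
```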

arjunsuresh commented 2 months ago

Hi @Agalakdak we do have problems with the Intel implementation, as reported here. We'll work with Intel to fix these. But even then, the Intel implementation is expected to work only on the latest Intel server/workstation CPUs - we'll update this in the documentation.

Agalakdak commented 2 months ago

Hi @arjunsuresh, thanks for the prompt reply. I'll check that code on the Intel Xeon Gold 6346 (x2) processor a little later.

Agalakdak commented 2 months ago

@arjunsuresh Can I clarify for the future? Are there any problems with the "Quadro RTX 5000" and "Nvidia A40" video cards?

arjunsuresh commented 2 months ago

@Agalakdak Nvidia doesn't officially support them for MLPerf inference. But typically we have had good success running Nvidia code on such GPUs without much difficulty. Do you have a plan for what you are trying to benchmark?

Agalakdak commented 2 months ago

@arjunsuresh Yes, sure. First, I'd like to just run one of the inference benchmarks and compare the results with the "reference indicators". If the launch is successful, I'll try to run a benchmark for "training" the network on several video cards using a docker container, and then use those results to find bottlenecks in the system (if there are any).

Today I'll try to run as many benchmarks as possible. And then I'll write about the results. If you need any additional information about the system, let me know.

arjunsuresh commented 2 months ago

@Agalakdak If you want to run as many benchmarks as possible, the best option to start with is the Nvidia implementation. Even if an issue comes up, it is usually quickly resolvable. If you just want to try getting a result, the reference implementation is good for smaller models like resnet50 and bert-99, as it runs on almost any CPU; a sketch of such a command follows.
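A sketch of that route, assembled from the commands already used in this thread with `--implementation` and `--device` swapped (an illustration, not an exact quote from the docs; verify the flags on the docs page for your round):

```
# Sketch: reference implementation of resnet50 on CPU in test mode.
# Flags mirror the r4.1-dev commands used elsewhere in this thread.
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=resnet50 \
   --implementation=reference \
   --framework=onnxruntime \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cpu \
   --quiet \
   --test_query_count=100
```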

And if you are referring to MLPerf training benchmarks - that's very different from inference, even though many of the models in inference come from MLPerf training. Currently there is no automated way to run the training benchmarks, and the only option is to follow the submitter READMEs in the results repository: https://github.com/mlcommons/training_results_v4.0

Agalakdak commented 2 months ago

@arjunsuresh Oh, thanks a lot for the help, but I'm afraid I have another question. I successfully ran "Text to Image using Stable Diffusion":

```
cm run script --tags=run-mlperf,inference,_r4.1 \
   --model=sdxl \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet
```

My GPU did do some work. But after all this I ended up inside a container and can't find any results, neither in the container nor in the logs.

error_with_docker3.txt

arjunsuresh commented 2 months ago

@Agalakdak that's only the first step. You need to run the follow-up command from the documentation page inside that docker container.
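For readers following along, the documented flow (as I understand it) is: the setup command with `--docker` builds the image and drops you into the container, and the benchmark itself is then launched inside the container. A sketch of what that second step might look like, assuming the same flags as the setup command minus `--docker`:

```
# Sketch of the follow-up command, run *inside* the container (assumption:
# same flags as the setup command without --docker; confirm on the docs page).
cm run script --tags=run-mlperf,inference,_r4.1 \
   --model=sdxl \
   --implementation=reference \
   --framework=pytorch \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet
```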

Agalakdak commented 2 months ago

Hello @arjunsuresh, unfortunately it's a new day and new problems. On one of the systems I started SD (the test has been running for several hours; I don't know whether I need to stop it forcibly or it will finish by itself?), but on the other one it doesn't start. I ran the same command:

```
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=sdxl \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --test_query_count=50
```

What can I do with this error?

```
/usr/include/x86_64-linux-gnu/bits/mathcalls.h(110): error: identifier "_Float32" is undefined
Error limit reached.
100 errors detected in the compilation of "print_cuda_devices.cu".
Compilation terminated.

CM error: Portable CM script failed (name = get-cuda-devices, return code = 256)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run this
script with --repro flag and report this issue with the original command line,
cm-repro directory and full log here: https://github.com/mlcommons/cm4mlops/issues
The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!
```

Full log with error
server_err_sd_1.log
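For background (a common cause of this exact error, though not verified for this system): `_Float32` errors from the glibc headers usually mean the host gcc is too new for the installed CUDA toolkit. Until the host NVCC requirement is dropped (as mentioned in a later reply), a diagnostic sketch is to point nvcc at an older host compiler:

```
# Diagnostic sketch: compile the failing test program with an older host compiler.
# Assumes g++-10 is installable and compatible with the local CUDA toolkit.
sudo apt-get install -y g++-10
nvcc -ccbin g++-10 print_cuda_devices.cu -o print_cuda_devices
```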

arjunsuresh commented 2 months ago

@Agalakdak the problem is due to CUDA compilation not working on the host machine. This is actually not a necessity, though we have never had such an issue before. Let me share the option to skip this with you.

@anandhu-eng are you able to share this option?

Agalakdak commented 2 months ago

@arjunsuresh Hi, I encountered a similar problem when I wanted to run ResNet50. I launched:

```
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=resnet50 \
   --implementation=reference \
   --framework=onnxruntime \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --test_query_count=1000
```

and it worked fine - I got inside the container. In the container I ran:

```
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios \
   --model=resnet50 \
   --implementation=reference \
   --framework=onnxruntime \
   --category=edge \
   --execution_mode=valid \
   --device=cuda \
   --quiet
```

And I got (I assume) a similar error.

```
INFO:root:         ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-cuda-devices/run.sh from tmp-run.sh
rm: cannot remove 'a.out': No such file or directory

Checking compiler version ...

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0

Compiling program ...

Running program ...

/home/cmuser
INFO:root:========================================================
INFO:root:Print file tmp-run.out:
INFO:root:
INFO:root:Error: problem obtaining number of CUDA devices: 100

INFO:root:

CM error: Portable CM script failed (name = get-cuda-devices, return code = 256)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!
```

Log with error: error_resnet50_docker_log.txt

arjunsuresh commented 2 months ago

Hi @Agalakdak We also sometimes face the below error while using Nvidia GPUs inside a container:

```
INFO:root:Error: problem obtaining number of CUDA devices: 100
```

A quick fix for this is to exit the container. Use `docker ps -a` to get the container ID, say `<id>`. Then do `docker start <id> && docker attach <id>` and we should be back where we were, but with working Nvidia GPUs.
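In shell form, with `<id>` standing in for the actual container ID:

```
# After exiting the container, from the host:
docker ps -a                              # find the ID of the exited container
docker start <id> && docker attach <id>   # restart and re-enter it; the GPUs should work again
```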

We have also removed the requirement to have NVCC on the host system - please do `cm pull repo` and you should be able to run sdxl.