That looks like a GitHub connection issue. Can you please retry?
Hi @arjunsuresh , thanks for the advice. I tried several times (about 4-5) and in 4 out of 5 cases after ~2000-3000 seconds I got the error shown above.
Finally I got some results (for other neural networks). Please answer the questions below.

1) My results:

bert-99

+---------+----------+----------+------------+-----------------+
| Model   | Scenario | Accuracy | Throughput | Latency (in ms) |
+---------+----------+----------+------------+-----------------+
| bert-99 | Offline  | 90.16951 | X 1764.6   | -               |
+---------+----------+----------+------------+-----------------+
resnet50
+----------+----------+----------+------------+-----------------+
| Model    | Scenario | Accuracy | Throughput | Latency (in ms) |
+----------+----------+----------+------------+-----------------+
| resnet50 | Offline  | 76.034   | 19709.5    | -               |
+----------+----------+----------+------------+-----------------+
How should I interpret them, and what should I compare them with? I found some tables here: https://mlcommons.org/benchmarks/inference-edge/. Did I understand correctly that Throughput is the analogue of "Samples"? And what should I do with "Accuracy"?
2) I wanted to run resnet50 in the SingleStream scenario, but I got an error. The command:

cm run script --tags=run-mlperf,inference,_r4.1-dev --model=resnet50 --implementation=nvidia --framework=tensorrt --category=edge --scenario=SingleStream --execution_mode=valid --device=cuda --quiet
I took the command from here: https://docs.mlcommons.org/inference/benchmarks/language/bert/#__tabbed_59_3
Log with error: resnet50_error.log
@arjunsuresh Hi. I ran the command inside the container:

cm run script --tags=run-mlperf,inference,_r4.1-dev --model=3d-unet-99 --implementation=nvidia --framework=tensorrt --category=edge --scenario=Offline --execution_mode=valid --device=cuda --quiet

And I got an error: 3dunet_error.log
@Agalakdak Can you please open a separate issue for each model-related query?
For R50 can you try adding this option? --env.SKIP_POLICIES=1
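For clarity, here is a sketch of the SingleStream resnet50 command from above with that option appended. It is echoed rather than executed so it can be inspected first; remove the `echo` to actually launch the run (which requires the cm4mlops CLI and the prepared environment).

```shell
# Sketch: the earlier SingleStream resnet50 command with the suggested
# --env.SKIP_POLICIES=1 workaround appended. Echoed for inspection only;
# drop the 'echo' to actually launch the run.
CMD='cm run script --tags=run-mlperf,inference,_r4.1-dev \
  --model=resnet50 --implementation=nvidia --framework=tensorrt \
  --category=edge --scenario=SingleStream --execution_mode=valid \
  --device=cuda --env.SKIP_POLICIES=1 --quiet'
echo "$CMD"
```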
For the 3d-unet failure, can you please add --docker_cache=no to eliminate any issue with a stale cache?
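Assembled from the two messages above, the retry could look roughly like the sketch below. This is an assumption about where the flag goes (appended to the same failing 3d-unet command), not a verified invocation, and it is guarded so it only prints a note on machines without the cm CLI.

```shell
# Sketch: the earlier Offline 3d-unet command plus the suggested
# --docker_cache=no, which asks CM not to reuse old cached state.
# Guarded: the benchmark only launches when the cm CLI is installed.
if command -v cm >/dev/null 2>&1; then
  STATUS=launched
  cm run script --tags=run-mlperf,inference,_r4.1-dev \
    --model=3d-unet-99 --implementation=nvidia --framework=tensorrt \
    --category=edge --scenario=Offline --execution_mode=valid \
    --device=cuda --docker_cache=no --quiet
else
  STATUS=skipped
  echo "cm CLI not found: install cm4mlops first, then run the command above"
fi
```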
While running in the "closed" division, accuracy must be above the threshold or the submission checker will fail. For this reason, accuracy is not reported in the official results. In other words, the accuracy values of all submissions are expected to be very close in the closed division, so only the performance number matters.
For throughput - yes, it is "samples per second" for most benchmarks and "tokens per second" for LLM ones.
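As a concrete illustration of where that number comes from: loadgen writes a mlperf_log_summary.txt for each run, and the throughput line can be pulled out with standard tools. The summary fragment below is mocked up for illustration (the 19709.5 figure is just reused from the resnet50 table above), so only the "Samples per second" line format should be taken as representative.

```shell
# Mocked-up loadgen summary fragment (illustrative values, not a real run);
# the 'Samples per second' line is the Offline throughput that the result
# tables report.
cat > /tmp/mlperf_log_summary.txt <<'EOF'
================================================
MLPerf Results Summary
================================================
Scenario : Offline
Mode     : PerformanceOnly
Samples per second: 19709.5
Result is : VALID
EOF

# Extract just the throughput figure from the summary:
THROUGHPUT=$(grep 'Samples per second' /tmp/mlperf_log_summary.txt | awk -F': ' '{print $2}')
echo "$THROUGHPUT"
```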
Hi @arjunsuresh, sorry for the late reply. I was busy with other things. I tried your advice and unfortunately got errors again. But this time I can provide the entire log of the first and second steps.
The first step is entering the command to go to the container. cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \ --model=3d-unet-99 \ --implementation=nvidia \ --framework=tensorrt \ --category=edge \ --scenario=Offline \ --execution_mode=test \ --device=cuda \ --docker --quiet \ --test_query_count=50
The second step is actually entering the command in the container itself. cm run script --tags=run-mlperf,inference,_r4.1-dev \ --model=sdxl \ --implementation=nvidia \ --framework=tensorrt \ --category=edge \ --scenario=Offline \ --execution_mode=valid \ --device=cuda \ --quiet
3dunet_full_error_second_step.log 3dunet_full_error_first_step.log
Hi @Agalakdak The second command you shared is for sdxl
but the logs are for 3d-unet. Is SDXL working fine? Let me check 3d-unet at my end.
It's working fine for me.
make preprocess_data BENCHMARKS='3d-unet'
/home/cmuser/CM/repos/local/cache/4ea5dceee2464cb7/repo/closed/NVIDIA/code/3d-unet/tensorrt/preprocess_data.py:37: DeprecationWarning: Please use `zoom` from the `scipy.ndimage` namespace, the `scipy.ndimage.interpolation` namespace is deprecated.
from scipy.ndimage.interpolation import zoom
Preprocessing /home/cmuser/CM/repos/local/cache/b1f8faeaa7384886/data/KiTS19/kits19/data...
Saved /home/cmuser/CM/repos/local/cache/b1f8faeaa7384886/preprocessed_data/KiTS19/reference/case_00012.pkl -- shape (1, 256, 320, 320) mean [-1.8] std [1.05]
Saved /home/cmuser/CM/repos/local/cache/b1f8faeaa7384886/preprocessed_data/KiTS19/reference/case_00044.pkl -- shape (1, 320, 384, 384) mean [-1.86] std [1.05]
Saved /home/cmuser/CM/repos/local/cache/b1f8faeaa7384886/preprocessed_data/KiTS19/reference/case_00024.pkl -- shape (1, 256, 256, 256) mean [-1.66] std [1.17]
...
What I suspect is a failure in the download of the kits19 dataset, as the Nvidia script below skips a redownload if the file already exists, without checking its validity.
The command below will give you the path to the NVIDIA scratch space where the data gets downloaded. You can manually remove the kits19 data directory from there and then retry the command.
cm run script "get mlperf inference nvidia scratch space _version.4_0" -j
Meanwhile, note that the kits19 download is slow and can take several hours to complete.
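The cleanup described above can be sketched as follows. The scratch path here is a placeholder (substitute the real one from the cm scratch-space command above), and the KiTS19 subdirectory name is inferred from the preprocessing logs earlier in the thread; the actual rm is left commented out so nothing is deleted by accident.

```shell
# Sketch of the stale-download cleanup. SCRATCH is a placeholder: get the
# real path from the 'get mlperf inference nvidia scratch space' command
# and substitute it here before uncommenting the rm.
SCRATCH=/path/to/nvidia/scratch   # placeholder, not a real path
TARGET="$SCRATCH/data/KiTS19"     # subdirectory name inferred from the logs
echo "Would remove: $TARGET"
# rm -rf "$TARGET"                # uncomment only after verifying the path
```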
Hello @arjunsuresh. I think I figured out what the problem is. It's a network issue... Below is the log:

case_00299: 100%|...| 280551/280552 [00:48<00:00, 5728.77KB/s]
Duplicating KITS19 case_00185 as case_00400...
~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA
Done.
Downloading JSON files describing subset used for inference/calibration...
--2024-09-17 02:52:31--  https://raw.githubusercontent.com/mlcommons/inference/486a629ea4d5c5150f452d0b0a196bf71fd2021e
Connecting to my_proxy_ip:8080... connected.
Proxy request sent, awaiting response... 400 Bad Request
2024-09-17 02:52:32 ERROR 400: Bad Request.
--2024-09-17 02:52:32--  http://92dd3d24cf78d07aa31165f90c636d98c4adddcd/vision/medical_imaging/3d-unet-kits19/meta/inference_cases.json
Connecting to my_proxy_ip:8080... connected.
Proxy request sent, awaiting response... 404 No such domain
2024-09-17 02:52:32 ERROR 404: No such domain.
--2024-09-17 02:52:32--  https://raw.githubusercontent.com/mlcommons/inference/486a629ea4d5c5150f452d0b0a196bf71fd2021e
Connecting to my_proxy_ip:8080... connected.
Proxy request sent, awaiting response... 400 Bad Request
2024-09-17 02:52:32 ERROR 400: Bad Request.
--2024-09-17 02:52:32--  http://92dd3d24cf78d07aa31165f90c636d98c4adddcd/vision/medical_imaging/3d-unet-kits19/meta/calibration_cases.json
Connecting to my_proxy_ip:8080... connected.
Proxy request sent, awaiting response... 404 No such domain
2024-09-17 02:52:33 ERROR 404: No such domain.
Done.
Finished downloading all the datasets!
/home/cmuser/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA/code/3d-unet/tensorrt/preprocess_data.py:37: DeprecationWarning: Please use `zoom` from the `scipy.ndimage` namespace, the `scipy.ndimage.interpolation` namespace is deprecated.
from scipy.ndimage.interpolation import zoom
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/cmuser/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA/code/3d-unet/tensorrt/preprocess_data.py", line 858, in
CM error: Portable CM script failed (name = app-mlperf-inference-nvidia, return code = 256)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script wrapped and unified by this CM script (automation recipe). Please re-run this script with --repro flag and report this issue with the original command line, cm-repro directory and full log here:
https://github.com/mlcommons/cm4mlops/issues
The CM concept is to collaboratively fix such issues inside portable CM scripts to make existing tools and native scripts more portable, interoperable and deterministic. Thank you!
cmuser@85f58939130e
@Agalakdak Actually that looks like a problem with the download script where it is creating invalid URLs. It probably worked fine for me because some of the downloaded files were already present. We'll fix this issue in the script.
@arjunsuresh If you need more information about my system, please let me know
Hi @Agalakdak Can you please do this (inside the container)
cd ~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA
git pull
cm rm cache --tags=_download_data -f
And retry the command?
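The three recovery steps above can be collected into one guarded sequence for convenience. Note that the cache hash 28c214f878cb4afe is specific to this particular setup and will differ on other machines.

```shell
# Guarded version of the recovery steps: pull the fixed NVIDIA repo and
# drop the cached download so it gets re-fetched on the next run. The
# cache hash in the path is taken from this thread and is setup-specific.
REPO="$HOME/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA"
if [ -d "$REPO" ]; then
  STATUS=applied
  cd "$REPO" && git pull
  cm rm cache --tags=_download_data -f
else
  STATUS=skipped
  echo "repo path not found: $REPO (adjust the cache hash for your setup)"
fi
```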
@arjunsuresh Hi, I tried the advice above. It didn't help. There aren't many logs, so I have just duplicated them below.
cmuser@ccaeef79d72e:~$ cd ~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA
cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$ git pull
remote: Enumerating objects: 21, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 13 (delta 10), reused 0 (delta 0), pack-reused 0 (from 0)
Unpacking objects: 100% (13/13), 2.32 KiB | 339.00 KiB/s, done.
From https://github.com/GATEOverflow/inference_results_v4.0
   c032f835c..7abca22ba  main -> origin/main
Updating c032f835c..7abca22ba
Fast-forward
 closed/NVIDIA/Makefile.build                         | 2 --
 closed/NVIDIA/code/3d-unet/tensorrt/download_data.sh | 4 ++--
 2 files changed, 2 insertions(+), 4 deletions(-)
cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$ cm rm cache --tags=_download_data -f
CM error: artifact(s) not found! cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$
...
cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$ source ~/cm-venv/bin/activate
(cm-venv) cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$ cm rm cache --tags=_download_data -f
CM error: artifact(s) not found!
Hi @Agalakdak Can you please try cm rm cache --tags=_preprocess_data -f instead?
@arjunsuresh Hi, I tried to run the commands in this order:
1) cd ~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA
2) git pull
3) cm rm cache --tags=_download_data -f
And on the 3rd step I got the error "cm: command not found".
I tried to run "cm rm cache --tags=_preprocess_data -f" right after entering the container. And the command completed successfully. But it did not give any result.
Can you retry the original command? No need to do command number 3.
@arjunsuresh Hi, I am constantly busy with all sorts of tasks, so it is not always possible to promptly collect the necessary logs. The log below covers:

1) Running the command (and getting an error):
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
  --model=3d-unet-99 \
  --implementation=nvidia \
  --framework=tensorrt \
  --category=edge \
  --scenario=Offline \
  --execution_mode=test \
  --device=cuda \
  --docker --quiet \
  --test_query_count=50
2) Running the command cd ~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA
3) Running the command git pull
4) Running the command
cm run script --tags=run-mlperf,inference,_r4.1-dev \
  --model=3d-unet-99 \
  --implementation=nvidia \
  --framework=tensorrt \
  --category=edge \
  --scenario=Offline \
  --execution_mode=valid \
  --device=cuda \
  --quiet
(and getting an error)
Full log: unet_19_09_error.log
No worries. I have added some extra checks for existing stale files. Can you please do cm pull repo and just repeat the 4th command (both inside the container)?
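Spelled out, that retry amounts to the two commands below, both meant to be run inside the container. They are echoed here for inspection rather than executed; the second is just the 4th command from the previous message.

```shell
# The refresh-and-retry sequence, echoed rather than executed so it can
# be reviewed first; both commands are meant to run inside the container.
STEP1='cm pull repo'
STEP2='cm run script --tags=run-mlperf,inference,_r4.1-dev \
  --model=3d-unet-99 --implementation=nvidia --framework=tensorrt \
  --category=edge --scenario=Offline --execution_mode=valid \
  --device=cuda --quiet'
printf '%s\n%s\n' "$STEP1" "$STEP2"
```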
Hi @arjunsuresh , I repeated all the commands as I did above. I got the same result.
Hello everyone, I have already submitted a bug report here, but that topic accumulated a lot of messages, so I decided to create a new one. This time I ran 3d-unet using the command below, taken from this site: https://docs.mlcommons.org/inference/benchmarks/medical_imaging/3d-unet/
The command:

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
  --model=3d-unet-99 \
  --implementation=nvidia \
  --framework=tensorrt \
  --category=edge \
  --scenario=Offline \
  --execution_mode=test \
  --device=cuda \
  --docker --quiet \
  --test_query_count=50
and a brief error report:

0.580 INFO:root:      ! cd /home/cmuser/CM/repos/local/cache/5103bc0a39b8472f
0.580 INFO:root:      ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-git-repo/run.sh from tmp-run.sh
0.584 /home/cmuser/CM/repos/local/cache/5103bc0a39b8472f
0.585 **
0.585 Current directory: /home/cmuser/CM/repos/local/cache/5103bc0a39b8472f
0.585
0.585 Cloning inference from https://github.com/mlcommons/inference
0.585
0.585 git clone -b master https://github.com/mlcommons/inference --depth 5 inference
0.585
0.586 Cloning into 'inference'...
38.68 fatal: the remote end hung up unexpectedly
38.69 fatal: early EOF
38.69 fatal: index-pack failed
38.69 Detected version: 3.8.10
38.69 Detected version: 3.8.10
38.69
38.69 CM error: Portable CM script failed (name = get-git-repo, return code = 256)
38.69
38.69 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
38.69 Note that it is often a portability issue of a third-party tool or a native script
38.69 wrapped and unified by this CM script (automation recipe). Please re-run
38.69 this script with --repro flag and report this issue with the original
38.69 command line, cm-repro directory and full log here:
38.69
38.69 https://github.com/mlcommons/cm4mlops/issues
38.69
38.69 The CM concept is to collaboratively fix such issues inside portable CM scripts
38.69 to make existing tools and native scripts more portable, interoperable
38.69 and deterministic. Thank you!
Full log with the problem
error_3dunet.log