mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks
https://mlcommons.org/en/groups/inference
Apache License 2.0

Issue with 3d-unet #1845

Closed Agalakdak closed 2 weeks ago

Agalakdak commented 3 weeks ago

Hello everyone, I already submitted a bug report here, but that topic accumulated a lot of messages, so I decided to create a new one. This time I ran 3d-unet using the command below, taken from https://docs.mlcommons.org/inference/benchmarks/medical_imaging/3d-unet/

The command:

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
    --model=3d-unet-99 \
    --implementation=nvidia \
    --framework=tensorrt \
    --category=edge \
    --scenario=Offline \
    --execution_mode=test \
    --device=cuda \
    --docker --quiet \
    --test_query_count=50

and a brief error report:

0.580 INFO:root:         ! cd /home/cmuser/CM/repos/local/cache/5103bc0a39b8472f
0.580 INFO:root:         ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-git-repo/run.sh from tmp-run.sh
0.584 /home/cmuser/CM/repos/local/cache/5103bc0a39b8472f
0.585 **
0.585 Current directory: /home/cmuser/CM/repos/local/cache/5103bc0a39b8472f
0.585
0.585 Cloning inference from https://github.com/mlcommons/inference
0.585
0.585 git clone -b master https://github.com/mlcommons/inference --depth 5 inference
0.585
0.586 Cloning into 'inference'...
38.68 fatal: the remote end hung up unexpectedly
38.69 fatal: early EOF
38.69 fatal: index-pack failed
38.69 Detected version: 3.8.10
38.69 Detected version: 3.8.10
38.69
38.69 CM error: Portable CM script failed (name = get-git-repo, return code = 256)
38.69
38.69
38.69 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
38.69 Note that it is often a portability issue of a third-party tool or a native script
38.69 wrapped and unified by this CM script (automation recipe). Please re-run
38.69 this script with --repro flag and report this issue with the original
38.69 command line, cm-repro directory and full log here:
38.69
38.69 https://github.com/mlcommons/cm4mlops/issues
38.69
38.69 The CM concept is to collaboratively fix such issues inside portable CM scripts
38.69 to make existing tools and native scripts more portable, interoperable
38.69 and deterministic. Thank you!

Full log with the problem

error_3dunet.log

arjunsuresh commented 3 weeks ago

That looks like a GitHub connection issue. Can you please retry?
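One quick way to check whether it really is a network problem is to run the same clone command shown in the log above directly, outside of CM (copied verbatim from the log):

git clone -b master https://github.com/mlcommons/inference --depth 5 inference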

Agalakdak commented 3 weeks ago

Hi @arjunsuresh, thanks for the advice. I tried several times (about 4-5), and in 4 out of 5 cases I got the error shown above after ~2000-3000 seconds.

Finally I got some results (for other neural networks). Please answer the questions below.

1) My results:

bert-99

+---------+----------+----------+------------+-----------------+
|  Model  | Scenario | Accuracy | Throughput | Latency (in ms) |
+---------+----------+----------+------------+-----------------+
| bert-99 | Offline  | 90.16951 |  X 1764.6  |        -        |
+---------+----------+----------+------------+-----------------+

resnet50

+----------+----------+----------+------------+-----------------+
|  Model   | Scenario | Accuracy | Throughput | Latency (in ms) |
+----------+----------+----------+------------+-----------------+
| resnet50 | Offline  |  76.034  |  19709.5   |        -        |
+----------+----------+----------+------------+-----------------+

How should I interpret these numbers, and what should I compare them with? I found some tables here: https://mlcommons.org/benchmarks/inference-edge/. Did I understand correctly that Throughput is the analogue of "Samples"? And what should I do with "Accuracy"?

2) I wanted to run resnet50 in the SingleStream scenario, but I got an error. The command:

cm run script --tags=run-mlperf,inference,_r4.1-dev --model=resnet50 --implementation=nvidia --framework=tensorrt --category=edge --scenario=SingleStream --execution_mode=valid --device=cuda --quiet

I took the command from here: https://docs.mlcommons.org/inference/benchmarks/language/bert/#__tabbed_59_3

Log with error: resnet50_error.log

Agalakdak commented 3 weeks ago

@arjunsuresh Hi. I ran the command inside the container:

cm run script --tags=run-mlperf,inference,_r4.1-dev --model=3d-unet-99 --implementation=nvidia --framework=tensorrt --category=edge --scenario=Offline --execution_mode=valid --device=cuda --quiet

And I got an error: 3dunet_error.log

arjunsuresh commented 3 weeks ago

@Agalakdak Can you please open a separate issue for each model related query?

For R50 can you try adding this option? --env.SKIP_POLICIES=1

For the 3d-unet failure, can you please add --docker_cache=no to rule out any issue with a stale Docker cache?
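For reference, a sketch of how those flags would be appended to the commands shared earlier in this thread (the base commands are copied from above; the trailing flags are the only additions):

# resnet50 SingleStream run with the policies check skipped
cm run script --tags=run-mlperf,inference,_r4.1-dev --model=resnet50 --implementation=nvidia --framework=tensorrt --category=edge --scenario=SingleStream --execution_mode=valid --device=cuda --quiet --env.SKIP_POLICIES=1

# 3d-unet docker run with the Docker cache disabled
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev --model=3d-unet-99 --implementation=nvidia --framework=tensorrt --category=edge --scenario=Offline --execution_mode=test --device=cuda --docker --docker_cache=no --quiet --test_query_count=50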

While running in the "closed" division, accuracy must be above the threshold or the submission checker will fail. For this reason, accuracy is not reported in the official results. In other words, the accuracy values of all submissions in the closed division are expected to be very close, so only the performance number matters.

For throughput - yes, it is "samples per second" for most benchmarks and "tokens per second" for LLM ones.

Agalakdak commented 3 weeks ago

Hi @arjunsuresh, sorry for the late reply. I was busy with other things. I tried your advice and unfortunately got errors again. But this time I can provide the entire log of the first and second steps.

The first step is entering the command to go to the container:

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
    --model=3d-unet-99 \
    --implementation=nvidia \
    --framework=tensorrt \
    --category=edge \
    --scenario=Offline \
    --execution_mode=test \
    --device=cuda \
    --docker --quiet \
    --test_query_count=50

The second step is actually entering the command in the container itself:

cm run script --tags=run-mlperf,inference,_r4.1-dev \
    --model=sdxl \
    --implementation=nvidia \
    --framework=tensorrt \
    --category=edge \
    --scenario=Offline \
    --execution_mode=valid \
    --device=cuda \
    --quiet

3dunet_full_error_second_step.log 3dunet_full_error_first_step.log

arjunsuresh commented 3 weeks ago

Hi @Agalakdak The second command you shared is for sdxl but the logs are for 3d-unet. Is SDXL working fine? Let me check 3d-unet at my end.

arjunsuresh commented 3 weeks ago

It's working fine for me.

make preprocess_data BENCHMARKS='3d-unet'
/home/cmuser/CM/repos/local/cache/4ea5dceee2464cb7/repo/closed/NVIDIA/code/3d-unet/tensorrt/preprocess_data.py:37: DeprecationWarning: Please use `zoom` from the `scipy.ndimage` namespace, the `scipy.ndimage.interpolation` namespace is deprecated.
  from scipy.ndimage.interpolation import zoom
Preprocessing /home/cmuser/CM/repos/local/cache/b1f8faeaa7384886/data/KiTS19/kits19/data...
Saved /home/cmuser/CM/repos/local/cache/b1f8faeaa7384886/preprocessed_data/KiTS19/reference/case_00012.pkl -- shape (1, 256, 320, 320) mean [-1.8] std [1.05]
Saved /home/cmuser/CM/repos/local/cache/b1f8faeaa7384886/preprocessed_data/KiTS19/reference/case_00044.pkl -- shape (1, 320, 384, 384) mean [-1.86] std [1.05]
Saved /home/cmuser/CM/repos/local/cache/b1f8faeaa7384886/preprocessed_data/KiTS19/reference/case_00024.pkl -- shape (1, 256, 256, 256) mean [-1.66] std [1.17]
...

What I suspect is a failure in the download of the kits19 dataset, as the NVIDIA script below skips the redownload if the file already exists, without checking its validity.

https://github.com/mlcommons/inference_results_v4.0/blob/main/closed/NVIDIA/code/3d-unet/tensorrt/download_data.sh#L20

The command below will give you the path to the NVIDIA scratch space where the data gets downloaded. You can manually remove the kits19 data directory from there and then retry the command.

cm run script "get mlperf inference nvidia scratch space _version.4_0" -j
arjunsuresh commented 3 weeks ago

Meanwhile, the kits19 download is slow and can take several hours to complete.

Agalakdak commented 2 weeks ago

Hello @arjunsuresh. I think I figured out what the problem is. It's a network issue... Below is the log:

case_00299: 100%|██████████▉| 280551/280552 [00:48<00:00, 5728.77KB/s]
Duplicating KITS19 case_00185 as case_00400...
~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA
Done.
Downloading JSON files describing subset used for inference/calibration...
--2024-09-17 02:52:31--  https://raw.githubusercontent.com/mlcommons/inference/486a629ea4d5c5150f452d0b0a196bf71fd2021e
Connecting to my_proxy_ip:8080... connected.
Proxy request sent, awaiting response... 400 Bad Request
2024-09-17 02:52:32 ERROR 400: Bad Request.

--2024-09-17 02:52:32--  http://92dd3d24cf78d07aa31165f90c636d98c4adddcd/vision/medical_imaging/3d-unet-kits19/meta/inference_cases.json
Connecting to my_proxy_ip:8080... connected.
Proxy request sent, awaiting response... 404 No such domain
2024-09-17 02:52:32 ERROR 404: No such domain.

--2024-09-17 02:52:32--  https://raw.githubusercontent.com/mlcommons/inference/486a629ea4d5c5150f452d0b0a196bf71fd2021e
Connecting to my_proxy_ip:8080... connected.
Proxy request sent, awaiting response... 400 Bad Request
2024-09-17 02:52:32 ERROR 400: Bad Request.

--2024-09-17 02:52:32--  http://92dd3d24cf78d07aa31165f90c636d98c4adddcd/vision/medical_imaging/3d-unet-kits19/meta/calibration_cases.json
Connecting to my_proxy_ip:8080... connected.
Proxy request sent, awaiting response... 404 No such domain
2024-09-17 02:52:33 ERROR 404: No such domain.

Done. Finished downloading all the datasets!
/home/cmuser/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA/code/3d-unet/tensorrt/preprocess_data.py:37: DeprecationWarning: Please use `zoom` from the `scipy.ndimage` namespace, the `scipy.ndimage.interpolation` namespace is deprecated.
  from scipy.ndimage.interpolation import zoom
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/cmuser/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA/code/3d-unet/tensorrt/preprocess_data.py", line 858, in <module>
    main()
  File "/home/cmuser/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA/code/3d-unet/tensorrt/preprocess_data.py", line 842, in main
    kits19tool = KITS19Tool(args)
  File "/home/cmuser/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA/code/3d-unet/tensorrt/preprocess_data.py", line 117, in __init__
    self.INFER_CASES = json.load(open(self.INFERENCE_CASE_FILE))
  File "/usr/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
make: *** [/home/cmuser/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA/Makefile.data:36: preprocess_data] Error 1

CM error: Portable CM script failed (name = app-mlperf-inference-nvidia, return code = 256)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Note that it is often a portability issue of a third-party tool or a native script wrapped and unified by this CM script (automation recipe). Please re-run this script with --repro flag and report this issue with the original command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts to make existing tools and native scripts more portable, interoperable and deterministic. Thank you!

arjunsuresh commented 2 weeks ago

@Agalakdak Actually that looks like a problem with the download script where it is creating invalid URLs. It probably worked fine for me because some of the downloaded files were already present. We'll fix this issue in the script.

Agalakdak commented 2 weeks ago

@arjunsuresh If you need more information about my system, please let me know.

arjunsuresh commented 2 weeks ago

Hi @Agalakdak Can you please do this (inside the container)

cd ~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA
git pull
cm rm cache --tags=_download_data -f

And retry the command?

Agalakdak commented 2 weeks ago

@arjunsuresh Hi, I tried the advice above. It didn't help. There aren't many logs, so I just duplicated them below.

cmuser@ccaeef79d72e:~$ cd ~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA
cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$ git pull
remote: Enumerating objects: 21, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 13 (delta 10), reused 0 (delta 0), pack-reused 0 (from 0)
Unpacking objects: 100% (13/13), 2.32 KiB | 339.00 KiB/s, done.
From https://github.com/GATEOverflow/inference_results_v4.0
   c032f835c..7abca22ba  main -> origin/main
Updating c032f835c..7abca22ba
Fast-forward
 closed/NVIDIA/Makefile.build                         | 2 --
 closed/NVIDIA/code/3d-unet/tensorrt/download_data.sh | 4 ++--
 2 files changed, 2 insertions(+), 4 deletions(-)
cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$ cm rm cache --tags=_download_data -f

CM error: artifact(s) not found!
cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$

...

cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$ source ~/cm-venv/bin/activate
(cm-venv) cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$ cm rm cache --tags=_download_data -f

CM error: artifact(s) not found!

arjunsuresh commented 2 weeks ago

Hi @Agalakdak Can you please try cm rm cache --tags=_preprocess_data -f instead?

Agalakdak commented 2 weeks ago

@arjunsuresh Hi, I tried to run the commands in this order:

1) cd ~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA
2) git pull
3) cm rm cache --tags=_download_data -f

And on the 3rd step I got the error "cm: command not found".

I tried running "cm rm cache --tags=_preprocess_data -f" right after entering the container. The command completed successfully, but it did not have any visible effect.

unet_error.log

arjunsuresh commented 2 weeks ago

Can you retry the original command? No need to do command number 3.

Agalakdak commented 2 weeks ago

@arjunsuresh Hi, I am constantly busy with all sorts of tasks, so it is not always possible to promptly collect the necessary logs. The log below is:

1) Running the command (and getting an error):

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
    --model=3d-unet-99 \
    --implementation=nvidia \
    --framework=tensorrt \
    --category=edge \
    --scenario=Offline \
    --execution_mode=test \
    --device=cuda \
    --docker --quiet \
    --test_query_count=50

2) Running the command cd ~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA

3) Running the command git pull

4) Running the command

cm run script --tags=run-mlperf,inference,_r4.1-dev \
    --model=3d-unet-99 \
    --implementation=nvidia \
    --framework=tensorrt \
    --category=edge \
    --scenario=Offline \
    --execution_mode=valid \
    --device=cuda \
    --quiet

(and getting an error)

Full log: unet_19_09_error.log

arjunsuresh commented 2 weeks ago

No worries. I have added some extra checks for existing stale files. Can you please do cm pull repo and just repeat the 4th command (both inside the container)?
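For clarity, a sketch of that sequence, with both commands run inside the container (the second command is the 4th command copied verbatim from the previous message):

cm pull repo
cm run script --tags=run-mlperf,inference,_r4.1-dev --model=3d-unet-99 --implementation=nvidia --framework=tensorrt --category=edge --scenario=Offline --execution_mode=valid --device=cuda --quiet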

Agalakdak commented 1 week ago

Hi @arjunsuresh, I repeated all the commands as I did above and got the same result.

unet_24_09_error.log