mlcommons / inference_results_v2.0

This repository contains the results and code for the MLPerf™ Inference v2.0 benchmark.
https://mlcommons.org/en/inference-datacenter-20/
Apache License 2.0

Missing 3d-unet data files for NVIDIA/ #8

Closed mahmoodn closed 2 years ago

mahmoodn commented 2 years ago

Hi, it seems that the download_data and download_model steps for 3d-unet in the NVIDIA folder do not fetch everything that is needed. As you can see below, all data and model files appear to download successfully:

Data:

(mlperf) mahmood@mlperf-inference-mahmood-x86_64-1654782240:/work$ make download_data BENCHMARKS="3d-unet-99"
Valid KITS RAW data set found in /disk1/scratch/data/KiTS19/kits19/data/, skipping download.
Downloading JSON files describing subset used for inference/calibration...
--2022-06-09 21:27:00--  https://raw.githubusercontent.com/mlcommons/inference/master/vision/medical_imaging/3d-unet-kits19/meta/inference_cases.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 776 [text/plain]
Saving to: ‘/disk1/scratch/data/KiTS19/inference_cases.json’

/disk1/scratch/data/KiTS1 100%[==================================>]     776  --.-KB/s    in 0s      

2022-06-09 21:27:00 (81.2 MB/s) - ‘/disk1/scratch/data/KiTS19/inference_cases.json’ saved [776/776]

--2022-06-09 21:27:00--  https://raw.githubusercontent.com/mlcommons/inference/master/vision/medical_imaging/3d-unet-kits19/meta/calibration_cases.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 362 [text/plain]
Saving to: ‘/disk1/scratch/data/KiTS19/calibration_cases.json’

/disk1/scratch/data/KiTS1 100%[==================================>]     362  --.-KB/s    in 0s      

2022-06-09 21:27:00 (23.6 MB/s) - ‘/disk1/scratch/data/KiTS19/calibration_cases.json’ saved [362/362]

Done.
Finished downloading all the datasets!

Model:

(mlperf) mahmood@mlperf-inference-mahmood-x86_64-1654782240:/work$ make download_model BENCHMARKS="3d-unet-99"
bash code/3d-unet-99/tensorrt/download_model.sh && \
    echo "Finished downloading all the models!"
Downloading 3d-unet-kits19 models...
--2022-06-09 21:24:57--  https://zenodo.org/record/5597155/files/3dunet_kits19_128x128x128_dynbatch.onnx?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 124769203 (119M) [application/octet-stream]
Saving to: ‘/disk1/scratch/models/3d-unet-kits19/3dUNetKiTS19.onnx’

/disk1/scratch/models/3d- 100%[==================================>] 118.99M  9.34MB/s    in 12s     

2022-06-09 21:25:10 (9.67 MB/s) - ‘/disk1/scratch/models/3d-unet-kits19/3dUNetKiTS19.onnx’ saved [124769203/124769203]

Saved 3d-unet-kits19 models to /disk1/scratch/models/3d-unet-kits19/3dUNetKiTS19.onnx!
Finished downloading all the models!

However, the preprocess script needs case_00400, which is not present:

(mlperf) mahmood@mlperf-inference-mahmood-x86_64-1654782240:/work$ make preprocess_data BENCHMARKS="3d-unet"
Preprocessing /disk1/scratch/data/KiTS19/kits19/data...
Traceback (most recent call last):
  File "code/3d-unet/tensorrt/preprocess_data.py", line 886, in <module>
    main()
  File "code/3d-unet/tensorrt/preprocess_data.py", line 873, in main
    preprocess_kits19_raw_data(kits19tool)
  File "code/3d-unet/tensorrt/preprocess_data.py", line 502, in preprocess_kits19_raw_data
    preprocess_ref_with_multiproc(kits19tool)
  File "code/3d-unet/tensorrt/preprocess_data.py", line 481, in preprocess_ref_with_multiproc
    cases = preproc.collect_cases()
  File "code/3d-unet/tensorrt/preprocess_data.py", line 255, in collect_cases
    assert collected_set == target_set,\
AssertionError: Some of the target inference cases were NOT found: {'case_00400'}
make: *** [Makefile:464: preprocess_data] Error 1

In fact, there is no such case. The cases are numbered from case_00000 to case_00299.
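For reference, a quick way to confirm the available case range (path taken from the download log above; adjust if your data lives elsewhere):

ls /disk1/scratch/data/KiTS19/kits19/data | grep '^case_' | sort | tail -n 3   # should end at case_00299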

nv-jinhosuh commented 2 years ago

Hi,

There is a change targeting v2.1, where one sample is duplicated to address Early-Stopping-related skewed latency. There was a minor bug in the timing of duplicating the sample, i.e. the script tried to duplicate it while downloading the dataset. This bug has been fixed and recently merged: https://github.com/mlcommons/inference/pull/1160

That being said, this is for the v2.1 submission. If you are reproducing the results for v2.0, please clone mlcommons/inference for v2.0 -- https://github.com/mlcommons/inference/tree/r2.0. This branch should not have the above ES-related change.
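For example, cloning that branch directly would look something like this (a minimal sketch; the branch name comes from the link above):

git clone --branch r2.0 https://github.com/mlcommons/inference.git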

mahmoodn commented 2 years ago

Thanks for the reply. I am using NVIDIA's v2.0 folder (here). I didn't find the Makefile you mentioned in the issues. It seems that mlcommons/inference is different from closed/NVIDIA, which I am using.

May I know if there is a way to fix the download_data.sh or download_model.sh (here)?

mahmoodn commented 2 years ago

According to get_imaging.py, it iterates from 0 to 299. Then in preprocess_data.py, there is an assertion that says

assert collected_set == target_set

I expect that collected_set goes up to 300 while target_set goes up to 400. If I could change target_set to 300, I guess it would be fine, but I wasn't able to find where it is defined.
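Since the target list presumably comes from the downloaded inference_cases.json, one quick check (path from the earlier download log) is whether the offending case is listed there:

grep case_00400 /disk1/scratch/data/KiTS19/inference_cases.json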

nv-jinhosuh commented 2 years ago

Hi @mahmoodn So, as in this PR for v2.1: https://github.com/mlcommons/inference/pull/1147/files we are duplicating sample case_00185 to case_00400, and as you said, the original KiTS19 doesn't have cases with IDs above 300. This makes me believe your repo is somehow out of sync; mlcommons/inference and NVIDIA's repo both need to be for v2.0. If you run make build from NVIDIA's repo, I expect it to clone the correct inference snapshot as in: https://github.com/mlcommons/inference_results_v2.0/blob/master/closed/NVIDIA/Makefile#L98
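One way to sanity-check the sync (assuming make build clones the inference snapshot into build/inference) is to compare the pinned hash in the Makefile with the checked-out commit:

grep "INFERENCE_HASH" Makefile
git -C build/inference rev-parse HEAD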

Anyway, if you want a quick-and-dirty approach here, you can navigate to the build/preprocessed_data directory (the KiTS19 part of it) and, if you find case_00400, delete it.
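Roughly along these lines (inspect the matches before deleting; the preprocessed_data path is an assumption based on the layout mentioned above):

find build/preprocessed_data -name 'case_00400*'                              # inspect first
find build/preprocessed_data -depth -name 'case_00400*' -exec rm -rf {} +     # then delete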

BTW, since I am not sure exactly what causes the above assertion, could you please check what numbers are reported for collected_set and target_set? I suspect collected_set is 43 whereas target_set is 42...

mahmoodn commented 2 years ago

I ran make build and saw this in the output:

(mlperf) mnaderan@mlperf-inference-mnaderan-x86_64:/work$ make build
Updating Loadgen
remote: Enumerating objects: 38, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 38 (delta 12), reused 23 (delta 8), pack-reused 0
Unpacking objects: 100% (38/38), 33.80 KiB | 3.75 MiB/s, done.
From https://github.com/mlcommons/inference
   1483c8f..b284212  master     -> origin/master
HEAD is now at c7ba22c Fix wrong scenario list in submission checker for sysystems submitted to both Datacenter and Edge (#1107)
Updating Power-Dev repo
HEAD is now at f7b3b51 Merge r1.1 changes into master (#261)
remote: Enumerating objects: 491, done.
remote: Counting objects: 100% (491/491), done.
remote: Compressing objects: 100% (209/209), done.
remote: Total 491 (delta 307), reused 430 (delta 270), pack-reused 0
Receiving objects: 100% (491/491), 242.28 KiB | 2.75 MiB/s, done.
Resolving deltas: 100% (307/307), completed with 24 local objects.
From https://github.com/triton-inference-server/server
 * [new branch]        dyas-capi-queue         -> origin/dyas-capi-queue
 * [new branch]        dyas-fix-tag            -> origin/dyas-fix-tag
 * [new branch]        dyas-log-id             -> origin/dyas-log-id
 * [new branch]        dyas-tag-fix            -> origin/dyas-tag-fix
   bf18477f..359ebd83  gluo-ci                 -> origin/gluo-ci
 * [new branch]        gluo-count              -> origin/gluo-count
 + a41204be...9540f33c gluo-stress             -> origin/gluo-stress  (forced update)
 * [new branch]        imant-decoupled-stats   -> origin/imant-decoupled-stats
 * [new branch]        imant-fix-bug           -> origin/imant-fix-bug
 * [new branch]        imant-throughput-latency -> origin/imant-throughput-latency
 * [new branch]        imant-warmup            -> origin/imant-warmup
 + deae1a35...a74cfd2d kmcgill-cpu-instance    -> origin/kmcgill-cpu-instance  (forced update)
   f9fe6fcc..4487c83e  kmcgill-main            -> origin/kmcgill-main
 * [new branch]        krish-bls               -> origin/krish-bls
   137bc903..12a6c20c  main                    -> origin/main
 * [new branch]        mchornyi-r22.06         -> origin/mchornyi-r22.06
 * [new branch]        mchornyi-tests-only     -> origin/mchornyi-tests-only
 + 067f765a...b5e616b8 mchornyi-windows        -> origin/mchornyi-windows  (forced update)
 * [new branch]        r22.06                  -> origin/r22.06
 * [new branch]        rmccormick-load-pool    -> origin/rmccormick-load-pool
 * [new branch]        rmccormick-load-pool-nb -> origin/rmccormick-load-pool-nb
 * [new branch]        tanmayv-doc             -> origin/tanmayv-doc
 * [new branch]        tanmayv-p0              -> origin/tanmayv-p0
fatal: ambiguous argument 'origin/master...mlperf-inference-v2.0': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
Already on 'mlperf-inference-v2.0'
Your branch is up to date with 'origin/mlperf-inference-v2.0'.
make[1]: Entering directory '/work'
Building TensorRT Inference Server...
# Required till triton build.py properly supports incremental builds
platform linux
machine x86_64
version 2.19.0dev
default repo-tag: main
backend "tensorrt" at tag/branch "pull/21/head"
Building Triton Inference Server
component "common" at tag/branch "mlperf-inference-v2.0"
component "core" at tag/branch "mlperf-inference-v2.0"
component "backend" at tag/branch "mlperf-inference-v2.0"
component "thirdparty" at tag/branch "mlperf-inference-v2.0"
-- Configuring done
-- Generating done
-- Build files have been written to: /work/build/triton-inference-server/out/tritonserver/build
make[2]: Entering directory '/work/build/triton-inference-server/out/tritonserver/build'
/usr/local/lib/python3.8/dist-packages/cmake/data/bin/cmake -S/work/build/triton-inference-server/build -B/work/build/triton-inference-server/out/tritonserver/build --check-build-system CMakeFiles/Makefile.cmake 0
make  -f CMakeFiles/Makefile2 server
make[3]: Entering directory '/work/build/triton-inference-server/out/tritonserver/build'
/usr/local/lib/python3.8/dist-packages/cmake/data/bin/cmake -S/work/build/triton-inference-server/build -B/work/build/triton-inference-server/out/tritonserver/build --check-build-system CMakeFiles/Makefile.cmake 0
/usr/local/lib/python3.8/dist-packages/cmake/data/bin/cmake -E cmake_progress_start /work/build/triton-inference-server/out/tritonserver/build/CMakeFiles 21
make  -f CMakeFiles/Makefile2 CMakeFiles/server.dir/all
...

And then

(mlperf) mnaderan@mlperf-inference-mnaderan-x86_64:/work$ make preprocess_data BENCHMARKS="3d-unet"
Preprocessing /disk1/scratch/data/KiTS19/kits19/data...
Traceback (most recent call last):
  File "code/3d-unet/tensorrt/preprocess_data.py", line 886, in <module>
    main()
  File "code/3d-unet/tensorrt/preprocess_data.py", line 873, in main
    preprocess_kits19_raw_data(kits19tool)
  File "code/3d-unet/tensorrt/preprocess_data.py", line 502, in preprocess_kits19_raw_data
    preprocess_ref_with_multiproc(kits19tool)
  File "code/3d-unet/tensorrt/preprocess_data.py", line 481, in preprocess_ref_with_multiproc
    cases = preproc.collect_cases()
  File "code/3d-unet/tensorrt/preprocess_data.py", line 255, in collect_cases
    assert collected_set == target_set,\
AssertionError: Some of the target inference cases were NOT found: {'case_00400'}
make: *** [Makefile:464: preprocess_data] Error 1
(mlperf) mnaderan@mlperf-inference-mnaderan-x86_64:/work$ grep INFERENCE_HASH Makefile 
INFERENCE_HASH = c7ba22c1b2f918c6c9f2fbd7db17407fdb9d6e21
        && git checkout $(INFERENCE_HASH) \

I also checked the following paths:

(mlperf) mnaderan@mlperf-inference-mnaderan-x86_64:/work$ ls build/preprocessed_data/KiTS19/
reference
(mlperf) mnaderan@mlperf-inference-mnaderan-x86_64:/work$ ls build/preprocessed_data/KiTS19/reference/
(mlperf) mnaderan@mlperf-inference-mnaderan-x86_64:/work$ 

I also modified the code around the assertion:

print("collected_set=", collected_set)
print("target_set=", target_set)
assert collected_set == target_set,\

and the output is

(mlperf) mnaderan@mlperf-inference-mnaderan-x86_64:/work$ make preprocess_data BENCHMARKS="3d-unet"
Preprocessing /disk1/scratch/data/KiTS19/kits19/data...
collected_set= {'case_00169', 'case_00076', 'case_00117', 'case_00176', 'case_00000', 'case_00052', 'case_00185', 'case_00049', 'case_00206', 'case_00130', 'case_00150', 'case_00196', 'case_00187', 'case_00189', 'case_00162', 'case_00070', 'case_00144', 'case_00160', 'case_00081', 'case_00203', 'case_00084', 'case_00207', 'case_00065', 'case_00066', 'case_00092', 'case_00172', 'case_00044', 'case_00086', 'case_00041', 'case_00006', 'case_00112', 'case_00073', 'case_00100', 'case_00171', 'case_00012', 'case_00047', 'case_00183', 'case_00087', 'case_00053', 'case_00141', 'case_00138', 'case_00128', 'case_00024', 'case_00034', 'case_00161', 'case_00056', 'case_00005', 'case_00080', 'case_00149', 'case_00078', 'case_00091', 'case_00136', 'case_00148', 'case_00111', 'case_00147', 'case_00003', 'case_00050', 'case_00090', 'case_00061', 'case_00125', 'case_00198', 'case_00157'}
target_set= {'case_00169', 'case_00076', 'case_00117', 'case_00176', 'case_00000', 'case_00052', 'case_00185', 'case_00049', 'case_00206', 'case_00130', 'case_00150', 'case_00196', 'case_00187', 'case_00189', 'case_00162', 'case_00070', 'case_00144', 'case_00160', 'case_00081', 'case_00203', 'case_00084', 'case_00207', 'case_00065', 'case_00066', 'case_00092', 'case_00172', 'case_00044', 'case_00086', 'case_00041', 'case_00006', 'case_00112', 'case_00073', 'case_00100', 'case_00171', 'case_00012', 'case_00047', 'case_00183', 'case_00087', 'case_00053', 'case_00141', 'case_00138', 'case_00128', 'case_00024', 'case_00034', 'case_00161', 'case_00056', 'case_00005', 'case_00080', 'case_00149', 'case_00078', 'case_00091', 'case_00136', 'case_00148', 'case_00111', 'case_00147', 'case_00003', 'case_00050', 'case_00090', 'case_00061', 'case_00400', 'case_00125', 'case_00198', 'case_00157'}
Traceback (most recent call last):
  File "code/3d-unet/tensorrt/preprocess_data.py", line 888, in <module>
    main()
...
mahmoodn commented 2 years ago

One more thing. I see the Makefile you mentioned before at inference_results_v2.0/closed/NVIDIA/build/inference/vision/medical_imaging/3d-unet-kits19/Makefile, but I don't know whether editing that file based on the patch will affect the "make preprocess_data" command, or whether I have to rebuild.

mahmoodn commented 2 years ago

Anyway, I removed "case_00400" from inference_cases.json and the error disappeared. I don't know whether that is the proper fix, but I guess even with the default code it is possible to modify inference_cases.json and remove the unwanted case.
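For anyone wanting to script that edit, something like the following should work, assuming inference_cases.json is a flat JSON array of case IDs and jq is installed (path from the earlier download log):

jq 'map(select(. != "case_00400"))' /disk1/scratch/data/KiTS19/inference_cases.json > /tmp/inference_cases.json && \
    mv /tmp/inference_cases.json /disk1/scratch/data/KiTS19/inference_cases.json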

nv-jinhosuh commented 2 years ago

Thank you @mahmoodn, I think I see what the problem is now. :) inference_cases.json is obtained from mlcommons/inference master, not from the pinned inference commit tag. I should think about fixing this in NVIDIA's code. Sorry for the inconvenience, and glad you were able to make forward progress.

nv-jinhosuh commented 2 years ago

@mahmoodn FYI, and for anyone hitting the same issue: please change closed/NVIDIA/code/3d-unet/tensorrt/download_data.sh as below and it should behave consistently. Sorry for the inconvenience. Run it in the /work dir inside Docker, or in closed/NVIDIA otherwise.

#!/bin/bash
# Copyright (c) 2022, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

DATA_DIR=${DATA_DIR:-build/data}
KITS_RAW_DIR=${DATA_DIR}/KiTS19/kits19/data
INFERENCE_HASH=${INFERENCE_HASH:-`grep "INFERENCE_HASH =" Makefile | sed "s/.*= //"`}

if [ ! -s ${KITS_RAW_DIR}/case_00137/imaging.nii.gz ]
then
    echo "Cloning KITS19 repo and download RAW data into ${KITS_RAW_DIR}..." && \
    pushd ${DATA_DIR} &&\
    rm -Rf KiTS19 &&\
    mkdir -p KiTS19 &&\
    cd KiTS19 &&\
    git clone https://github.com/neheller/kits19 &&\
    cd kits19 &&\
    pip3 install -r requirements.txt &&\
    python3 -m starter_code.get_imaging &&\
    popd &&\
    echo "Done."
else
    echo "Valid KITS RAW data set found in ${KITS_RAW_DIR}/, skipping download."
fi

sleep 0.1

echo "Downloading JSON files describing subset used for inference/calibration..."
wget https://raw.githubusercontent.com/mlcommons/inference/${INFERENCE_HASH}/vision/medical_imaging/3d-unet-kits19/meta/inference_cases.json -O ${DATA_DIR}/KiTS19/inference_cases.json
wget https://raw.githubusercontent.com/mlcommons/inference/${INFERENCE_HASH}/vision/medical_imaging/3d-unet-kits19/meta/calibration_cases.json -O ${DATA_DIR}/KiTS19/calibration_cases.json
echo "Done."