Bob123Yang opened 2 months ago
Unable to establish SSL connection.
Are you behind some proxy?
I didn't use any proxy myself, but I'm on my company's internal network, which I think is quite common, right?
Could you give some suggestions for this situation? Thanks.
Are you able to do
wget https://www.dropbox.com/s/92n2fyej3lzy3s3/caffe_ilsvrc12.tar.gz
on your shell?
I'm not next to the machine right now; I'll try it later.
(cm) tomcat@tomcat-Dove-Product:~$ wget https://www.dropbox.com/s/92n2fyej3lzy3s3/caffe_ilsvrc12.tar.gz
--2024-09-11 08:35:34--  https://www.dropbox.com/s/92n2fyej3lzy3s3/caffe_ilsvrc12.tar.gz
Resolving www.dropbox.com (www.dropbox.com)... 31.13.94.37, 2a03:2880:f11f:83:face:b00c:0:25de
Connecting to www.dropbox.com (www.dropbox.com)|31.13.94.37|:443... failed: Connection timed out.
Connecting to www.dropbox.com (www.dropbox.com)|2a03:2880:f11f:83:face:b00c:0:25de|:443... failed: Network is unreachable.
(cm) tomcat@tomcat-Lenovo-Product:~$
@arjunsuresh Is there any other method to prepare the package “caffe_ilsvrc12.tar.gz” for docker instead of downloading it from www.dropbox.com?
@Bob123Yang yes, we can find a way. But since this is not the only download in the workflow, it would be good to know what is happening. Are dropbox URLs blocked in your network? Are all other URLs expected to work?
Yeah, it looks like dropbox URLs are blocked here and the others seem fine.
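For anyone hitting the same situation, a quick way to confirm which hosts are reachable from the shell is something like the following; the proxy-variable check is only an assumption about typical corporate setups, not something from this thread:

env | grep -i proxy
wget --spider --timeout=15 https://www.dropbox.com
wget --spider --timeout=15 https://github.com
wget --spider --timeout=15 https://zenodo.org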
So what can I do to get around this problem? I really don't want to be stopped by a download...
That's great. We have now added backup URL support in CM. Can you please do cm pull repo and retry? For the docker run, please add the --docker_cache=no option to pull the latest changes. A minimal sketch of the full sequence is below.
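For reference, a sketch of the sequence (the run command is the same one used later in this thread; --docker_cache=no is the only addition):

cm pull repo
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
    --model=resnet50 --implementation=nvidia --framework=tensorrt \
    --category=edge --scenario=Offline --execution_mode=test \
    --device=cuda --docker --quiet --test_query_count=1000 \
    --docker_cache=no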
@arjunsuresh I tried several times following your guide; please help review the log below. There seems to be no download problem anymore, but it fails at the clone every time at step 14/14.
Alias: mlcommons@cm4mlops
Local path: /home/tomcat/CM/repos/mlcommons@cm4mlops
git pull
remote: Enumerating objects: 161, done.
remote: Counting objects: 100% (161/161), done.
remote: Compressing objects: 100% (67/67), done.
remote: Total 161 (delta 107), reused 142 (delta 94), pack-reused 0 (from 0)
Receiving objects: 100% (161/161), 66.41 KiB | 800.00 KiB/s, done.
Resolving deltas: 100% (107/107), completed with 14 local objects.
From https://github.com/mlcommons/cm4mlops
   be6b63f57..6ce857cab  mlperf-inference -> origin/mlperf-inference
CM alias for this repository: mlcommons@cm4mlops
Reindexing all CM artifacts. Can take some time ... Took 0.6 sec.
INFO:root:* cm run script "run-mlperf inference _find-performance _full _r4.1-dev"
INFO:root:  * cm run script "get mlcommons inference src"
INFO:root:       ! load /home/tomcat/CM/repos/local/cache/c0c2d4df519a416f/cm-cached-state.json
INFO:root:  * cm run script "install pip-package for-cmind-python _package.tabulate"
INFO:root:       ! load /home/tomcat/CM/repos/local/cache/2a4f3deecef34560/cm-cached-state.json
INFO:root:  * cm run script "get mlperf inference utils"
INFO:root:    * cm run script "get mlperf inference src"
INFO:root:         ! load /home/tomcat/CM/repos/local/cache/c0c2d4df519a416f/cm-cached-state.json
INFO:root:       ! call "postprocess" from /home/tomcat/CM/repos/mlcommons@cm4mlops/script/get-mlperf-inference-utils/customize.py
Using MLCommons Inference source from /home/tomcat/CM/repos/local/cache/91cad0cc764a49d3/inference
Running loadgen scenario: Offline and mode: performance
INFO:root:* cm run script "build dockerfile"
Dockerfile generated at /home/tomcat/CM/repos/mlcommons@cm4mlops/script/app-mlperf-inference/dockerfiles/mlperf-inference:mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public.Dockerfile
INFO:root:  * cm run script "get docker"
INFO:root:       ! load /home/tomcat/CM/repos/local/cache/1c757c4f3d3e4a06/cm-cached-state.json
INFO:root:  * cm run script "get mlperf inference results dir local"
INFO:root:       ! load /home/tomcat/CM/repos/local/cache/966e187bf39a46c8/cm-cached-state.json
INFO:root:  * cm run script "get mlperf inference submission dir local"
INFO:root:       ! load /home/tomcat/CM/repos/local/cache/e880b27a4cf14bc8/cm-cached-state.json
INFO:root:  * cm run script "get dataset imagenet validation original _full"
INFO:root:       ! load /home/tomcat/CM/repos/local/cache/87a60fb1d8344aeb/cm-cached-state.json
INFO:root:  * cm run script "get nvidia-docker"
INFO:root:       ! load /home/tomcat/CM/repos/local/cache/f925db34327f4882/cm-cached-state.json
INFO:root:  * cm run script "get mlperf inference nvidia scratch space"
INFO:root:       ! load /home/tomcat/CM/repos/local/cache/0cf60773c3484f98/cm-cached-state.json
CM command line regenerated to be used inside Docker:
cm run script --tags=app,mlperf,inference,generic,_nvidia,_resnet50,_tensorrt,_cuda,_test,_r4.1-dev_default,_offline --quiet=true --env.CM_QUIET=yes --env.CM_MLPERF_IMPLEMENTATION=nvidia --env.CM_MLPERF_MODEL=resnet50 --env.CM_MLPERF_RUN_STYLE=test --env.CM_MLPERF_SUBMISSION_SYSTEM_TYPE=edge --env.CM_MLPERF_DEVICE=cuda --env.CM_MLPERF_USE_DOCKER=True --env.CM_MLPERF_BACKEND=tensorrt --env.CM_MLPERF_LOADGEN_SCENARIO=Offline --env.CM_TEST_QUERY_COUNT=1000 --env.CM_MLPERF_FIND_PERFORMANCE_MODE=yes --env.CM_MLPERF_LOADGEN_ALL_MODES=no --env.CM_MLPERF_LOADGEN_MODE=performance --env.CM_MLPERF_RESULT_PUSH_TO_GITHUB=False --env.CM_MLPERF_SUBMISSION_GENERATION_STYLE=full --env.CM_MLPERF_SKIP_SUBMISSION_GENERATION=yes --env.CM_MLPERF_INFERENCE_VERSION=4.1-dev --env.CM_RUN_MLPERF_INFERENCE_APP_DEFAULTS=r4.1-dev_default --env.CM_MLPERF_LAST_RELEASE=v4.0 --env.CM_TMP_CURRENT_PATH=/home/tomcat --env.CM_TMP_PIP_VERSION_STRING= --env.CM_MODEL=resnet50 --env.CM_MLPERF_LOADGEN_COMPLIANCE=no --env.CM_MLPERF_LOADGEN_EXTRA_OPTIONS= --env.CM_MLPERF_LOADGEN_SCENARIOS,=Offline --env.CM_MLPERF_LOADGEN_MODES,=performance --env.CM_OUTPUT_FOLDER_NAME=test_results --add_deps_recursive.coco2014-original.tags=_full --add_deps_recursive.coco2014-preprocessed.tags=_full --add_deps_recursive.imagenet-original.tags=_full --add_deps_recursive.imagenet-preprocessed.tags=_full --add_deps_recursive.openimages-original.tags=_full --add_deps_recursive.openimages-preprocessed.tags=_full --add_deps_recursive.openorca-original.tags=_full --add_deps_recursive.openorca-preprocessed.tags=_full --v=False --print_env=False --print_deps=False --dump_version_info=True --env.CM_DATASET_IMAGENET_PATH=/home/cmuser/CM/repos/local/cache/87a60fb1d8344aeb/imagenet-2012-val --env.CM_MLPERF_INFERENCE_RESULTS_DIR=/home/cmuser/CM/repos/local/cache/966e187bf39a46c8 --env.CM_MLPERF_INFERENCE_SUBMISSION_DIR=/home/cmuser/CM/repos/local/cache/e880b27a4cf14bc8/mlperf-inference-submission --env.MLPERF_SCRATCH_PATH=/home/cmuser/CM/repos/local/cache/0cf60773c3484f98 --docker_run_deps
INFO:root:* cm run script "run docker container"
Checking Docker images:
docker images -q local/cm-script-app-mlperf-inference:ubuntu-20.04-latest 2> /dev/null
INFO:root: * cm run script "build docker image"
CM generated the following Docker build command:
docker build --no-cache --build-arg GID=\" $(id -g $USER) \" --build-arg UID=\" $(id -u $USER) \" -f "/home/tomcat/CM/repos/mlcommons@cm4mlops/script/app-mlperf-inference/dockerfiles/mlperf-inference:mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public.Dockerfile" -t "local/cm-script-app-mlperf-inference:ubuntu-20.04-latest" .
INFO:root:         ! cd /home/tomcat/CM/repos/mlcommons@cm4mlops/script/app-mlperf-inference/dockerfiles
INFO:root:         ! call /home/tomcat/CM/repos/mlcommons@cm4mlops/script/build-docker-image/run.sh from tmp-run.sh
[+] Building 28772.6s (17/17) FINISHED                                                              docker:rootless
 => [internal] load build definition from mlperf-inference:mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public.Dockerfil  0.0s
 => => transferring dockerfile: 3.03kB  0.0s
 => WARN: SecretsUsedInArgOrEnv: Do not use ARG or ENV instructions for sensitive data (ARG "CM_GH_TOKEN") (line 14)  0.0s
 => [internal] load metadata for nvcr.io/nvidia/mlperf/mlperf-inference:mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-pub  0.0s
 => [internal] load .dockerignore  0.0s
 => => transferring context: 45B  0.0s
 => CACHED [ 1/14] FROM nvcr.io/nvidia/mlperf/mlperf-inference:mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public  0.0s
 => [ 2/14] RUN apt-get update -y  39.5s
 => [ 3/14] RUN apt-get install -y python3 python3-pip git sudo wget python3-venv  79.4s
 => [ 4/14] RUN ln -snf /usr/share/zoneinfo/US/Pacific /etc/localtime && echo US/Pacific >/etc/timezone  0.2s
 => [ 5/14] RUN groupadd -g 1001 -o cm  0.3s
 => [ 6/14] RUN useradd -m -u 1001 -g 1001 -o --create-home --shell /bin/bash cmuser  0.3s
 => [ 7/14] RUN echo "cmuser ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers  0.3s
 => [ 8/14] WORKDIR /home/cmuser  0.0s
 => [ 9/14] RUN python3 -m venv cm-venv  1.5s
 => [10/14] RUN . cm-venv/bin/activate  0.2s
 => [11/14] RUN python3 -m pip install --user cmind requests giturlparse tabulate  25.6s
 => [12/14] RUN cm pull repo mlcommons@cm4mlops --branch=mlperf-inference  29.4s
 => [13/14] RUN cm run script --tags=get,sys-utils-cm --quiet  524.9s
 => CANCELED [14/14] RUN cm run script --tags=app,mlperf,inference,generic,_nvidia,_resnet50,_tensorrt,_cuda,_test,_r4.1  28071.0s
1 warning found (use docker --debug to expand):
CM error: Portable CM script failed (name = build-docker-image, return code = 2)
^C
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script wrapped and unified by this CM script (automation recipe). Please re-run this script with --repro flag and report this issue with the original command line, cm-repro directory and full log here:
https://github.com/mlcommons/cm4mlops/issues
The CM concept is to collaboratively fix such issues inside portable CM scripts to make existing tools and native scripts more portable, interoperable and deterministic. Thank you!
INFO:root:* cm run script "run-mlperf inference _find-performance _full _r4.1-dev"
INFO:root:  * cm run script "get mlcommons inference src"
INFO:root:       ! load /home/tomcat/CM/repos/local/cache/c0c2d4df519a416f/cm-cached-state.json
INFO:root:  * cm run script "install pip-package for-cmind-python _package.tabulate"
INFO:root:       ! load /home/tomcat/CM/repos/local/cache/2a4f3deecef34560/cm-cached-state.json
INFO:root:  * cm run script "get mlperf inference utils"
INFO:root:    * cm run script "get mlperf inference src"
INFO:root:         ! load /home/tomcat/CM/repos/local/cache/c0c2d4df519a416f/cm-cached-state.json
INFO:root:       ! call "postprocess" from /home/tomcat/CM/repos/mlcommons@cm4mlops/script/get-mlperf-inference-utils/customize.py
Using MLCommons Inference source from /home/tomcat/CM/repos/local/cache/91cad0cc764a49d3/inference
Running loadgen scenario: Offline and mode: performance
INFO:root:* cm run script "build dockerfile"
Dockerfile generated at /home/tomcat/CM/repos/mlcommons@cm4mlops/script/app-mlperf-inference/dockerfiles/mlperf-inference:mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public.Dockerfile
INFO:root:  * cm run script "get docker"
INFO:root:       ! load /home/tomcat/CM/repos/local/cache/1c757c4f3d3e4a06/cm-cached-state.json
INFO:root:  * cm run script "get mlperf inference results dir local"
INFO:root:       ! load /home/tomcat/CM/repos/local/cache/966e187bf39a46c8/cm-cached-state.json
INFO:root:  * cm run script "get mlperf inference submission dir local"
INFO:root:       ! load /home/tomcat/CM/repos/local/cache/e880b27a4cf14bc8/cm-cached-state.json
INFO:root:  * cm run script "get dataset imagenet validation original _full"
INFO:root:       ! load /home/tomcat/CM/repos/local/cache/87a60fb1d8344aeb/cm-cached-state.json
INFO:root:  * cm run script "get nvidia-docker"
INFO:root:       ! load /home/tomcat/CM/repos/local/cache/f925db34327f4882/cm-cached-state.json
INFO:root:  * cm run script "get mlperf inference nvidia scratch space"
INFO:root:       ! load /home/tomcat/CM/repos/local/cache/0cf60773c3484f98/cm-cached-state.json
CM command line regenerated to be used inside Docker:
cm run script --tags=app,mlperf,inference,generic,_nvidia,_resnet50,_tensorrt,_cuda,_test,_r4.1-dev_default,_offline --quiet=true --env.CM_QUIET=yes --env.CM_MLPERF_IMPLEMENTATION=nvidia --env.CM_MLPERF_MODEL=resnet50 --env.CM_MLPERF_RUN_STYLE=test --env.CM_MLPERF_SUBMISSION_SYSTEM_TYPE=edge --env.CM_MLPERF_DEVICE=cuda --env.CM_MLPERF_USE_DOCKER=True --env.CM_MLPERF_BACKEND=tensorrt --env.CM_MLPERF_LOADGEN_SCENARIO=Offline --env.CM_TEST_QUERY_COUNT=1000 --env.CM_MLPERF_FIND_PERFORMANCE_MODE=yes --env.CM_MLPERF_LOADGEN_ALL_MODES=no --env.CM_MLPERF_LOADGEN_MODE=performance --env.CM_MLPERF_RESULT_PUSH_TO_GITHUB=False --env.CM_MLPERF_SUBMISSION_GENERATION_STYLE=full --env.CM_MLPERF_SKIP_SUBMISSION_GENERATION=yes --env.CM_MLPERF_INFERENCE_VERSION=4.1-dev --env.CM_RUN_MLPERF_INFERENCE_APP_DEFAULTS=r4.1-dev_default --env.CM_MLPERF_LAST_RELEASE=v4.0 --env.CM_TMP_CURRENT_PATH=/home/tomcat --env.CM_TMP_PIP_VERSION_STRING= --env.CM_MODEL=resnet50 --env.CM_MLPERF_LOADGEN_COMPLIANCE=no --env.CM_MLPERF_LOADGEN_EXTRA_OPTIONS= --env.CM_MLPERF_LOADGEN_SCENARIOS,=Offline --env.CM_MLPERF_LOADGEN_MODES,=performance --env.CM_OUTPUT_FOLDER_NAME=test_results --add_deps_recursive.coco2014-original.tags=_full --add_deps_recursive.coco2014-preprocessed.tags=_full --add_deps_recursive.imagenet-original.tags=_full --add_deps_recursive.imagenet-preprocessed.tags=_full --add_deps_recursive.openimages-original.tags=_full --add_deps_recursive.openimages-preprocessed.tags=_full --add_deps_recursive.openorca-original.tags=_full --add_deps_recursive.openorca-preprocessed.tags=_full --v=False --print_env=False --print_deps=False --dump_version_info=True --env.CM_DATASET_IMAGENET_PATH=/home/cmuser/CM/repos/local/cache/87a60fb1d8344aeb/imagenet-2012-val --env.CM_MLPERF_INFERENCE_RESULTS_DIR=/home/cmuser/CM/repos/local/cache/966e187bf39a46c8 --env.CM_MLPERF_INFERENCE_SUBMISSION_DIR=/home/cmuser/CM/repos/local/cache/e880b27a4cf14bc8/mlperf-inference-submission --env.MLPERF_SCRATCH_PATH=/home/cmuser/CM/repos/local/cache/0cf60773c3484f98 --docker_run_deps
INFO:root:* cm run script "run docker container"
Checking Docker images:
docker images -q local/cm-script-app-mlperf-inference:ubuntu-20.04-latest 2> /dev/null
CM generated the following Docker build command:
docker build --build-arg GID=\" $(id -g $USER) \" --build-arg UID=\" $(id -u $USER) \" -f "/home/tomcat/CM/repos/mlcommons@cm4mlops/script/app-mlperf-inference/dockerfiles/mlperf-inference:mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public.Dockerfile" -t "local/cm-script-app-mlperf-inference:ubuntu-20.04-latest" .
INFO:root:         ! cd /home/tomcat/CM/repos/mlcommons@cm4mlops/script/app-mlperf-inference/dockerfiles
INFO:root:         ! call /home/tomcat/CM/repos/mlcommons@cm4mlops/script/build-docker-image/run.sh from tmp-run.sh
[+] Building 79632.0s (17/17) FINISHED                                                              docker:rootless
 => [internal] load build definition from mlperf-inference:mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public.Dockerfil  0.0s
 => => transferring dockerfile: 3.03kB  0.0s
 => WARN: SecretsUsedInArgOrEnv: Do not use ARG or ENV instructions for sensitive data (ARG "CM_GH_TOKEN") (line 14)  0.0s
 => [internal] load metadata for nvcr.io/nvidia/mlperf/mlperf-inference:mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-pub  0.0s
 => [internal] load .dockerignore  0.0s
 => => transferring context: 45B  0.0s
 => [ 1/14] FROM nvcr.io/nvidia/mlperf/mlperf-inference:mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public  0.0s
 => CACHED [ 2/14] RUN apt-get update -y  0.0s
 => CACHED [ 3/14] RUN apt-get install -y python3 python3-pip git sudo wget python3-venv  0.0s
 => CACHED [ 4/14] RUN ln -snf /usr/share/zoneinfo/US/Pacific /etc/localtime && echo US/Pacific >/etc/timezone  0.0s
 => CACHED [ 5/14] RUN groupadd -g 1001 -o cm  0.0s
 => CACHED [ 6/14] RUN useradd -m -u 1001 -g 1001 -o --create-home --shell /bin/bash cmuser  0.0s
 => CACHED [ 7/14] RUN echo "cmuser ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers  0.0s
 => CACHED [ 8/14] WORKDIR /home/cmuser  0.0s
 => CACHED [ 9/14] RUN python3 -m venv cm-venv  0.0s
 => CACHED [10/14] RUN . cm-venv/bin/activate  0.0s
 => CACHED [11/14] RUN python3 -m pip install --user cmind requests giturlparse tabulate  0.0s
 => CACHED [12/14] RUN cm pull repo mlcommons@cm4mlops --branch=mlperf-inference  0.0s
 => CACHED [13/14] RUN cm run script --tags=get,sys-utils-cm --quiet  0.0s
 => CANCELED [14/14] RUN cm run script --tags=app,mlperf,inference,generic,_nvidia,_resnet50,_tensorrt,_cuda,_test,_r4.1  79632.0s
1 warning found (use docker --debug to expand):
CM error: Portable CM script failed (name = build-docker-image, return code = 2)
^C
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script wrapped and unified by this CM script (automation recipe). Please re-run this script with --repro flag and report this issue with the original command line, cm-repro directory and full log here:
https://github.com/mlcommons/cm4mlops/issues
The CM concept is to collaboratively fix such issues inside portable CM scripts to make existing tools and native scripts more portable, interoperable and deterministic. Thank you!
(cm) tomcat@tomcat-Dove-Product:~$
Sorry, I should clarify: the docker build gets stuck on the git clone at step 14/14 for a long time, over 12 hours, so I have to stop the command by pressing Ctrl+C.
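(If it gets stuck again, one way to see what step 14/14 is actually doing is to re-run the generated build command with BuildKit's plain progress output; --progress=plain is a generic docker option, not something CM-specific, and the Dockerfile path is the one CM printed above:)

docker build --progress=plain --no-cache -f "/home/tomcat/CM/repos/mlcommons@cm4mlops/script/app-mlperf-inference/dockerfiles/mlperf-inference:mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public.Dockerfile" -t "local/cm-script-app-mlperf-inference:ubuntu-20.04-latest" .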
> will stop at cloning the git
Sorry, I'm unable to see this part in the shared output. Can you please share the number of cores and the RAM of the system?
Nvidia 4.0 code needs PyTorch built from source, and that typically takes around 2 hours on a 24-core, 64 GB system. If this is a problem, the best option is to use the Nvidia 4.1 code, which we are currently working on. We hope to make this available within a week.
tomcat@tomcat-Dove-Product:~$ lscpu | grep "socket\|Socket"
Core(s) per socket:  56
Socket(s):           2
tomcat@tomcat-Dove-Product:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           125Gi       4.8Gi       119Gi        41Mi       1.4Gi       119Gi
Swap:           49Gi          0B        49Gi
tomcat@tomcat-Dove-Product:~$
112 physical cores in total and 2 × 64 GB of memory.
@arjunsuresh Please refer to the run log below (I tried again today): the docker build stopped at 21% while downloading resnet50_v1.onnx and sat for 21336.8s with no download progress.
(cm) tomcat@tomcat-Dove-Product:~$ cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
    --model=resnet50 \
    --implementation=nvidia \
    --framework=tensorrt \
    --category=edge \
    --scenario=Offline \
    --execution_mode=test \
    --device=cuda \
    --docker --quiet \
    --test_query_count=1000
I believe it could be a network issue; it is best to restart the command if a download hangs like this. The zenodo download is slow but it works 99% of the time, as we have this resnet50 download in most of our GitHub Actions. Ideally this download should finish within a couple of minutes.
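(As a side note, if a download stalls partway, wget can usually resume the partial file rather than starting over; the URL below is a placeholder for whichever zenodo link CM prints in the log:)

wget -c --tries=10 --timeout=60 <zenodo-url-printed-by-cm>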
Yes, after several retries the resnet50 and other downloads passed, but the build still stopped at "Cloning into 'repo'..." as before (refer to the 1st picture).
I tried the command "git clone https://github.com/GATEOverflow/inference_results_v4.0.git --depth 5 repo" outside of docker: downloading was normal at first, lasted to about 20% of the progress, and then the error appeared (refer to the 2nd log).
The 1st picture:
The 2nd log:
tomcat@tomcat-Dove-Product:~/bobtry$ git clone https://github.com/GATEOverflow/inference_results_v4.0.git --depth 5 repo
Cloning into 'repo'...
remote: Enumerating objects: 71874, done.
remote: Counting objects: 100% (71874/71874), done.
remote: Compressing objects: 100% (33638/33638), done.
error: RPC failed; curl 92 HTTP/2 stream 0 was not closed cleanly: CANCEL (err 8)
error: 1705 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
tomcat@tomcat-Dove-Product:~/bobtry$
I think we should fix the download issue before proceeding with the MLPerf runs, as many more downloads are needed. Since the clone is failing from github, maybe the best option is to contact your system admin? In the meantime, a few generic git-side workarounds are sketched below.
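These are workarounds often suggested for the curl 92 / early EOF failure shown above, not something verified on this particular network:

# force HTTP/1.1 to avoid the HTTP/2 stream reset (curl 92) seen in the log
git config --global http.version HTTP/1.1
# clone as shallowly as possible, then deepen the history incrementally
git clone --depth 1 https://github.com/GATEOverflow/inference_results_v4.0.git repo
cd repo && git fetch --deepen=4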
@Bob123Yang while testing across multiple systems we have pinpointed this error to cases where the available network bandwidth is very low. One such case we have seen is during an rclone download, which chokes the network bandwidth and affects git clones of large repositories for any system on the same network.
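If an rclone transfer is saturating the link, rclone can cap its own bandwidth directly (the remote name and paths below are placeholders):

rclone copy remote:path /local/dest --bwlimit 10M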
I ran the cm commands below several times, and they always failed at the same place: