mlcommons / cm4mlops

A collection of portable, reusable, and cross-platform automation recipes (CM scripts) with a human-friendly interface and minimal dependencies, making it easier to build, run, benchmark, and optimize AI, ML, and other applications and systems across diverse and continuously changing models, datasets, software, and hardware (cloud/edge).
http://docs.mlcommons.org/cm4mlops/
Apache License 2.0

KeyError while running NVIDIA GPT-J implementation #90

Open anandhu-eng opened 5 days ago

anandhu-eng commented 5 days ago

The command to reproduce:

cm run script --tags=run-mlperf,inference,_find-performance,_full \
    --model=gptj-99 \
    --implementation=nvidia \
    --framework=tensorrt \
    --category=edge \
    --scenario=Offline \
    --execution_mode=test \
    --device=cuda \
    --docker \
    --quiet \
    --test_query_count=50

Output:

* cm run script "run-mlperf inference _find-performance _full"

  * cm run script "get mlcommons inference src"
       ! load /home/anandhu/CM/repos/local/cache/08f829c532784225/cm-cached-state.json

  * cm run script "get sut description"

    * cm run script "detect os"
           ! cd /home/anandhu
           ! call /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/detect-os/run.sh from tmp-run.sh
           ! call "postprocess" from /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/detect-os/customize.py

    * cm run script "detect cpu"

      * cm run script "detect os"
             ! cd /home/anandhu
             ! call /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/detect-os/run.sh from tmp-run.sh
             ! call "postprocess" from /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/detect-os/customize.py
           ! cd /home/anandhu
           ! call /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/detect-cpu/run.sh from tmp-run.sh
           ! call "postprocess" from /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/detect-cpu/customize.py

    * cm run script "get python3"
         ! load /home/anandhu/CM/repos/local/cache/fc0768f669bd4605/cm-cached-state.json

Path to Python: /home/anandhu/CM/repos/local/cache/0011e46a023746ae/berttest/bin/python3
Python version: 3.12.3

    * cm run script "get compiler"
         ! load /home/anandhu/CM/repos/local/cache/4600294f81924f42/cm-cached-state.json

    * cm run script "get cuda-devices"

      * cm run script "get cuda _toolkit"
           ! load /home/anandhu/CM/repos/local/cache/042d9cdee6644854/cm-cached-state.json

ENV[CM_CUDA_PATH_LIB_CUDNN_EXISTS]: no
ENV[CM_CUDA_VERSION]: 12.4
ENV[CM_CUDA_VERSION_STRING]: cu124
ENV[CM_NVCC_BIN_WITH_PATH]: /home/anandhu/CM/repos/local/cache/347380df9f6c468b/install/bin/nvcc
ENV[CUDA_HOME]: /home/anandhu/CM/repos/local/cache/347380df9f6c468b/install

           ! cd /home/anandhu
           ! call /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/get-cuda-devices/run.sh from tmp-run.sh
./tmp-run.sh: line 3: /home/anandhu/CM/repos/local/cache/0011e46a023746ae/berttest/bin/activate: No such file or directory
rm: cannot remove 'a.out': No such file or directory

Checking compiler version ...

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Compiling program ...

Running program ...

/home/anandhu
           ! call "postprocess" from /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/get-cuda-devices/customize.py
GPU Device ID: 0
GPU Name: NVIDIA GeForce RTX 4090
GPU compute capability: 8.9
CUDA driver version: 12.2
CUDA runtime version: 12.4
Global memory: 25393692672
Max clock rate: 2520.000000 MHz
Total amount of shared memory per block: 49152
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor:  1536
Maximum number of threads per block: 1024
Max dimension size of a thread block X: 1024
Max dimension size of a thread block Y: 1024
Max dimension size of a thread block Z: 64
Max dimension size of a grid size X: 2147483647
Max dimension size of a grid size Y: 65535
Max dimension size of a grid size Z: 65535

    * cm run script "get generic-python-lib _package.dmiparser"
         ! load /home/anandhu/CM/repos/local/cache/3927fb01e4e34fde/cm-cached-state.json

    * cm run script "get cache dir _name.mlperf-inference-sut-descriptions"
         ! load /home/anandhu/CM/repos/local/cache/ad9c97dbf28c462a/cm-cached-state.json
Generating SUT description file for intel_spr_i9
         ! call "postprocess" from /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/get-mlperf-inference-sut-description/customize.py

  * cm run script "install pip-package for-cmind-python _package.tabulate"
       ! load /home/anandhu/CM/repos/local/cache/066fa0e1608f4b34/cm-cached-state.json

  * cm run script "get mlperf inference utils"

    * cm run script "get mlperf inference src"
         ! load /home/anandhu/CM/repos/local/cache/08f829c532784225/cm-cached-state.json
         ! call "postprocess" from /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/get-mlperf-inference-utils/customize.py
Using MLCommons Inference source from /home/anandhu/CM/repos/local/cache/aa85cd9ada244ffd/inference

Running loadgen scenario: Offline and mode: performance

* cm run script "build dockerfile"

Dockerfile generated at /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/app-mlperf-inference/dockerfiles/mlperf-inference:mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public.Dockerfile

* cm run script "get docker"

  * cm run script "detect os"
         ! cd /home/anandhu/CM/repos/local/cache/23342dac54164d0b
         ! call /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/detect-os/run.sh from tmp-run.sh
         ! call "postprocess" from /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/detect-os/customize.py

    * /usr/bin/docker
           ! cd /home/anandhu/CM/repos/local/cache/23342dac54164d0b
           ! call /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/get-docker/run.sh from tmp-run.sh
           ! call "detect_version" from /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/get-docker/customize.py
    Detected version: 26.1.3

    # Found artifact in /usr/bin/docker
       ! cd /home/anandhu/CM/repos/local/cache/23342dac54164d0b
       ! call /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/get-docker/run.sh from tmp-run.sh
       ! call "postprocess" from /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/get-docker/customize.py
    Detected version: 26.1.3

* cm run script "get mlperf inference results dir"
     ! load /home/anandhu/CM/repos/local/cache/f885f8230069430f/cm-cached-state.json

* cm run script "get mlperf inference submission dir"
       ! call "postprocess" from /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/get-mlperf-inference-submission-dir/customize.py

* cm run script "get ml-model gptj _nvidia _fp8"

  * cm run script "get git repo _repo.https://github.com/NVIDIA/TensorRT-LLM.git _sha.0ab9d17a59c284d2de36889832fe9fc7c8697604"

    * cm run script "detect os"
           ! cd /home/anandhu/CM/repos/local/cache/d1922606fb914e26
           ! call /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/detect-os/run.sh from tmp-run.sh
           ! call "postprocess" from /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/detect-os/customize.py
         ! cd /home/anandhu/CM/repos/local/cache/d1922606fb914e26
         ! call /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/get-git-repo/run.sh from tmp-run.sh
******************************************************
Current directory: /home/anandhu/CM/repos/local/cache/d1922606fb914e26

Cloning TensorRT-LLM.git from https://github.com/NVIDIA/TensorRT-LLM.git

git clone  --recurse-submodules https://github.com/NVIDIA/TensorRT-LLM.git  repo

Cloning into 'repo'...
remote: Enumerating objects: 19137, done.
remote: Counting objects: 100% (8972/8972), done.
remote: Compressing objects: 100% (2436/2436), done.
remote: Total 19137 (delta 7153), reused 7602 (delta 6503), pack-reused 10165
Receiving objects: 100% (19137/19137), 285.26 MiB | 19.03 MiB/s, done.
Resolving deltas: 100% (13996/13996), done.
Updating files: 100% (2402/2402), done.
Filtering content: 100% (14/14), 212.15 MiB | 32.99 MiB/s, done.
Submodule '3rdparty/NVTX' (https://github.com/NVIDIA/NVTX.git) registered for path '3rdparty/NVTX'
Submodule '3rdparty/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path '3rdparty/cutlass'
Submodule '3rdparty/cxxopts' (https://github.com/jarro2783/cxxopts) registered for path '3rdparty/cxxopts'
Submodule '3rdparty/json' (https://github.com/nlohmann/json.git) registered for path '3rdparty/json'
Cloning into '/home/anandhu/CM/repos/local/cache/d1922606fb914e26/repo/3rdparty/NVTX'...
remote: Enumerating objects: 2424, done.
remote: Counting objects: 100% (770/770), done.
remote: Compressing objects: 100% (219/219), done.
remote: Total 2424 (delta 553), reused 638 (delta 508), pack-reused 1654
Receiving objects: 100% (2424/2424), 2.68 MiB | 9.33 MiB/s, done.
Resolving deltas: 100% (1374/1374), done.
Cloning into '/home/anandhu/CM/repos/local/cache/d1922606fb914e26/repo/3rdparty/cutlass'...
remote: Enumerating objects: 26714, done.
remote: Counting objects: 100% (25/25), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 26714 (delta 5), reused 10 (delta 0), pack-reused 26689
Receiving objects: 100% (26714/26714), 42.66 MiB | 14.04 MiB/s, done.
Resolving deltas: 100% (20054/20054), done.
Cloning into '/home/anandhu/CM/repos/local/cache/d1922606fb914e26/repo/3rdparty/cxxopts'...
remote: Enumerating objects: 1877, done.
remote: Counting objects: 100% (212/212), done.
remote: Compressing objects: 100% (44/44), done.
remote: Total 1877 (delta 186), reused 168 (delta 168), pack-reused 1665
Receiving objects: 100% (1877/1877), 691.80 KiB | 3.26 MiB/s, done.
Resolving deltas: 100% (1106/1106), done.
Cloning into '/home/anandhu/CM/repos/local/cache/d1922606fb914e26/repo/3rdparty/json'...
remote: Enumerating objects: 38219, done.
remote: Counting objects: 100% (101/101), done.
remote: Compressing objects: 100% (56/56), done.
remote: Total 38219 (delta 50), reused 73 (delta 33), pack-reused 38118
Receiving objects: 100% (38219/38219), 185.18 MiB | 18.15 MiB/s, done.
Resolving deltas: 100% (23471/23471), done.
Submodule path '3rdparty/NVTX': checked out 'a1ceb0677f67371ed29a2b1c022794f077db5fe7'
Submodule path '3rdparty/cutlass': checked out '7d49e6c7e2f8896c47f586706e67e1fb215529dc'
Submodule path '3rdparty/cxxopts': checked out 'eb787304d67ec22f7c3a184ee8b4c481d04357fd'
Submodule path '3rdparty/json': checked out 'bc889afb4c5bf1c0d8ee29ef35eaaf4c8bef8a5d'

git checkout -b 0ab9d17a59c284d2de36889832fe9fc7c8697604 0ab9d17a59c284d2de36889832fe9fc7c8697604
Updating files: 100% (2713/2713), done.
Filtering content: 100% (4/4), 7.33 MiB | 4.33 MiB/s, done.
M       3rdparty/cutlass
Switched to a new branch '0ab9d17a59c284d2de36889832fe9fc7c8697604'
         ! call "postprocess" from /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/get-git-repo/customize.py

CM cache path to the Git repo: /home/anandhu/CM/repos/local/cache/d1922606fb914e26/repo

  * cm run script "get cuda"
       ! load /home/anandhu/CM/repos/local/cache/113e9cb12a914b88/cm-cached-state.json

ENV[CM_CUDA_PATH_LIB_CUDNN_EXISTS]: yes
ENV[CM_CUDA_VERSION]: 12.4
ENV[CM_CUDA_VERSION_STRING]: cu124
ENV[CM_NVCC_BIN_WITH_PATH]: /home/anandhu/CM/repos/local/cache/347380df9f6c468b/install/bin/nvcc
ENV[CUDA_HOME]: /home/anandhu/CM/repos/local/cache/347380df9f6c468b/install

  * cm run script "get nvidia scratch space"
         ! cd /home/anandhu/CM/repos/local/cache/83dfa55dc9de45dc
         ! call /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/get-mlperf-inference-nvidia-scratch-space/run.sh from tmp-run.sh
         ! call "postprocess" from /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/get-mlperf-inference-nvidia-scratch-space/customize.py

  * cm run script "get cuda-devices"

    * cm run script "get cuda _toolkit"
         ! load /home/anandhu/CM/repos/local/cache/042d9cdee6644854/cm-cached-state.json

ENV[CM_CUDA_PATH_LIB_CUDNN_EXISTS]: no
ENV[CM_CUDA_VERSION]: 12.4
ENV[CM_CUDA_VERSION_STRING]: cu124
ENV[CM_NVCC_BIN_WITH_PATH]: /home/anandhu/CM/repos/local/cache/347380df9f6c468b/install/bin/nvcc
ENV[CUDA_HOME]: /home/anandhu/CM/repos/local/cache/347380df9f6c468b/install

         ! cd /home/anandhu/CM/repos/local/cache/2e5566ea2fc648d4
         ! call /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/get-cuda-devices/run.sh from tmp-run.sh
rm: cannot remove 'a.out': No such file or directory

Checking compiler version ...

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Compiling program ...

Running program ...

/home/anandhu/CM/repos/local/cache/2e5566ea2fc648d4
         ! call "postprocess" from /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/get-cuda-devices/customize.py
GPU Device ID: 0
GPU Name: NVIDIA GeForce RTX 4090
GPU compute capability: 8.9
CUDA driver version: 12.2
CUDA runtime version: 12.4
Global memory: 25393692672
Max clock rate: 2520.000000 MHz
Total amount of shared memory per block: 49152
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor:  1536
Maximum number of threads per block: 1024
Max dimension size of a thread block X: 1024
Max dimension size of a thread block Y: 1024
Max dimension size of a thread block Z: 64
Max dimension size of a grid size X: 2147483647
Max dimension size of a grid size Y: 65535
Max dimension size of a grid size Z: 65535

  * cm run script "get ml-model gpt-j _fp32 _pytorch"
       ! load /home/anandhu/CM/repos/local/cache/54a457e3e708400c/cm-cached-state.json

Path to the ML model: None

  * cm run script "get nvidia inference common-code"

    * cm run script "get mlperf inference results"
         ! load /home/anandhu/CM/repos/local/cache/f885f8230069430f/cm-cached-state.json
         ! call "postprocess" from /home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/get-mlperf-inference-nvidia-common-code/customize.py
Traceback (most recent call last):
  File "/home/anandhu/.local/bin/cm", line 8, in <module>
    sys.exit(run())
             ^^^^^
  File "/home/anandhu/.local/lib/python3.12/site-packages/cmind/cli.py", line 37, in run
    r = cm.access(argv, out='con')
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhu/.local/lib/python3.12/site-packages/cmind/core.py", line 602, in access
    r = action_addr(i)
        ^^^^^^^^^^^^^^
  File "/home/anandhu/CM/repos/anandhu-eng@cm4mlops/automation/script/module.py", line 211, in run
    r = self._run(i)
        ^^^^^^^^^^^^
  File "/home/anandhu/CM/repos/anandhu-eng@cm4mlops/automation/script/module.py", line 1490, in _run
    r = customize_code.preprocess(ii)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/run-mlperf-inference-app/customize.py", line 219, in preprocess
    r = cm.access(ii)
        ^^^^^^^^^^^^^
  File "/home/anandhu/.local/lib/python3.12/site-packages/cmind/core.py", line 758, in access
    return cm.access(i)
           ^^^^^^^^^^^^
  File "/home/anandhu/.local/lib/python3.12/site-packages/cmind/core.py", line 602, in access
    r = action_addr(i)
        ^^^^^^^^^^^^^^
  File "/home/anandhu/CM/repos/anandhu-eng@cm4mlops/automation/script/module.py", line 4109, in docker
    return utils.call_internal_module(self, __file__, 'module_misc', 'docker', i)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhu/.local/lib/python3.12/site-packages/cmind/utils.py", line 1631, in call_internal_module
    return getattr(tmp_module, module_func)(i)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhu/CM/repos/anandhu-eng@cm4mlops/automation/script/module_misc.py", line 1817, in docker
    r = script_automation._run_deps(deps, [], env, {}, {}, {}, {}, '', [], '', False, '', verbose, show_time, ' ', run_state)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhu/CM/repos/anandhu-eng@cm4mlops/automation/script/module.py", line 3080, in _run_deps
    r = self.cmind.access(ii)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhu/.local/lib/python3.12/site-packages/cmind/core.py", line 602, in access
    r = action_addr(i)
        ^^^^^^^^^^^^^^
  File "/home/anandhu/CM/repos/anandhu-eng@cm4mlops/automation/script/module.py", line 211, in run
    r = self._run(i)
        ^^^^^^^^^^^^
  File "/home/anandhu/CM/repos/anandhu-eng@cm4mlops/automation/script/module.py", line 1380, in _run
    r = self._call_run_deps(deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhu/CM/repos/anandhu-eng@cm4mlops/automation/script/module.py", line 2909, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhu/CM/repos/anandhu-eng@cm4mlops/automation/script/module.py", line 3080, in _run_deps
    r = self.cmind.access(ii)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhu/.local/lib/python3.12/site-packages/cmind/core.py", line 602, in access
    r = action_addr(i)
        ^^^^^^^^^^^^^^
  File "/home/anandhu/CM/repos/anandhu-eng@cm4mlops/automation/script/module.py", line 211, in run
    r = self._run(i)
        ^^^^^^^^^^^^
  File "/home/anandhu/CM/repos/anandhu-eng@cm4mlops/automation/script/module.py", line 1568, in _run
    r = prepare_and_run_script_with_postprocessing(run_script_input)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhu/CM/repos/anandhu-eng@cm4mlops/automation/script/module.py", line 4733, in prepare_and_run_script_with_postprocessing
    rr = run_postprocess(customize_code, customize_common_input, recursion_spaces, env, state, const,
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhu/CM/repos/anandhu-eng@cm4mlops/automation/script/module.py", line 4785, in run_postprocess
    r = customize_code.postprocess(ii)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhu/CM/repos/anandhu-eng@cm4mlops/script/get-mlperf-inference-nvidia-common-code/customize.py", line 16, in postprocess
    env['CM_MLPERF_INFERENCE_NVIDIA_CODE_PATH'] = os.path.join(env['CM_MLPERF_INFERENCE_RESULTS_PATH'], "closed", "NVIDIA")
                                                               ~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'CM_MLPERF_INFERENCE_RESULTS_PATH'
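From the traceback, get-mlperf-inference-nvidia-common-code/customize.py indexes env['CM_MLPERF_INFERENCE_RESULTS_PATH'] directly, so if the "get mlperf inference results" dependency loads a cache entry that never exported that key, it surfaces as this bare KeyError. A minimal defensive sketch of the postprocess guard (my assumption, following the usual CM convention of returning {'return': 1, 'error': ...}; not an actual fix from the repo):

import os

def postprocess(i):
    env = i['env']

    # Guard against a stale or partial cache entry instead of raising KeyError
    results_path = env.get('CM_MLPERF_INFERENCE_RESULTS_PATH', '')
    if results_path == '':
        return {'return': 1,
                'error': 'CM_MLPERF_INFERENCE_RESULTS_PATH is not set; the '
                         '"get mlperf inference results" dependency may have '
                         'loaded a stale cache entry'}

    env['CM_MLPERF_INFERENCE_NVIDIA_CODE_PATH'] = os.path.join(results_path, "closed", "NVIDIA")

    return {'return': 0}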
arjunsuresh commented 5 days ago

It seems to be working fine for me. Can you try:

cm rm cache --tags=inference,results -f
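(Removing that cache entry should force the "get mlperf inference results" dependency to rerun instead of loading the stale cm-cached-state.json seen in the log, repopulating CM_MLPERF_INFERENCE_RESULTS_PATH. To check what is cached before and after, something like the following should list the matching entries:

cm show cache --tags=inference,results
)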
anandhu-eng commented 1 day ago

Hi @arjunsuresh, it's still there. I have made sure that my code is synced up to date with the main repo.