Build failed with cuda runtime error.

Device specs

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off | 00000000:00:05.0 Off |                    0 |
| N/A   31C    P0              59W / 400W |    174MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

How to reproduce it

After I ran all the given commands to build the docker image, I went inside the container and ran the llama.py inside examples/ folder. And here is the error logs:

[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Loaded engine size: 12856 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 13276, GPU 14063 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 13276, GPU 14071 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +12852, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13276, GPU 14127 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 13276, GPU 14135 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13308, GPU 14151 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 13308, GPU 14161 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13341, GPU 14179 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 13341, GPU 14189 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] CUDA runtime error in cudaOccupancyMaxActiveBlocksPerMultiprocessor(&num_blocks_per_sm, mmha::masked_multihead_attention_kernel<T, T_cache, KVCacheBuffer, Dh, THDS_PER_BLOCK, KernelParamsType::DO_CROSS_ATTENTION, HAS_BEAMS, DO_MULTI_BLOCK>, THDS_PER_BLOCK, 0): no kernel image is available for execution on the device (/src/tensorrt_llm/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionLaunch.h:206)
1       0x7fc8b702a564 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x55564) [0x7fc8b702a564]
2       0x7fc8b7401622 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x42c622) [0x7fc8b7401622]
3       0x7fc8b7086545 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xb1545) [0x7fc8b7086545]
4       0x7fc8b70971f9 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xc21f9) [0x7fc8b70971f9]
5       0x7fc8b70a19dd /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xcc9dd) [0x7fc8b70a19dd]
6       0x7fc8b70a3c6a /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xcec6a) [0x7fc8b70a3c6a]
7       0x7fc8b709c7ad tensorrt_llm::plugins::GPTAttentionPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 189
8       0x7fc9116b6ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7fc9116b6ba9]
9       0x7fc91168c6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7fc91168c6af]
10      0x7fc91168e320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7fc91168e320]
11      0x7fc8da710d2f tensorrt_llm::runtime::GptSession::executeGenerationStep(int, std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<int, std::allocator<int> > const&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, std::vector<bool, std::allocator<bool> >&) + 1903
12      0x7fc8da71261e tensorrt_llm::runtime::GptSession::generateBatched(std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, tensorrt_llm::runtime::SamplingConfig const&, std::function<void (int, bool)> const&) + 3134
13      0x7fc8da7137e1 tensorrt_llm::runtime::GptSession::generate(tensorrt_llm::runtime::GenerationOutput&, tensorrt_llm::runtime::GenerationInput const&, tensorrt_llm::runtime::SamplingConfig const&) + 3105
14      0x7fc8da6aa949 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xcd949) [0x7fc8da6aa949]
15      0x7fc8da691bc7 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xb4bc7) [0x7fc8da691bc7]
16      0x5566a423be0e python3(+0x15fe0e) [0x5566a423be0e]
17      0x5566a42325eb _PyObject_MakeTpCall + 603
18      0x5566a424a7bb python3(+0x16e7bb) [0x5566a424a7bb]
19      0x5566a422a8a2 _PyEval_EvalFrameDefault + 24914
20      0x5566a424a4e1 python3(+0x16e4e1) [0x5566a424a4e1]
21      0x5566a424b192 PyObject_Call + 290
22      0x5566a42272c1 _PyEval_EvalFrameDefault + 11121
23      0x5566a4315e56 python3(+0x239e56) [0x5566a4315e56]
24      0x5566a4315cf6 PyEval_EvalCode + 134
25      0x5566a43407d8 python3(+0x2647d8) [0x5566a43407d8]
26      0x5566a433a0bb python3(+0x25e0bb) [0x5566a433a0bb]
27      0x5566a4340525 python3(+0x264525) [0x5566a4340525]
28      0x5566a433fa08 _PyRun_SimpleFileObject + 424
29      0x5566a433f653 _PyRun_AnyFileObject + 67
30      0x5566a433241e Py_RunMain + 702
31      0x5566a4308cad Py_BytesMain + 45
32      0x7fca0e41bd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fca0e41bd90]
33      0x7fca0e41be40 __libc_start_main + 128
34      0x5566a4308ba5 _start + 37
[e5d1ed2681d9:00503] *** Process received signal ***
[e5d1ed2681d9:00503] Signal: Aborted (6)
[e5d1ed2681d9:00503] Signal code:  (-6)
[e5d1ed2681d9:00503] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fca0e434520]
[e5d1ed2681d9:00503] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fca0e4889fc]
[e5d1ed2681d9:00503] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fca0e434476]
[e5d1ed2681d9:00503] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fca0e41a7f3]
[e5d1ed2681d9:00503] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7fc951876b9e]
[e5d1ed2681d9:00503] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7fc95188220c]
[e5d1ed2681d9:00503] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7fc9518811e9]
[e5d1ed2681d9:00503] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7fc951881959]
[e5d1ed2681d9:00503] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7fca0cb3e884]
[e5d1ed2681d9:00503] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311)[0x7fca0cb3ef41]
[e5d1ed2681d9:00503] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x3b)[0x7fc9518824cb]
[e5d1ed2681d9:00503] [11] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x55596)[0x7fc8b702a596]
[e5d1ed2681d9:00503] [12] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x42c622)[0x7fc8b7401622]
[e5d1ed2681d9:00503] [13] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xb1545)[0x7fc8b7086545]
[e5d1ed2681d9:00503] [14] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xc21f9)[0x7fc8b70971f9]
[e5d1ed2681d9:00503] [15] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xcc9dd)[0x7fc8b70a19dd]
[e5d1ed2681d9:00503] [16] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xcec6a)[0x7fc8b70a3c6a]
[e5d1ed2681d9:00503] [17] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins18GPTAttentionPlugin7enqueueEPKN8nvinfer116PluginTensorDescES5_PKPKvPKPvSA_P11CUstream_st+0xbd)[0x7fc8b709c7ad]
[e5d1ed2681d9:00503] [18] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9)[0x7fc9116b6ba9]
[e5d1ed2681d9:00503] [19] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af)[0x7fc91168c6af]
[e5d1ed2681d9:00503] [20] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320)[0x7fc91168e320]
[e5d1ed2681d9:00503] [21] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(_ZN12tensorrt_llm7runtime10GptSession21executeGenerationStepEiRKSt6vectorINS0_15GenerationInputESaIS3_EERS2_INS0_16GenerationOutputESaIS8_EERKS2_IiSaIiEEPNS_13batch_manager16kv_cache_manager14KVCacheManagerERS2_IbSaIbEE+0x76f)[0x7fc8da710d2f]
[e5d1ed2681d9:00503] [22] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(_ZN12tensorrt_llm7runtime10GptSession15generateBatchedERSt6vectorINS0_16GenerationOutputESaIS3_EERKS2_INS0_15GenerationInputESaIS7_EERKNS0_14SamplingConfigERKSt8functionIFvibEE+0xc3e)[0x7fc8da71261e]
[e5d1ed2681d9:00503] [23] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(_ZN12tensorrt_llm7runtime10GptSession8generateERNS0_16GenerationOutputERKNS0_15GenerationInputERKNS0_14SamplingConfigE+0xc21)[0x7fc8da7137e1]
[e5d1ed2681d9:00503] [24] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xcd949)[0x7fc8da6aa949]
[e5d1ed2681d9:00503] [25] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xb4bc7)[0x7fc8da691bc7]
[e5d1ed2681d9:00503] [26] python3(+0x15fe0e)[0x5566a423be0e]
[e5d1ed2681d9:00503] [27] python3(_PyObject_MakeTpCall+0x25b)[0x5566a42325eb]
[e5d1ed2681d9:00503] [28] python3(+0x16e7bb)[0x5566a424a7bb]
[e5d1ed2681d9:00503] [29] python3(_PyEval_EvalFrameDefault+0x6152)[0x5566a422a8a2]
[e5d1ed2681d9:00503] *** End of error message ***
Aborted (core dumped)

I am also guessing, a similar issue #58 is also raised.

huggingface / optimum-nvidia

Build failed with cuda runtime error. #64

Device specs

How to reproduce it