TllmXqaJit runtime error when building Yi-6B fp8 with TRTLLM-0.10.0.dev2024050700

System Info

GPU: RTX 4090
OS: Docker (image built with the TensorRT-LLM make)
TensorRT-LLM version: 0.10.0.dev2024050700
Driver: 535.171.04
CUDA version: 12.4

Who can help?

@byshiue

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

python quantize.py --model_dir Yi-6B/ --qformat fp8 --kv_cache_dtype fp8 --output_dir test_ckpt
trtllm-build --checkpoint_dir test_ckpt --output_dir test_engine --strongly_typed

Expected behavior

The engine build succeeds.

Actual behavior

[05/13/2024-07:00:20] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[05/13/2024-07:00:20] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[05/13/2024-07:01:41] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[05/13/2024-07:01:41] [TRT] [I] Detected 14 inputs and 1 output network tensors.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] TllmXqaJit runtime error in tllmXqaJitCreateAndCompileProgram(&program, &context): NVRTC Internal Error (/src/tensorrt_llm/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/compileEngine.cpp:65)
1 0x7fcba6c4e5b4 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x6935b4) [0x7fcba6c4e5b4]
2 0x7fcba6d8ca59 tensorrt_llm::kernels::jit::CompileEngine::compile() const + 169
3 0x7fcba6d8e63b tensorrt_llm::kernels::jit::CubinObjRegistryTemplate<tensorrt_llm::kernels::XQAKernelFullHashKey, tensorrt_llm::kernels::XQAKernelFullHasher>::getCubin(tensorrt_llm::kernels::XQAKernelFullHashKey const&, tensorrt_llm::kernels::jit::CompileEngine*) + 267
4 0x7fcba6d8e077 tensorrt_llm::kernels::DecoderXQAImplJIT::prepare(tensorrt_llm::kernels::XQAParams const&) + 87
5 0x7fcb6aa94efb /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xbdefb) [0x7fcb6aa94efb]
6 0x7fcb6aab140d /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xda40d) [0x7fcb6aab140d]
7 0x7fcbc9cbbf38 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xd87f38) [0x7fcbc9cbbf38]
8 0x7fcbc9cbc85c /usr/local/tensorrt/lib/libnvinfer.so.10(+0xd8885c) [0x7fcbc9cbc85c]
9 0x7fcbc9d35caf /usr/local/tensorrt/lib/libnvinfer.so.10(+0xe01caf) [0x7fcbc9d35caf]
10 0x7fcbc9d0e4e0 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xdda4e0) [0x7fcbc9d0e4e0]
11 0x7fcbc9d1507c /usr/local/tensorrt/lib/libnvinfer.so.10(+0xde107c) [0x7fcbc9d1507c]
12 0x7fcbc9d17071 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xde3071) [0x7fcbc9d17071]
13 0x7fcbc995c61c /usr/local/tensorrt/lib/libnvinfer.so.10(+0xa2861c) [0x7fcbc995c61c]
14 0x7fcbc9961837 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xa2d837) [0x7fcbc9961837]
15 0x7fcbc99621af /usr/local/tensorrt/lib/libnvinfer.so.10(+0xa2e1af) [0x7fcbc99621af]
16 0x7fcbd78a6478 /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0xa6478) [0x7fcbd78a6478]
17 0x7fcbd78457a3 /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x457a3) [0x7fcbd78457a3]
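
For anyone triaging similar build logs, the following is a minimal sketch of how the exception message and the source location could be pulled out of the output above. It is not part of TensorRT-LLM; the function name `extract_trtllm_error` and the regular expression are hypothetical, and the pattern assumes the `[TensorRT-LLM][ERROR] <message> (<file>:<line>)` layout seen in this log.

```python
import re

# Assumed layout, taken from the log in this report:
#   [TensorRT-LLM][ERROR] <message> (<absolute source path>:<line number>)
ERROR_RE = re.compile(
    r"\[TensorRT-LLM\]\[ERROR\]\s+(?P<message>.*?)\s+"
    r"\((?P<file>/[^():]+):(?P<line>\d+)\)"
)


def extract_trtllm_error(log_text):
    """Return the first TensorRT-LLM error found in log_text, or None.

    Hypothetical helper for illustration; the result is a dict with the
    keys "message", "file" and "line".
    """
    match = ERROR_RE.search(log_text)
    if match is None:
        return None
    return {
        "message": match.group("message"),
        "file": match.group("file"),
        "line": int(match.group("line")),
    }
```

Passing the full build output to `extract_trtllm_error` returns the `TllmXqaJit runtime error ... NVRTC Internal Error` message together with `compileEngine.cpp` and line 65, which is usually enough to search for related reports.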

Additional notes

Yi-9B also encountered the same problem.

triton-inference-server / tensorrtllm_backend