zhaoxjmail opened this issue 12 months ago
I have the same error. The TensorRT-LLM version is the latest main branch and the tensorrtllm_backend version is the latest main branch.
python3 build.py --model_dir mistral_base --dtype float16 \
--use_gpt_attention_plugin float16 \
--use_gemm_plugin float16 \
--output_dir "saved_model/1-gpu" \
--max_batch_size "8" --max_input_len 32256 --max_output_len 16384 \
--use_rmsnorm_plugin float16 \
--enable_context_fmha --remove_input_padding \
--use_inflight_batching --paged_kv_cache \
--max_num_tokens "2048"
I solved it. This error is raised by the main branch. Change to TensorRT-LLM v0.6.1 and tensorrtllm_backend v0.6.1, and it works.
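For anyone following along, pinning both repos to the same release tag before rebuilding looks roughly like this (a minimal sketch; the checkout paths are placeholders, adjust to where you cloned the repos):

# Sketch: check out the matching release tag in both repos
cd /workspace/TensorRT-LLM
git fetch --tags && git checkout v0.6.1 && git submodule update --init --recursive
cd /workspace/tensorrtllm_backend
git fetch --tags && git checkout v0.6.1 && git submodule update --init --recursive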
Thank you. I will try it later.
@moseshu Which image do you use? I use nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3, TensorRT-LLM v0.6.1, and tensorrtllm_backend v0.6.1. I still have this error.
The issue is often caused by building the engine on one version of the code and running it on another version.
So, please try rebuilding the full TensorRT-LLM with the latest code, then building the engine and running inference.
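As a rough outline only (the build_wheel.py flags and paths below are assumptions and may differ between releases), the rebuild flow looks like this:

# 1. Rebuild TensorRT-LLM from the current checkout and reinstall the wheel
cd /workspace/TensorRT-LLM
git pull && git submodule update --init --recursive
python3 scripts/build_wheel.py --clean
pip install --force-reinstall build/tensorrt_llm-*.whl
# 2. Rebuild the engine with the freshly installed version (same flags as above)
python3 build.py --model_dir mistral_base --dtype float16 ...
# 3. Relaunch Triton from the tensorrtllm_backend checkout with the new engine
cd /tensorrtllm_backend
python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo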
@byshiue You mean the image nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 is only suitable for the 0.5.0 release, and the image nvcr.io/nvidia/tritonserver:23.11-trtllm-python-py3 is suitable for the 0.6.0 or higher releases?
I successfully built the TRT-LLM LLaMA engine and Triton server with the image nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 (0.5.0 release), but failed with 0.6.0 or higher releases.
When you want to run serving on a backend docker image with the 0.5.0 release, you need to build the engine with TensorRT-LLM 0.5.0.
When you want to run serving on a backend docker image with the 0.6.0 release, you need to build the engine with TensorRT-LLM 0.6.0.
I mean you might have built the engine on 0.5.0, then run it on both 0.5.0 and 0.6.0, and encountered the error in the latter case.
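A quick way to see which TensorRT-LLM version a serving container actually ships (assuming the wheel inside the image exposes a version string) is something like:

docker run --rm --gpus all nvcr.io/nvidia/tritonserver:23.11-trtllm-python-py3 \
    python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
# compare this against the tag/commit of the TensorRT-LLM checkout used to build the engine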
I use TensorRT-LLM v0.6.1 and tensorrtllm_backend v0.6.1 with nvcr.io/nvidia/tritonserver:23.11-trtllm-python-py3.
I use nvcr.io/nvidia/tritonserver:23.11-trtllm-python-py3 with the TensorRT-LLM main branch and the tensorrtllm_backend main branch, but it doesn't work. When I change both versions to v0.6.1, it works fine. Why?
You might not have rebuilt TensorRT-LLM or tensorrtllm_backend from the main branch successfully.
@byshiue - is there a document or a table that shows the mapping from the Triton server NGC container to the tag(s) that work with that particular container version? For example, something like this:
Triton server tag       | TensorRT-LLM tag | tensorrtllm_backend tag
23.10-trtllm-python-py3 | v0.5.0           | v0.5.0
23.11-trtllm-python-py3 | v0.6.0           | v0.6.0
23.12-trtllm-python-py3 | v0.7.0           | v0.7.0
I think such a document/section in the tensorrtllm_backend repo README would help out a lot with answering these versioning-related questions (particularly the question: "I built my engine with TRT-LLM tag X -- now what pre-built Triton server containers will my engine work with?").
@kelkarn Please refer to the support matrix at https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html. Also, please don't ask the same question in many different issues.
I'm running into this same issue even though I built my engine with the latest version of TensorRT-LLM and am using the latest container for TensorRT-LLM, nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3. I am using an int4_awq Mistral model.
I am using the docker image nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 and running the following script: python scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo
The following error occurs:
[TensorRT-LLM][INFO] Loaded engine size: 15448 MiB
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: d == a + length (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp:320)
1  0x7f6ce803be0b /opt/tritonserver/backends/tensorrtllm/libnvinfer_plugin_tensorrt_llm.so.9(+0x35e0b) [0x7f6ce803be0b]
2  0x7f6ce809746c tensorrt_llm::plugins::GPTAttentionPluginCommon::GPTAttentionPluginCommon(void const*, unsigned long) + 908
3  0x7f6ce80a9b2d tensorrt_llm::plugins::GPTAttentionPlugin::GPTAttentionPlugin(void const*, unsigned long) + 13
4  0x7f6ce80a9b72 tensorrt_llm::plugins::GPTAttentionPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 50
5  0x7f6d1ff13c36 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a0c36) [0x7f6d1ff13c36]
6  0x7f6d1ff22a8e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10afa8e) [0x7f6d1ff22a8e]
7  0x7f6d1fead737 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x103a737) [0x7f6d1fead737]
8  0x7f6d1feab81e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x103881e) [0x7f6d1feab81e]
9  0x7f6d1fec252b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x104f52b) [0x7f6d1fec252b]
10 0x7f6d1fec4fa2 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1051fa2) [0x7f6d1fec4fa2]
11 0x7f6d1fec537c /usr/local/tensorrt/lib/libnvinfer.so.9(+0x105237c) [0x7f6d1fec537c]
12 0x7f6d1fef7051 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1084051) [0x7f6d1fef7051]
13 0x7f6d1fef7e17 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1084e17) [0x7f6d1fef7e17]
14 0x7f6d2e8d5594 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xed594) [0x7f6d2e8d5594]
15 0x7f6d2e84ef4e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66f4e) [0x7f6d2e84ef4e]
16 0x7f6d2e83ec0c /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x56c0c) [0x7f6d2e83ec0c]
17 0x7f6d2e8395f5 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x515f5) [0x7f6d2e8395f5]
18 0x7f6d2e8374db /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4f4db) [0x7f6d2e8374db]
19 0x7f6d2e81b182 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x33182) [0x7f6d2e81b182]
20 0x7f6d2e81b235 TRITONBACKEND_ModelInstanceInitialize + 101
21 0x7f6e2319aa86 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a4a86) [0x7f6e2319aa86]
22 0x7f6e2319bcc6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a5cc6) [0x7f6e2319bcc6]
23 0x7f6e2317ec15 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x188c15) [0x7f6e2317ec15]
24 0x7f6e2317f256 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x189256) [0x7f6e2317f256]
25 0x7f6e2318b27d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19527d) [0x7f6e2318b27d]
26 0x7f6e227f9ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7f6e227f9ee8]
27 0x7f6e2317597b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17f97b) [0x7f6e2317597b]
28 0x7f6e23185695 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18f695) [0x7f6e23185695]
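If it helps to confirm which release actually produced the engine, the builder writes a config.json next to the serialized engine, and newer releases record version information in it. As a rough check (the path below follows the model repo used above and is an assumption, as are the exact key names):

# Sketch: look for version fields recorded by the engine builder
grep -i version /tensorrtllm_backend/triton_model_repo/tensorrt_llm/1/config.json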