triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Assertion failed: d == a + length (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp:320) #191

Open zhaoxjmail opened 12 months ago

zhaoxjmail commented 12 months ago

I am using the docker image nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 and running the following script:

python scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo

The following error occurs:

[TensorRT-LLM][INFO] Loaded engine size: 15448 MiB
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: d == a + length (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp:320)
1  0x7f6ce803be0b /opt/tritonserver/backends/tensorrtllm/libnvinfer_plugin_tensorrt_llm.so.9(+0x35e0b) [0x7f6ce803be0b]
2  0x7f6ce809746c tensorrt_llm::plugins::GPTAttentionPluginCommon::GPTAttentionPluginCommon(void const*, unsigned long) + 908
3  0x7f6ce80a9b2d tensorrt_llm::plugins::GPTAttentionPlugin::GPTAttentionPlugin(void const*, unsigned long) + 13
4  0x7f6ce80a9b72 tensorrt_llm::plugins::GPTAttentionPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 50
5  0x7f6d1ff13c36 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a0c36) [0x7f6d1ff13c36]
6  0x7f6d1ff22a8e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10afa8e) [0x7f6d1ff22a8e]
7  0x7f6d1fead737 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x103a737) [0x7f6d1fead737]
8  0x7f6d1feab81e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x103881e) [0x7f6d1feab81e]
9  0x7f6d1fec252b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x104f52b) [0x7f6d1fec252b]
10 0x7f6d1fec4fa2 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1051fa2) [0x7f6d1fec4fa2]
11 0x7f6d1fec537c /usr/local/tensorrt/lib/libnvinfer.so.9(+0x105237c) [0x7f6d1fec537c]
12 0x7f6d1fef7051 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1084051) [0x7f6d1fef7051]
13 0x7f6d1fef7e17 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1084e17) [0x7f6d1fef7e17]
14 0x7f6d2e8d5594 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xed594) [0x7f6d2e8d5594]
15 0x7f6d2e84ef4e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66f4e) [0x7f6d2e84ef4e]
16 0x7f6d2e83ec0c /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x56c0c) [0x7f6d2e83ec0c]
17 0x7f6d2e8395f5 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x515f5) [0x7f6d2e8395f5]
18 0x7f6d2e8374db /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4f4db) [0x7f6d2e8374db]
19 0x7f6d2e81b182 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x33182) [0x7f6d2e81b182]
20 0x7f6d2e81b235 TRITONBACKEND_ModelInstanceInitialize + 101
21 0x7f6e2319aa86 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a4a86) [0x7f6e2319aa86]
22 0x7f6e2319bcc6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a5cc6) [0x7f6e2319bcc6]
23 0x7f6e2317ec15 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x188c15) [0x7f6e2317ec15]
24 0x7f6e2317f256 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x189256) [0x7f6e2317f256]
25 0x7f6e2318b27d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19527d) [0x7f6e2318b27d]
26 0x7f6e227f9ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7f6e227f9ee8]
27 0x7f6e2317597b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17f97b) [0x7f6e2317597b]
28 0x7f6e23185695 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18f695) [0x7f6e23185695]

moseshu commented 12 months ago

I have the same error. My TensorRT-LLM version is the latest main branch and my tensorrtllm_backend version is the latest main branch.

python3 build.py --model_dir mistral_base  --dtype float16 \
      --use_gpt_attention_plugin float16  \
      --use_gemm_plugin float16  \
      --output_dir "saved_model/1-gpu"  \
      --max_batch_size "8" --max_input_len 32256 --max_output_len 16384 \
      --use_rmsnorm_plugin float16  \
      --enable_context_fmha --remove_input_padding \
      --use_inflight_batching --paged_kv_cache \
      --max_num_tokens "2048"
moseshu commented 12 months ago

I solved it. This error is raised by the main branch: switch to TensorRT-LLM v0.6.1 and tensorrtllm_backend v0.6.1 and it works.
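
For anyone else hitting this, a rough sketch of pinning both repositories to the same tag before building the engine (v0.6.1 is the tag mentioned above; clone locations are illustrative):

# check out matching tags of TensorRT-LLM and tensorrtllm_backend
git clone -b v0.6.1 https://github.com/NVIDIA/TensorRT-LLM.git
git clone -b v0.6.1 https://github.com/triton-inference-server/tensorrtllm_backend.git
# TensorRT-LLM pulls in third-party code as submodules, so initialize them too
cd TensorRT-LLM && git submodule update --init --recursive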

zhaoxjmail commented 12 months ago

Thank you. I will try it later.

THU-mjx commented 11 months ago

@moseshu Which image do you use? I use the nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3, tensorRT-LLM=v0.6.1 and tensorrtllm_backend=v0.6.1. I still have this error.

byshiue commented 11 months ago

The issue is often caused by building the engine on one version of the code and running it on another version of the code.

So, please try rebuilding the full TRT-LLM with the latest code, then building the engine and running inference.
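
For reference, a minimal sketch of that flow, assuming the standard wheel build script in the TensorRT-LLM repo (scripts/build_wheel.py) and the default TensorRT location inside the dev container; adjust the paths to your setup:

# inside the TensorRT-LLM checkout that matches your backend version
git submodule update --init --recursive
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt
pip install ./build/tensorrt_llm*.whl
# then rebuild the engine with build.py from this same checkout and only
# deploy it on a backend built from the same commit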

THU-mjx commented 11 months ago

@byshiue You mean the image nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 is only suitable for the 0.5.0 release, and the image nvcr.io/nvidia/tritonserver:23.11-trtllm-python-py3 is suitable for the 0.6.0 or higher releases? I successfully built a TRT-LLM LLaMA engine and Triton server with the image nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 (0.5.0 release), but failed with 0.6.0 or higher releases.

byshiue commented 11 months ago

When you want to run serving on a backend docker image with the 0.5.0 release, you need to build the engine with TensorRT-LLM 0.5.0.

When you want to run serving on a backend docker image with the 0.6.0 release, you need to build the engine with TensorRT-LLM 0.6.0.

I mean you might have built on 0.5.0, then run it on both 0.5.0 and 0.6.0, and encountered the error in the latter case.
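
One way to sanity-check which TensorRT-LLM version a given serving image ships, assuming the tensorrt_llm Python wheel is installed inside the container (it is in the -trtllm- images):

# print the tensorrt_llm version bundled in the container
docker run --rm --gpus all nvcr.io/nvidia/tritonserver:23.11-trtllm-python-py3 \
    python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
# the engine must be built with a tensorrt_llm wheel of this same version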

moseshu commented 11 months ago

@moseshu Which image do you use? I use the nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3, tensorRT-LLM=v0.6.1 and tensorrtllm_backend=v0.6.1. I still have this error.

I use TensorRT-LLM v0.6.1, tensorrtllm_backend v0.6.1, and nvcr.io/nvidia/tritonserver:23.11-trtllm-python-py3.

moseshu commented 11 months ago

I use nvcr.io/nvidia/tritonserver:23.11-trtllm-python-py3 with the TensorRT-LLM main branch and the tensorrtllm_backend main branch, but it doesn't work. When I change both versions to v0.6.1, it works fine. Why?

byshiue commented 11 months ago

You might not have rebuilt TRT-LLM or the TRT-LLM backend on the main branch successfully.
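
For main-branch deployments, the backend image itself generally has to be rebuilt so the runtime and the engine come from the same commit. A hedged sketch, assuming the Dockerfile path this repo documented at the time (dockerfile/Dockerfile.trt_llm_backend):

# from a fresh tensorrtllm_backend checkout on the commit you built the engine with
git lfs install
git submodule update --init --recursive
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
# serve the engine with this image rather than a pre-built NGC container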

kelkarn commented 10 months ago

@byshiue - is there a document or a table that shows which TensorRT-LLM and tensorrtllm_backend tag(s) work with each tritonserver NGC container version? For example, something like this:

Triton server tag                  TensorRT-LLM tag          tensorrtllm_backend tag
23.10-trtllm-python-py3            v0.5.0                    v0.5.0
23.11-trtllm-python-py3            v0.6.0                    v0.6.0
23.12-trtllm-python-py3            v0.7.0                    v0.7.0

I think such a document/section in the tensorrtllm_backend repo README would help a lot with answering these versioning-related questions (particularly the question: "I built my engine with TRT-LLM tag X -- which pre-built Triton server containers will my engine work with?").

byshiue commented 10 months ago

@kelkarn

Please refer to the support matrix at https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html. Also, please don't ask the same question in many different issues.

Broyojo commented 9 months ago

I'm running into this same issue even though I built my engine with the latest version of TensorRT-LLM and am using the latest TensorRT-LLM container, nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3. I am using an int4_awq Mistral model.