triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Encountered an error in forward function: Input tensor 'host_kv_cache_block_pointers_0' not found #186

Open · liuwqiang opened this issue 9 months ago

liuwqiang commented 9 months ago

chatglm2 reports this error on the latest version branch (see the attached screenshot).

byshiue commented 9 months ago

What do you mean by latest version? The latest main branch or the latest release branch?

MasterJH5574 commented 9 months ago

Hitting this issue for Llama 2. The command I used to build the Llama engine is:

python examples/llama/build.py \
--model_dir /models/Llama-2-7b-chat-hf/ \
--dtype float16 \
--use_inflight_batching \
--paged_kv_cache \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--world_size=1 \
--output_dir ./tmp/llama-7B-fp16-1-gpu/

After building the engine and copying the files to triton_model_repo/tensorrt_llm/1 (as described in https://github.com/triton-inference-server/tensorrtllm_backend/tree/main#create-the-model-repository), I am able to launch the server through scripts/launch_triton_server.py; a rough sketch of these steps is included after the log below. The server launches successfully, I believe:

I1215 20:09:27.354057 1973 grpc_server.cc:2469] Started GRPCInferenceService at 0.0.0.0:8001
I1215 20:09:27.354177 1973 http_server.cc:4554] Started HTTPService at 0.0.0.0:8000
I1215 20:09:27.397534 1973 http_server.cc:282] Started Metrics Service at 0.0.0.0:8002
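
A rough sketch of the copy-and-launch steps above (assuming the all_models/inflight_batcher_llm template layout from the README and the engine output path from the build command; filling in the config.pbtxt parameters as the README describes is omitted here):

```bash
# Sketch only: copy the template model repository, drop in the built engine files,
# and launch the server; world_size matches the single-GPU engine built above.
cp -r all_models/inflight_batcher_llm triton_model_repo
cp ./tmp/llama-7B-fp16-1-gpu/* triton_model_repo/tensorrt_llm/1/
python3 scripts/launch_triton_server.py --world_size=1 --model_repo=triton_model_repo
```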

However, when I send the example request

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'

(copied from https://github.com/triton-inference-server/tensorrtllm_backend/tree/main#query-the-server-with-the-triton-generate-endpoint), I hit the same issue reported here, only with a slightly different expected shape; mine is (-1, 2, -1):

I1215 20:09:27.354057 1973 grpc_server.cc:2469] Started GRPCInferenceService at 0.0.0.0:8001
I1215 20:09:27.354177 1973 http_server.cc:4554] Started HTTPService at 0.0.0.0:8000
I1215 20:09:27.397534 1973 http_server.cc:282] Started Metrics Service at 0.0.0.0:8002
[TensorRT-LLM][ERROR] Encountered an error in forward function: Input tensor 'host_kv_cache_block_pointers_0' not found; expected shape: (-1, 2, -1) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:124)
1       0x7fb567bfc4b3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1124b3) [0x7fb567bfc4b3]
2       0x7fb567b5b1ee /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x711ee) [0x7fb567b5b1ee]
3       0x7fb567b5c480 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x72480) [0x7fb567b5c480]
4       0x7fb567b5fc3d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x75c3d) [0x7fb567b5fc3d]
5       0x7fb567b4e738 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x64738) [0x7fb567b4e738]
6       0x7fb567b4f905 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x65905) [0x7fb567b4f905]
7       0x7fb621250253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fb621250253]
8       0x7fb620fe0ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fb620fe0ac3]
9       0x7fb621071bf4 clone + 68
[TensorRT-LLM][ERROR] Encountered error for requestId 1804289384: Encountered an error in forward function: Input tensor 'host_kv_cache_block_pointers_0' not found; expected shape: (-1, 2, -1) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:124)
1       0x7fb567bfc4b3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1124b3) [0x7fb567bfc4b3]
2       0x7fb567b5b1ee /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x711ee) [0x7fb567b5b1ee]
3       0x7fb567b5c480 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x72480) [0x7fb567b5c480]
4       0x7fb567b5fc3d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x75c3d) [0x7fb567b5fc3d]
5       0x7fb567b4e738 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x64738) [0x7fb567b4e738]
6       0x7fb567b4f905 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x65905) [0x7fb567b4f905]
7       0x7fb621250253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fb621250253]
8       0x7fb620fe0ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fb620fe0ac3]
9       0x7fb621071bf4 clone + 68
[TensorRT-LLM][WARNING] Step function failed, continuing.

I have no idea what is happening under the hood. Could anyone provide some pointers?

byshiue commented 8 months ago

Could you try the latest main branch?

AatroxZZ commented 8 months ago

I tried the latest main branch, but the problem still occurs.

byshiue commented 8 months ago

Could you share the new log?

AatroxZZ commented 8 months ago

> Could you share the new log?

Here is the new log:

/opt/tritonserver/tensorrtllm_backend# CUDA_VISIBLE_DEVICES=1 python3 scripts/launch_triton_server.py --world_size=1 --model_repo=triton_model_repo --grpc_port 6001 --http_port 6002 --metrics_port 6003
I1218 13:52:35.719577 418025 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f512c000000' with size 268435456
I1218 13:52:35.721440 418025 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1218 13:52:35.726539 418025 model_lifecycle.cc:461] loading: postprocessing:1
I1218 13:52:35.726588 418025 model_lifecycle.cc:461] loading: preprocessing:1
I1218 13:52:35.726630 418025 model_lifecycle.cc:461] loading: tensorrt_llm:1
I1218 13:52:35.726667 418025 model_lifecycle.cc:461] loading: tensorrt_llm_bls:1
I1218 13:52:35.736589 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I1218 13:52:35.736604 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I1218 13:52:35.736664 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_1 (CPU device 0)
I1218 13:52:35.736712 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_2 (CPU device 0)
I1218 13:52:35.736735 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_1 (CPU device 0)
I1218 13:52:35.736799 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_2 (CPU device 0)
I1218 13:52:35.736819 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_4 (CPU device 0)
I1218 13:52:35.736884 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_3 (CPU device 0)
I1218 13:52:35.736908 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_5 (CPU device 0)
I1218 13:52:35.736931 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_4 (CPU device 0)
I1218 13:52:35.736967 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_6 (CPU device 0)
I1218 13:52:35.737027 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_7 (CPU device 0)
I1218 13:52:35.737028 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_5 (CPU device 0)
I1218 13:52:35.737062 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_6 (CPU device 0)
I1218 13:52:35.737142 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_7 (CPU device 0)
I1218 13:52:35.737361 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_3 (CPU device 0)
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I1218 13:52:35.845203 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_3 (CPU device 0)
I1218 13:52:35.845671 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)
I1218 13:52:35.845834 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_6 (CPU device 0)
I1218 13:52:35.846194 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_4 (CPU device 0)
I1218 13:52:35.846252 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_2 (CPU device 0)
I1218 13:52:35.846644 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_5 (CPU device 0)
I1218 13:52:35.846713 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_1 (CPU device 0)
I1218 13:52:35.850800 418025 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_7 (CPU device 0)
I1218 13:52:36.347803 418025 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm_bls'
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
I1218 13:52:36.576452 418025 model_lifecycle.cc:818] successfully loaded 'postprocessing'
I1218 13:52:36.644654 418025 model_lifecycle.cc:818] successfully loaded 'preprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 12856 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 12901, GPU 13354 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 12902, GPU 13364 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +12852, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 12935, GPU 15826 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 12935, GPU 15834 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 12968, GPU 15854 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 12968, GPU 15864 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] Using 80000 tokens in paged KV cache.
I1218 13:52:48.174623 418025 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm'
I1218 13:52:48.175203 418025 model_lifecycle.cc:461] loading: ensemble:1
I1218 13:52:48.175486 418025 model_lifecycle.cc:818] successfully loaded 'ensemble'
I1218 13:52:48.175569 418025 server.cc:606] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I1218 13:52:48.175631 418025 server.cc:633]
+-------------+------------------------------------------------------------------+--------------------------------------------------------------+
| Backend     | Path                                                             | Config                                                       |
+-------------+------------------------------------------------------------------+--------------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_python.so            | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0_","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so  | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+------------------------------------------------------------------+--------------------------------------------------------------+

I1218 13:52:48.175666 418025 server.cc:676]
+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| ensemble         | 1       | READY  |
| postprocessing   | 1       | READY  |
| preprocessing    | 1       | READY  |
| tensorrt_llm     | 1       | READY  |
| tensorrt_llm_bls | 1       | READY  |
+------------------+---------+--------+

I1218 13:52:48.425494 418025 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA A100-SXM4-80GB
I1218 13:52:48.476124 418025 metrics.cc:710] Collecting CPU metrics
I1218 13:52:48.476286 418025 tritonserver.cc:2483]
+----------------------------------+------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                      |
+----------------------------------+------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                     |
| server_version                   | 2.40.0                                                                                                     |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | triton_model_repo                                                                                          |
| model_control_mode               | MODE_NONE                                                                                                  |
| strict_model_config              | 1                                                                                                          |
| rate_limit                       | OFF                                                                                                        |
| pinned_memory_pool_byte_size     | 268435456                                                                                                  |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                   |
| min_supported_compute_capability | 6.0                                                                                                        |
| strict_readiness                 | 1                                                                                                          |
| exit_timeout                     | 30                                                                                                         |
| cache_enabled                    | 0                                                                                                          |
+----------------------------------+------------------------------------------------------------------------------------------------------------+

I1218 13:52:48.477597 418025 grpc_server.cc:2469] Started GRPCInferenceService at 0.0.0.0:6001
I1218 13:52:48.477826 418025 http_server.cc:4554] Started HTTPService at 0.0.0.0:6002
I1218 13:52:48.518915 418025 http_server.cc:282] Started Metrics Service at 0.0.0.0:6003
[TensorRT-LLM][ERROR] Encountered an error in forward function: Input tensor 'host_kv_cache_block_pointers_0' not found; expected shape: (-1, 2, -1) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:124)
1       0x7f50fbbfc4b3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1124b3) [0x7f50fbbfc4b3]
2       0x7f50fbb5b1ee /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x711ee) [0x7f50fbb5b1ee]
3       0x7f50fbb5c480 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x72480) [0x7f50fbb5c480]
4       0x7f50fbb5fc3d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x75c3d) [0x7f50fbb5fc3d]
5       0x7f50fbb4e738 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x64738) [0x7f50fbb4e738]
6       0x7f50fbb4f905 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x65905) [0x7f50fbb4f905]
7       0x7f5189250253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f5189250253]
8       0x7f5188fe0ac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f5188fe0ac3]
9       0x7f5189071bf4 clone + 68
[TensorRT-LLM][ERROR] Encountered error for requestId 1804289384: Encountered an error in forward function: Input tensor 'host_kv_cache_block_pointers_0' not found; expected shape: (-1, 2, -1) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:124)
1       0x7f50fbbfc4b3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1124b3) [0x7f50fbbfc4b3]
2       0x7f50fbb5b1ee /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x711ee) [0x7f50fbb5b1ee]
3       0x7f50fbb5c480 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x72480) [0x7f50fbb5c480]
4       0x7f50fbb5fc3d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x75c3d) [0x7f50fbb5fc3d]
5       0x7f50fbb4e738 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x64738) [0x7f50fbb4e738]
6       0x7f50fbb4f905 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x65905) [0x7f50fbb4f905]
7       0x7f5189250253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f5189250253]
8       0x7f5188fe0ac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f5188fe0ac3]
9       0x7f5189071bf4 clone + 68
[TensorRT-LLM][WARNING] Step function failed, continuing.

Command (note the missing -X flag, which explains the "Could not resolve host: POST" warning in the output below):

curl POST localhost:6002/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 128, "bad_words": "", "stop_words": ""}'

curl: (6) Could not resolve host: POST
{"error":"in ensemble 'ensemble', Encountered error for requestId 1804289384: Encountered an error in forward function: Input tensor 'host_kv_cache_block_pointers_0' not found; expected shape: (-1, 2, -1) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:124)\n1       0x7f50fbbfc4b3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1124b3) [0x7f50fbbfc4b3]\n2       0x7f50fbb5b1ee /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x711ee) [0x7f50fbb5b1ee]\n3       0x7f50fbb5c480 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x72480) [0x7f50fbb5c480]\n4       0x7f50fbb5fc3d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x75c3d) [0x7f50fbb5fc3d]\n5       0x7f50fbb4e738 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x64738) [0x7f50fbb4e738]\n6       0x7f50fbb4f905 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x65905) [0x7f50fbb4f905]\n7       0x7f5189250253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f5189250253]\n8       0x7f5188fe0ac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f5188fe0ac3]\n9       0x7f5189071bf4 clone + 68"}

leeeeeeeee1 commented 8 months ago

I am also hitting the same problem with tensorrt-llm 0.6.1.

snippetzero commented 8 months ago

I tried the latest main branch and hit the same problem: main branch + Qwen 72B + --use_weight_only.

manarshehadeh commented 8 months ago

Hitting the same issue with Llama-2-70b on tensorrt-llm 0.6.1

caseylai commented 8 months ago

Same issue on release v0.7.0 + baichuan1-13b-int8-1gpu model.

I1228 04:07:10.673660 5024 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7fc20a000000' with size 268435456
I1228 04:07:10.680567 5024 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1228 04:07:10.680578 5024 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I1228 04:07:10.680580 5024 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I1228 04:07:10.680582 5024 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I1228 04:07:10.680584 5024 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864
I1228 04:07:10.680586 5024 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864
I1228 04:07:11.277287 5024 model_lifecycle.cc:461] loading: postprocessing:1
I1228 04:07:11.277320 5024 model_lifecycle.cc:461] loading: preprocessing:1
I1228 04:07:11.277342 5024 model_lifecycle.cc:461] loading: tensorrt_llm:1
I1228 04:07:11.277357 5024 model_lifecycle.cc:461] loading: tensorrt_llm_bls:1
I1228 04:07:11.291198 5024 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)
I1228 04:07:11.291231 5024 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_1 (CPU device 0)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] The batch scheduler policy will be set to guaranteed_no_evictsince the backend operates in decoupled mode
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter pipeline_parallel cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'pipeline_parallel' not found
[TensorRT-LLM][WARNING] Parameter num_kv_heads cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_kv_heads' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] Parameter max_prompt_embedding_table_size cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_prompt_embedding_table_size' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][WARNING] Parameter gather_all_token_logits cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'gather_all_token_logits' not found
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I1228 04:07:11.370480 5024 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I1228 04:07:11.370528 5024 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_1 (CPU device 0)
I1228 04:07:11.370785 5024 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I1228 04:07:11.370887 5024 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_1 (CPU device 0)
I1228 04:07:11.627285 5024 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm_bls'
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
I1228 04:07:12.004828 5024 model_lifecycle.cc:818] successfully loaded 'preprocessing'
I1228 04:07:12.013232 5024 model_lifecycle.cc:818] successfully loaded 'postprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 13285 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13377, GPU 13760 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 13379, GPU 13770 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +13280, now: CPU 0, GPU 13280 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 13413, GPU 15272 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 13413, GPU 15280 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 13280 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13455, GPU 15304 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +12, now: CPU 13456, GPU 15316 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 13280 (MiB)
[TensorRT-LLM][INFO] Using 10880 tokens in paged KV cache.
I1228 04:07:25.528782 5024 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm'
I1228 04:07:25.529869 5024 model_lifecycle.cc:461] loading: ensemble:1
I1228 04:07:25.530533 5024 model_lifecycle.cc:818] successfully loaded 'ensemble'
I1228 04:07:25.530643 5024 server.cc:606]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1228 04:07:25.530723 5024 server.cc:633]
+-------------+--------------------------------------------------------+--------------------------------------------------------+
| Backend     | Path                                                   | Config                                                 |
+-------------+--------------------------------------------------------+--------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_python.so  | {"cmdline":{"auto-complete-config":"false","backend-di |
|             |                                                        | rectory":"/opt/tritonserver/backends","min-compute-cap |
|             |                                                        | ability":"6.000000","shm-region-prefix-name":"prefix0_ |
|             |                                                        | ","default-max-batch-size":"4"}}                       |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tenso | {"cmdline":{"auto-complete-config":"false","backend-di |
|             | rrtllm.so                                              | rectory":"/opt/tritonserver/backends","min-compute-cap |
|             |                                                        | ability":"6.000000","default-max-batch-size":"4"}}     |
|             |                                                        |                                                        |
+-------------+--------------------------------------------------------+--------------------------------------------------------+

I1228 04:07:25.530780 5024 server.cc:676]
+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| ensemble         | 1       | READY  |
| postprocessing   | 1       | READY  |
| preprocessing    | 1       | READY  |
| tensorrt_llm     | 1       | READY  |
| tensorrt_llm_bls | 1       | READY  |
+------------------+---------+--------+

I1228 04:07:25.694360 5024 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090
I1228 04:07:25.694402 5024 metrics.cc:817] Collecting metrics for GPU 1: NVIDIA GeForce RTX 4090
I1228 04:07:25.694411 5024 metrics.cc:817] Collecting metrics for GPU 2: NVIDIA GeForce RTX 4090
I1228 04:07:25.694419 5024 metrics.cc:817] Collecting metrics for GPU 3: NVIDIA GeForce RTX 4090
I1228 04:07:25.694427 5024 metrics.cc:817] Collecting metrics for GPU 4: NVIDIA GeForce RTX 4090
I1228 04:07:25.694435 5024 metrics.cc:817] Collecting metrics for GPU 5: NVIDIA GeForce RTX 4090
I1228 04:07:25.699976 5024 metrics.cc:710] Collecting CPU metrics
I1228 04:07:25.700180 5024 tritonserver.cc:2483]
+----------------------------------+----------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                       |
| server_version                   | 2.40.0                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy |
|                                  |  model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters s |
|                                  | tatistics trace logging                                                                      |
| model_repository_path[0]         | triton_model_repo                                                                            |
| model_control_mode               | MODE_NONE                                                                                    |
| strict_model_config              | 1                                                                                            |
| rate_limit                       | OFF                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                    |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                     |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                     |
| cuda_memory_pool_byte_size{2}    | 67108864                                                                                     |
| cuda_memory_pool_byte_size{3}    | 67108864                                                                                     |
| cuda_memory_pool_byte_size{4}    | 67108864                                                                                     |
| cuda_memory_pool_byte_size{5}    | 67108864                                                                                     |
| min_supported_compute_capability | 6.0                                                                                          |
| strict_readiness                 | 1                                                                                            |
| exit_timeout                     | 30                                                                                           |
| cache_enabled                    | 0                                                                                            |
+----------------------------------+----------------------------------------------------------------------------------------------+

I1228 04:07:25.702197 5024 grpc_server.cc:2469] Started GRPCInferenceService at 0.0.0.0:8001
I1228 04:07:25.702522 5024 http_server.cc:4554] Started HTTPService at 0.0.0.0:8000
I1228 04:07:25.744536 5024 http_server.cc:282] Started Metrics Service at 0.0.0.0:8002
[TensorRT-LLM][ERROR] Encountered an error in forward function: Input tensor 'host_kv_cache_block_pointers_0' not found; expected shape: (-1, 2, -1) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:124)
1       0x7fc137bfc4b3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1124b3) [0x7fc137bfc4b3]
2       0x7fc137b5b1ee /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x711ee) [0x7fc137b5b1ee]
3       0x7fc137b5c480 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x72480) [0x7fc137b5c480]
4       0x7fc137b5fc3d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x75c3d) [0x7fc137b5fc3d]
5       0x7fc137b4e738 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x64738) [0x7fc137b4e738]
6       0x7fc137b4f905 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x65905) [0x7fc137b4f905]
7       0x7fc263450253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fc263450253]
8       0x7fc2631e0ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fc2631e0ac3]
9       0x7fc263272a40 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126a40) [0x7fc263272a40]
[TensorRT-LLM][ERROR] Encountered error for requestId 1804289384: Encountered an error in forward function: Input tensor 'host_kv_cache_block_pointers_0' not found; expected shape: (-1, 2, -1) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:124)
1       0x7fc137bfc4b3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1124b3) [0x7fc137bfc4b3]
2       0x7fc137b5b1ee /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x711ee) [0x7fc137b5b1ee]
3       0x7fc137b5c480 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x72480) [0x7fc137b5c480]
4       0x7fc137b5fc3d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x75c3d) [0x7fc137b5fc3d]
5       0x7fc137b4e738 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x64738) [0x7fc137b4e738]
6       0x7fc137b4f905 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x65905) [0x7fc137b4f905]
7       0x7fc263450253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fc263450253]
8       0x7fc2631e0ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fc2631e0ac3]
9       0x7fc263272a40 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126a40) [0x7fc263272a40]
[TensorRT-LLM][WARNING] Step function failed, continuing.
E1228 04:07:29.378051 5024 model.py:345] Traceback (most recent call last):
  File "/opt/tritonserver/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/model.py", line 305, in execute
    raise pb_utils.TritonModelException(
c_python_backend_utils.TritonModelException: Encountered error for requestId 1804289384: Encountered an error in forward function: Input tensor 'host_kv_cache_block_pointers_0' not found; expected shape: (-1, 2, -1) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:124)
1       0x7fc137bfc4b3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1124b3) [0x7fc137bfc4b3]
2       0x7fc137b5b1ee /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x711ee) [0x7fc137b5b1ee]
3       0x7fc137b5c480 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x72480) [0x7fc137b5c480]
4       0x7fc137b5fc3d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x75c3d) [0x7fc137b5fc3d]
5       0x7fc137b4e738 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x64738) [0x7fc137b4e738]
6       0x7fc137b4f905 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x65905) [0x7fc137b4f905]
7       0x7fc263450253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fc263450253]
8       0x7fc2631e0ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fc2631e0ac3]
9       0x7fc263272a40 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126a40) [0x7fc263272a40]
david100w commented 8 months ago

Hitting the same issue with qwen-7b (bfloat16, 1-gpu) on tensorrt-llm 0.7.0

THU-mjx commented 8 months ago

Hitting the same issue with llama-7b on main branch

leeeeeeeee1 commented 8 months ago

> I am also hitting the same problem with tensorrt-llm 0.6.1.

I solved this problem by rebuilding tensorrtllm_backend and tensorrt_llm inside the Docker container.
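
In case it helps others: this error typically indicates that the engine was built with a different TensorRT-LLM version than the one the Triton backend was compiled against, so the runtime looks for an engine input (host_kv_cache_block_pointers_0) that the engine does not expose. The rebuild flow is roughly the following, using the Llama 2 build command from earlier in the thread as the example. This is a sketch that assumes the tensorrt_llm submodule layout and the scripts/build_wheel.py entry point of the TensorRT-LLM repo; exact paths and flags may differ between versions:

```bash
# Sketch only: make the TensorRT-LLM library used for engine building match the one
# the backend was compiled against, then rebuild the engine with that same version.
cd /opt/tritonserver/tensorrtllm_backend
git submodule update --init --recursive          # pull the matching tensorrt_llm sources

cd tensorrt_llm
python3 scripts/build_wheel.py --clean --trt_root /usr/local/tensorrt
pip install ./build/tensorrt_llm*.whl

# Rebuild the engine with the freshly installed version, then redo the model-repository
# copy and relaunch Triton as before. (Rebuilding libtriton_tensorrtllm.so itself follows
# the backend repo's build instructions and is omitted here.)
cd examples/llama
python3 build.py --model_dir /models/Llama-2-7b-chat-hf/ \
    --dtype float16 --use_inflight_batching --paged_kv_cache \
    --remove_input_padding --use_gpt_attention_plugin float16 \
    --enable_context_fmha --use_gemm_plugin float16 \
    --world_size=1 --output_dir ./tmp/llama-7B-fp16-1-gpu/
```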