triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Segmentation fault in tritonserver streaming inference with TensorRT Baichuan model #88

Open · yingjie1011 opened this issue 1 year ago

yingjie1011 commented 1 year ago

Description

I deployed a Triton backend for a Baichuan TensorRT-LLM engine successfully, but got a segmentation fault during streaming inference.

Triton Information

I started the Triton container with the Docker image nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3.

To Reproduce

  1. The Baichuan model engine: I built the TensorRT-LLM engine with https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/examples/baichuan/build.py using the following command:

    python build.py --model_dir /model --dtype bfloat16 --max_batch_size 1 --use_gemm_plugin bfloat16 --use_gpt_attention_plugin bfloat16 --output_dir ./mbs-4-1024-1024

  2. The Triton server: I prepared the model repository following https://github.com/triton-inference-server/tensorrtllm_backend/tree/release/0.5.0/all_models/inflight_batcher_llm, then deployed the triton-trt-llm backend using the following command:

    tritonserver --model-repository=/tensorrtllm_backend/triton_model_repo

The server appears to have been deployed successfully:

I1102 07:56:45.433572 1358 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f863e000000' with size 268435456
I1102 07:56:45.438607 1358 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1102 07:56:45.438625 1358 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
W1102 07:56:45.999011 1358 model_lifecycle.cc:108] ignore version directory 'tokenizer' which fails to convert to integral number
I1102 07:56:45.999065 1358 model_lifecycle.cc:461] loading: tensorrt_llm:1
I1102 07:56:45.999187 1358 model_lifecycle.cc:461] loading: preprocessing:1
I1102 07:56:45.999296 1358 model_lifecycle.cc:461] loading: postprocessing:1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
I1102 07:56:47.963264 1358 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I1102 07:56:48.950674 1358 model_lifecycle.cc:818] successfully loaded 'postprocessing'
I1102 07:56:50.256275 1358 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I1102 07:56:53.314363 1358 model_lifecycle.cc:818] successfully loaded 'preprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 26513 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 27793, GPU 27017 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 27795, GPU 27027 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +26505, now: CPU 0, GPU 26505 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 27803, GPU 27499 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 27803, GPU 27507 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 26505 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 27882, GPU 27549 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 27882, GPU 27559 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 26505 (MiB)
I1102 07:57:06.322534 1358 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm'
I1102 07:57:06.324335 1358 model_lifecycle.cc:461] loading: ensemble:1
I1102 07:57:06.324626 1358 model_lifecycle.cc:818] successfully loaded 'ensemble'
I1102 07:57:06.324712 1358 server.cc:592]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I1102 07:57:06.324786 1358 server.cc:619]
+-------------+-----------------------------------------------------------------+---------------------------------------------------+
| Backend     | Path                                                            | Config                                            |
+-------------+-----------------------------------------------------------------+---------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true",        |
|             |                                                                 | "backend-directory":"/opt/tritonserver/backends", |
|             |                                                                 | "min-compute-capability":"6.000000",              |
|             |                                                                 | "default-max-batch-size":"4"}}                    |
| python      | /opt/tritonserver/backends/python/libtriton_python.so          | {"cmdline":{"auto-complete-config":"true",        |
|             |                                                                 | "backend-directory":"/opt/tritonserver/backends", |
|             |                                                                 | "min-compute-capability":"6.000000",              |
|             |                                                                 | "default-max-batch-size":"4"}}                    |
+-------------+-----------------------------------------------------------------+---------------------------------------------------+
I1102 07:57:06.324863 1358 server.cc:662]
+----------------+---------+--------+
| Model | Version | Status |
+----------------+---------+--------+
| ensemble | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | READY |
+----------------+---------+--------+
I1102 07:57:06.382222 1358 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA A100-PCIE-40GB
I1102 07:57:06.382265 1358 metrics.cc:817] Collecting metrics for GPU 1: NVIDIA A100-PCIE-40GB
I1102 07:57:06.382903 1358 metrics.cc:710] Collecting CPU metrics
I1102 07:57:06.383059 1358 tritonserver.cc:2458]
+----------------------------------+--------------------------------------------------------------+
| Option | Value |
+----------------------------------+--------------------------------------------------------------+
| server_id | triton |
| server_version | 2.39.0 |
| server_extensions | classification sequence model_repository model_repository(un |
| | load_dependents) schedule_policy modelconfiguration system |
| | shared_memory cuda_shared_memory binary_tensor_data paramete |
| | rs statistics trace logging |
| model_repository_path[0] | /tensorrtllm_backend/triton_model_repo |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+--------------------------------------------------------------+
I1102 07:57:06.385226 1358 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1102 07:57:06.385549 1358 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1102 07:57:06.428002 1358 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002

  3. Streaming inference client: I started the client with the script at https://github.com/triton-inference-server/tensorrtllm_backend/blob/release/0.5.0/inflight_batcher_llm/client/inflight_batcher_llm_client.py using the following command:

    python3 inflight_batcher_llm_client.py --tokenizer_dir=/model --streaming

Then I got the following error in the client:

Received an error from server: [StatusCode.UNAVAILABLE] Socket closed
output_ids = [[27569, 1374, 8482, 63, 32087, 7212, 92323, 1394, 66763, 13597, 1449, 1346]]
Input: Born in north-east France, Soyer trained as a
Output:

Meanwhile, the server crashed:

Signal (11) received.
0# 0x000055F2E74E513D in tritonserver
1# 0x00007F8697BA2520 in /lib/x86_64-linux-gnu/libc.so.6
2# 0x00007F860E87DBB0 in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
3# 0x00007F860E880852 in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
4# 0x00007F860E89D159 in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
5# 0x00007F860E8A0E05 in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
6# 0x00007F860E86004F in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
7# 0x00007F860E83B241 in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
8# 0x00007F860E83C38A in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
9# 0x00007F8697E64253 in /lib/x86_64-linux-gnu/libstdc++.so.6
10# 0x00007F8697BF4AC3 in /lib/x86_64-linux-gnu/libc.so.6
11# clone in /lib/x86_64-linux-gnu/libc.so.6
Segmentation fault

How can I fix this error?

0xymoro commented 1 year ago

This error happens on Llama models too, so I can confirm it's model-agnostic (or at least affects many models). It happens consistently when the client closes the connection.
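
For reference, here is a minimal sketch (not the repo's inflight_batcher_llm_client.py) of a streaming request that is torn down before generation finishes, which is the situation described above. It assumes the gRPC endpoint from the server logs (localhost:8001), the tensor names from the inflight_batcher_llm config posted later in this thread, the model name tensorrt_llm, and placeholder token IDs; adjust these for your own model repository.

    import time

    import numpy as np
    import tritonclient.grpc as grpcclient


    def stream_callback(result, error):
        # Each streamed response (or the final error) lands here.
        if error is not None:
            print("stream error:", error)
        else:
            print("partial output_ids:", result.as_numpy("output_ids"))


    client = grpcclient.InferenceServerClient("localhost:8001")

    input_ids = np.array([[1, 2, 3, 4]], dtype=np.int32)  # placeholder prompt tokens
    inputs = [
        grpcclient.InferInput("input_ids", list(input_ids.shape), "INT32"),
        grpcclient.InferInput("input_lengths", [1, 1], "INT32"),
        grpcclient.InferInput("request_output_len", [1, 1], "UINT32"),
        grpcclient.InferInput("streaming", [1, 1], "BOOL"),
    ]
    inputs[0].set_data_from_numpy(input_ids)
    inputs[1].set_data_from_numpy(np.array([[input_ids.shape[1]]], dtype=np.int32))
    inputs[2].set_data_from_numpy(np.array([[512]], dtype=np.uint32))
    inputs[3].set_data_from_numpy(np.array([[True]], dtype=bool))

    client.start_stream(callback=stream_callback)
    client.async_stream_infer("tensorrt_llm", inputs)
    time.sleep(0.5)       # let a few tokens stream back...
    client.stop_stream()  # ...then drop the stream before generation finishes
    client.close()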

Kevinddddddd commented 1 year ago

+1, same error.

Hap-Zhang commented 1 year ago

+1 same error

byshiue commented 11 months ago

Could you try the latest main branch (https://github.com/triton-inference-server/tensorrtllm_backend/tree/main)? The commit is https://github.com/triton-inference-server/tensorrtllm_backend/commit/37ed967216bdfa0ffce038d368675c93966172ea.

SamsonPh commented 11 months ago

Has anyone solved this problem yet? I am facing the same error with LLaMA 7b.

phlo46 commented 11 months ago

+1 same error on LLaMa 7b.

byshiue commented 11 months ago

Have you set `decoupled` to true in https://github.com/triton-inference-server/tensorrtllm_backend/blob/release/0.5.0/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt#L31-L33?
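
If it helps with debugging, you can also dump the configuration the running server actually loaded with the Triton Python gRPC client. This is a minimal sketch assuming the endpoint from the logs above; look for model_transaction_policy / decoupled: true in the printed output.

    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient("localhost:8001")
    # Print the server-side view of the tensorrt_llm model configuration.
    print(client.get_model_config("tensorrt_llm", as_json=True))
    client.close()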

SamsonPh commented 11 months ago

Yes, this is my config:

    name: "tensorrt_llm"
    backend: "tensorrtllm"
    max_batch_size: 1

    model_transaction_policy {
      decoupled: true
    }

    input [
      { name: "input_ids" data_type: TYPE_INT32 dims: [ -1 ] },
      { name: "input_lengths" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } },
      { name: "request_output_len" data_type: TYPE_UINT32 dims: [ 1 ] },
      { name: "end_id" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "pad_id" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "beam_width" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "temperature" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "runtime_top_k" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "runtime_top_p" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "len_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "repetition_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "min_length" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "presence_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "random_seed" data_type: TYPE_UINT64 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "stop" data_type: TYPE_BOOL dims: [ 1 ] optional: true },
      { name: "streaming" data_type: TYPE_BOOL dims: [ 1 ] optional: true }
    ]

    output [
      { name: "output_ids" data_type: TYPE_INT32 dims: [ -1, -1 ] }
    ]

    instance_group [
      { count: 1 kind: KIND_CPU }
    ]

    parameters: { key: "max_beam_width" value: { string_value: "1" } }
    parameters: { key: "FORCE_CPU_ONLY_INPUT_TENSORS" value: { string_value: "no" } }
    parameters: { key: "gpt_model_type" value: { string_value: "inflight_fused_batching" } }
    parameters: { key: "gpt_model_path" value: { string_value: "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1" } }
    parameters: { key: "max_tokens_in_paged_kv_cache" value: { string_value: "${max_tokens_in_paged_kv_cache}" } }
    parameters: { key: "batch_scheduler_policy" value: { string_value: "${batch_scheduler_policy}" } }
    parameters: { key: "kv_cache_free_gpu_mem_fraction" value: { string_value: "${kv_cache_free_gpu_mem_fraction}" } }
    parameters: { key: "max_num_sequences" value: { string_value: "${max_num_sequences}" } }
    parameters: { key: "enable_trt_overlap" value: { string_value: "${enable_trt_overlap}" } }

byshiue commented 11 months ago

Here is the documentation for the Baichuan model, which covers the different cases: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/baichuan.md. Please try again following the scripts on the latest main branch.

alwayshalffull commented 10 months ago

I'm running into the same issue here as well. I rebuilt my engine and launched it with the latest version of the backend server (both v0.7.1), and I'm still consistently getting segfaults when the client closes the connection during streaming. I'm running Llama-2-70b.