triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Segmentation fault in tritonserver streaming inference with TensorRT Baichuan model #88

Open yingjie1011 opened 1 year ago

yingjie1011 commented 1 year ago

Description: I successfully deployed a Triton backend with a Baichuan TensorRT engine, but got a segmentation fault during streaming inference.

Triton Information: I started the Triton container with the Docker image:

To Reproduce

  1. The Baichuan model: I built the TensorRT-LLM engine using the following command:

    python --model_dir /model --dtype bfloat16 --max_batch_size 1 --use_gemm_plugin bfloat16 --use_gpt_attention_plugin bfloat16 --output_dir ./mbs-4-1024-1024

  2. The Triton server: I prepared the model_repo, then deployed the triton-trt-llm backend using the following command:

    tritonserver --model-repository=/tensorrtllm_backend/triton_model_repo

The server appears to have been deployed successfully:

I1102 07:56:45.433572 1358] Pinned memory pool is created at '0x7f863e000000' with size 268435456
I1102 07:56:45.438607 1358] CUDA memory pool is created on device 0 with size 67108864
I1102 07:56:45.438625 1358] CUDA memory pool is created on device 1 with size 67108864
W1102 07:56:45.999011 1358] ignore version directory 'tokenizer' which fails to convert to integral number
I1102 07:56:45.999065 1358] loading: tensorrt_llm:1
I1102 07:56:45.999187 1358] loading: preprocessing:1
I1102 07:56:45.999296 1358] loading: postprocessing:1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
I1102 07:56:47.963264 1358] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I1102 07:56:48.950674 1358] successfully loaded 'postprocessing'
I1102 07:56:50.256275 1358] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I1102 07:56:53.314363 1358] successfully loaded 'preprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 26513 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 27793, GPU 27017 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 27795, GPU 27027 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +26505, now: CPU 0, GPU 26505 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 27803, GPU 27499 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 27803, GPU 27507 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 26505 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 27882, GPU 27549 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 27882, GPU 27559 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 26505 (MiB)
I1102 07:57:06.322534 1358] successfully loaded 'tensorrt_llm'
I1102 07:57:06.324335 1358] loading: ensemble:1
I1102 07:57:06.324626 1358] successfully loaded 'ensemble'
I1102 07:57:06.324712 1358]
| Repository Agent | Path |
I1102 07:57:06.324786 1358]
| Backend | Path | Config |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| python | /opt/tritonserver/backends/python/libt | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonse

I1102 07:57:06.324863 1358]
| Model | Version | Status |
| ensemble | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | READY |
I1102 07:57:06.382222 1358] Collecting metrics for GPU 0: NVIDIA A100-PCIE-40GB
I1102 07:57:06.382265 1358] Collecting metrics for GPU 1: NVIDIA A100-PCIE-40GB
I1102 07:57:06.382903 1358] Collecting CPU metrics
I1102 07:57:06.383059 1358] +----------------------------------+--------------------------------------------------------------+
| Option | Value |
| server_id | triton |
| server_version | 2.39.0 |
| server_extensions | classification sequence model_repository model_repository(un |
| | load_dependents) schedule_policy modelconfiguration system |
| | shared_memory cuda_shared_memory binary_tensor_data paramete |
| | rs statistics trace logging |
| model_repository_path[0] | /tensorrtllm_backend/triton_model_repo |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
I1102 07:57:06.385226 1358] Started GRPCInferenceService at
I1102 07:57:06.385549 1358] Started HTTPService at
I1102 07:57:06.428002 1358] Started Metrics Service at

  3. Streaming inference client: I started the client script using the following command:

    python3 --tokenizer_dir=/model --streaming

Then I got an error in the client:

Received an error from server: [StatusCode.UNAVAILABLE] Socket closed
output_ids = [[27569, 1374, 8482, 63, 32087, 7212, 92323, 1394, 66763, 13597, 1449, 1346]]
Input: Born in north-east France, Soyer trained as a
Output:

Meanwhile, the server crashed:

Signal (11) received.
0# 0x000055F2E74E513D in tritonserver
1# 0x00007F8697BA2520 in /lib/x86_64-linux-gnu/
2# 0x00007F860E87DBB0 in /opt/tritonserver/backends/tensorrtllm/
3# 0x00007F860E880852 in /opt/tritonserver/backends/tensorrtllm/
4# 0x00007F860E89D159 in /opt/tritonserver/backends/tensorrtllm/
5# 0x00007F860E8A0E05 in /opt/tritonserver/backends/tensorrtllm/
6# 0x00007F860E86004F in /opt/tritonserver/backends/tensorrtllm/
7# 0x00007F860E83B241 in /opt/tritonserver/backends/tensorrtllm/
8# 0x00007F860E83C38A in /opt/tritonserver/backends/tensorrtllm/
9# 0x00007F8697E64253 in /lib/x86_64-linux-gnu/
10# 0x00007F8697BF4AC3 in /lib/x86_64-linux-gnu/
11# clone in /lib/x86_64-linux-gnu/
Segmentation fault

How can I fix this error?
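
When reading a backtrace like the one above, it can help to group the frames by module to see where the crash sits. The following is a minimal sketch, not part of the original report: the helper names `frames_by_module` and `first_backend_frame` are hypothetical, and the module paths are kept exactly as truncated in the trace.

```python
import re
from collections import Counter

# Backtrace copied verbatim from the report; the library file names are
# truncated in the source and are left that way here.
BACKTRACE = """\
0# 0x000055F2E74E513D in tritonserver
1# 0x00007F8697BA2520 in /lib/x86_64-linux-gnu/
2# 0x00007F860E87DBB0 in /opt/tritonserver/backends/tensorrtllm/
3# 0x00007F860E880852 in /opt/tritonserver/backends/tensorrtllm/
4# 0x00007F860E89D159 in /opt/tritonserver/backends/tensorrtllm/
5# 0x00007F860E8A0E05 in /opt/tritonserver/backends/tensorrtllm/
6# 0x00007F860E86004F in /opt/tritonserver/backends/tensorrtllm/
7# 0x00007F860E83B241 in /opt/tritonserver/backends/tensorrtllm/
8# 0x00007F860E83C38A in /opt/tritonserver/backends/tensorrtllm/
9# 0x00007F8697E64253 in /lib/x86_64-linux-gnu/
10# 0x00007F8697BF4AC3 in /lib/x86_64-linux-gnu/
11# clone in /lib/x86_64-linux-gnu/
"""

# One frame per line: "<index># <address or symbol> in <module>"
FRAME = re.compile(r"^\s*(\d+)#\s+(\S+)\s+in\s+(\S+)\s*$")


def frames_by_module(trace: str) -> Counter:
    """Count how many frames belong to each module path."""
    counts = Counter()
    for line in trace.splitlines():
        match = FRAME.match(line)
        if match:
            counts[match.group(3)] += 1
    return counts


def first_backend_frame(trace: str, marker: str = "/backends/"):
    """Return (frame index, module) of the first frame inside a backend directory."""
    for line in trace.splitlines():
        match = FRAME.match(line)
        if match and marker in match.group(3):
            return int(match.group(1)), match.group(3)
    return None


if __name__ == "__main__":
    for module, count in frames_by_module(BACKTRACE).most_common():
        print(f"{count:2d}  {module}")
    print("first backend frame:", first_backend_frame(BACKTRACE))
```

Frame 0 is Triton's own signal handler and frame 1 is in the system library directory, so the first frame under `/opt/tritonserver/backends/tensorrtllm/` is the innermost one with a backend path. This only narrows the crash down to the backend library; without symbols it does not identify the faulting function.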

0xymoro commented 1 year ago

This error happens on Llama models too, so I can confirm it's model-agnostic (or at least affects many models). It happens consistently when the client closes the connection.

Kevinddddddd commented 1 year ago

+1, same error.

Hap-Zhang commented 1 year ago

+1 same error

byshiue commented 11 months ago

Could you try the latest main branch? The commit is

SamsonPh commented 11 months ago

Has anyone solved this problem yet? I am facing the same error with LLaMA 7b.

phlo46 commented 11 months ago

+1 same error on LLaMa 7b.

byshiue commented 11 months ago

Have you set decoupled to true in

SamsonPh commented 11 months ago

Yes, this is my config:

    name: "tensorrt_llm"
    backend: "tensorrtllm"
    max_batch_size: 1

    model_transaction_policy { decoupled: true }

    input [
      { name: "input_ids" data_type: TYPE_INT32 dims: [ -1 ] },
      { name: "input_lengths" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } },
      { name: "request_output_len" data_type: TYPE_UINT32 dims: [ 1 ] },
      { name: "end_id" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "pad_id" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "beam_width" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "temperature" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "runtime_top_k" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "runtime_top_p" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "len_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "repetition_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "min_length" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "presence_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "random_seed" data_type: TYPE_UINT64 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
      { name: "stop" data_type: TYPE_BOOL dims: [ 1 ] optional: true },
      { name: "streaming" data_type: TYPE_BOOL dims: [ 1 ] optional: true }
    ]
    output [
      { name: "output_ids" data_type: TYPE_INT32 dims: [ -1, -1 ] }
    ]
    instance_group [ { count: 1 kind : KIND_CPU } ]
    parameters: { key: "max_beam_width" value: { string_value: "1" } }
    parameters: { key: "FORCE_CPU_ONLY_INPUT_TENSORS" value: { string_value: "no" } }
    parameters: { key: "gpt_model_type" value: { string_value: "inflight_fused_batching" } }
    parameters: { key: "gpt_model_path" value: { string_value: "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1" } }
    parameters: { key: "max_tokens_in_paged_kv_cache" value: { string_value: "${max_tokens_in_paged_kv_cache}" } }
    parameters: { key: "batch_scheduler_policy" value: { string_value: "${batch_scheduler_policy}" } }
    parameters: { key: "kv_cache_free_gpu_mem_fraction" value: { string_value: "${kv_cache_free_gpu_mem_fraction}" } }
    parameters: { key: "max_num_sequences" value: { string_value: "${max_num_sequences}" } }
    parameters: { key: "enable_trt_overlap" value: { string_value: "${enable_trt_overlap}" } }

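The config posted above sets `decoupled: true`, and several of its parameter values are still literal `${...}` template placeholders. A quick check of both points can be sketched as below. This is a rough regex-based sketch, not a real parser for the protobuf text format; the helper names `is_decoupled` and `unfilled_placeholders` are hypothetical, and the thread does not establish whether the placeholders are related to the segmentation fault.

```python
import re

# "${name}" style template placeholders left unsubstituted in config.pbtxt.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

# Matches the compact form used in the thread:
#   model_transaction_policy { decoupled: true }
DECOUPLED = re.compile(
    r"model_transaction_policy\s*\{\s*decoupled\s*:\s*(true|false)\s*\}",
    re.IGNORECASE,
)


def unfilled_placeholders(config_text: str) -> list:
    """Return placeholder names in order of first appearance, without repeats."""
    seen = []
    for name in PLACEHOLDER.findall(config_text):
        if name not in seen:
            seen.append(name)
    return seen


def is_decoupled(config_text: str) -> bool:
    """True only if the config explicitly sets decoupled to true."""
    match = DECOUPLED.search(config_text)
    return bool(match) and match.group(1).lower() == "true"


# Shortened excerpt of the config posted in this thread.
SAMPLE = """
name: "tensorrt_llm"
backend: "tensorrtllm"
model_transaction_policy { decoupled: true }
parameters: { key: "max_tokens_in_paged_kv_cache" value: { string_value: "${max_tokens_in_paged_kv_cache}" } }
parameters: { key: "batch_scheduler_policy" value: { string_value: "${batch_scheduler_policy}" } }
"""

if __name__ == "__main__":
    print("decoupled:", is_decoupled(SAMPLE))
    print("unfilled:", unfilled_placeholders(SAMPLE))
```

To check a real file, read it with `open(path).read()` and pass the text to both helpers. An empty list from `unfilled_placeholders` means every template value was substituted.
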
byshiue commented 11 months ago

Here is the documentation for the Baichuan model, covering different cases. Please try following the scripts on the latest main branch again.

alwayshalffull commented 10 months ago

I'm running into the same issue here as well -- I rebuilt my engine and launched it in the latest version of the backend server (both v0.7.1), and I'm still consistently getting segfaults when the client closes the connection during streaming. I'm running Llama-2-70b.