yingjie1011 opened this issue 1 year ago
This error happens with Llama models too, so I can confirm it's model-agnostic (or at least affects many models). It happens consistently when the client closes the connection.
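For what it's worth, the call pattern that triggers it for us looks roughly like the sketch below, using tritonclient's gRPC streaming API. The model name, tensor names/dtypes, and the dummy token ids follow the default inflight_batcher_llm `tensorrt_llm` model and are assumptions for illustration, not details taken from this thread:

```python
# Minimal sketch (assumed endpoint/model names): stream from the decoupled
# tensorrt_llm model, read a few partial responses, then close the connection
# while the request is still generating.
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient


def callback(responses, result, error):
    # Each partial (streamed) response or error from the decoupled model lands here.
    responses.put(error if error is not None else result)


def make_input(name, dtype, data):
    # Illustrative helper: wrap a numpy array as a Triton InferInput.
    t = grpcclient.InferInput(name, list(data.shape), dtype)
    t.set_data_from_numpy(data)
    return t


responses = queue.Queue()
client = grpcclient.InferenceServerClient(url="localhost:8001")  # assumed gRPC endpoint

# Dummy prompt ids -- placeholders, not real tokenizer output.
input_ids = np.array([[1, 2, 3]], dtype=np.int32)

inputs = [
    make_input("input_ids", "INT32", input_ids),
    make_input("input_lengths", "INT32", np.array([[input_ids.shape[1]]], dtype=np.int32)),
    make_input("request_output_len", "UINT32", np.array([[256]], dtype=np.uint32)),
    make_input("streaming", "BOOL", np.array([[True]], dtype=np.bool_)),
]

# Decoupled model: responses arrive through the stream callback.
client.start_stream(callback=partial(callback, responses))
client.async_stream_infer(model_name="tensorrt_llm", inputs=inputs, request_id="1")

# Consume only the first few partial results...
for _ in range(3):
    print(responses.get())

# ...then drop the stream and connection while the server is still generating.
# This is the point at which the backend segfaults for us.
client.stop_stream()
client.close()
```

The crash happens right after the stream and connection are torn down while the request is still in flight.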
+1, same error.
+1 same error
Could you try the latest main branch, https://github.com/triton-inference-server/tensorrtllm_backend/tree/main? The current commit is https://github.com/triton-inference-server/tensorrtllm_backend/commit/37ed967216bdfa0ffce038d368675c93966172ea.
Has anyone solved this problem yet? I am facing the same error with LLaMA 7b.
+1 same error on LLaMa 7b.
Have you set `decoupled` to true in https://github.com/triton-inference-server/tensorrtllm_backend/blob/release/0.5.0/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt#L31-L33?
Yes, this is my config:

```
name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 1

model_transaction_policy {
  decoupled: true
}

input [
  { name: "input_ids" data_type: TYPE_INT32 dims: [ -1 ] },
  { name: "input_lengths" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } },
  { name: "request_output_len" data_type: TYPE_UINT32 dims: [ 1 ] },
  { name: "end_id" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "pad_id" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "beam_width" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "temperature" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "runtime_top_k" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "runtime_top_p" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "len_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "repetition_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "min_length" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "presence_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "random_seed" data_type: TYPE_UINT64 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "stop" data_type: TYPE_BOOL dims: [ 1 ] optional: true },
  { name: "streaming" data_type: TYPE_BOOL dims: [ 1 ] optional: true }
]

output [
  { name: "output_ids" data_type: TYPE_INT32 dims: [ -1, -1 ] }
]

instance_group [
  { count: 1 kind: KIND_CPU }
]

parameters: { key: "max_beam_width" value: { string_value: "1" } }
parameters: { key: "FORCE_CPU_ONLY_INPUT_TENSORS" value: { string_value: "no" } }
parameters: { key: "gpt_model_type" value: { string_value: "inflight_fused_batching" } }
parameters: { key: "gpt_model_path" value: { string_value: "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1" } }
parameters: { key: "max_tokens_in_paged_kv_cache" value: { string_value: "${max_tokens_in_paged_kv_cache}" } }
parameters: { key: "batch_scheduler_policy" value: { string_value: "${batch_scheduler_policy}" } }
parameters: { key: "kv_cache_free_gpu_mem_fraction" value: { string_value: "${kv_cache_free_gpu_mem_fraction}" } }
parameters: { key: "max_num_sequences" value: { string_value: "${max_num_sequences}" } }
parameters: { key: "enable_trt_overlap" value: { string_value: "${enable_trt_overlap}" } }
```
Here is the Baichuan model documentation, https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/baichuan.md, which covers the different cases. Please try following those scripts on the latest main branch again.
I'm running into the same issue as well. I rebuilt my engine and launched it on the latest version of the backend server (both v0.7.1), and I still consistently get segfaults when the client closes the connection during streaming. I'm running Llama-2-70b.
Description
I deployed a Triton backend for the Baichuan TensorRT-LLM engine successfully, but I get a segmentation fault during streaming inference.
Triton Information
I start the Triton container with the Docker image nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3.
To Reproduce
The Baichuan model repository: I built the TensorRT-LLM engine with https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/examples/baichuan/build.py using the following command.
The Triton server: I prepared the model_repo following https://github.com/triton-inference-server/tensorrtllm_backend/tree/release/0.5.0/all_models/inflight_batcher_llm, then deployed the triton-trt-llm backend using the following command.
It seems that the server was deployed successfully.
Then I got an error in the client.
Meanwhile, the server crashed.
How can I fix this error?