Closed: Kevinddddddd closed this issue 8 months ago.
Hi @Kevinddddddd, could you try building the container following Option 3 and see if the segfault still happens?
@Kevinddddddd - also, what kind of requests are you sending: are you sending them to the generate endpoint, using gRPC, or using the standard infer endpoint?
I used the standard infer endpoint.
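For reference, a request to the standard infer endpoint for the TensorRT-LLM ensemble model looks roughly like the sketch below. The model name ("ensemble"), tensor names, shapes, and datatypes are assumptions based on the default tensorrtllm_backend model repository and should be checked against the actual config.pbtxt.

# Hypothetical standard infer request against a default "ensemble" model.
# Tensor names (text_input, max_tokens) and datatypes depend on the model's config.pbtxt.
curl -s -X POST localhost:8000/v2/models/ensemble/infer \
  -H "Content-Type: application/json" \
  -d '{
        "inputs": [
          {"name": "text_input", "shape": [1, 1], "datatype": "BYTES", "data": ["Hello, how are you?"]},
          {"name": "max_tokens", "shape": [1, 1], "datatype": "INT32", "data": [64]}
        ]
      }'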
OK, I will try. Also, I found that when I build the engine with float16, the segfault doesn't happen anymore.
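For context, a float16 engine build with TensorRT-LLM's Baichuan example script would look roughly like the command below. This is a sketch, not the exact command used in this issue: the checkpoint path, output path, model-version flag, and plugin flags are assumptions and vary with the TensorRT-LLM release.

# Hypothetical float16 build of a Baichuan2 engine with TensorRT-LLM's example script.
# Paths, --model_version, and plugin flags are assumptions; check examples/baichuan for the installed release.
python3 examples/baichuan/build.py \
    --model_version v2_13b \
    --model_dir /path/to/Baichuan2-13B-Chat \
    --dtype float16 \
    --use_gemm_plugin float16 \
    --use_gpt_attention_plugin float16 \
    --output_dir /path/to/baichuan2_trt_engines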
Description: I used Triton Inference Server with the TensorRT-LLM backend to deploy Baichuan2, but got errors when sending requests.
Triton Information: 23.10-trtllm-python-py3
Are you using the Triton container or did you build it yourself? I used the official Triton container.
To Reproduce: My device is a single A800. I used the following command to build the Baichuan2 engine.
I started the server with the following commands:
sudo docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus="device=7" -v /home/administrator/mnt/data/trt-llm/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash
python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo_non_streaming
When sending requests, the server crashed with the following error: