Closed: Kevinddddddd closed this issue 8 months ago.
Hi @Kevinddddddd, could you try building the container following Option 3 and see if the segfault still happens?
@Kevinddddddd - also, what kind of requests are you sending: are you sending them to the generate endpoint, using gRPC, or using the standard infer endpoint?
I used the standard infer endpoint.
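For reference, a request to the standard infer endpoint for the TensorRT-LLM ensemble model looks roughly like the sketch below. The model name ("ensemble"), tensor names, shapes, and datatypes are assumptions based on the default tensorrtllm_backend model repository and should be checked against the actual config.pbtxt.

# Hypothetical standard infer request against a default "ensemble" model.
# Tensor names (text_input, max_tokens) and datatypes depend on the model's config.pbtxt.
curl -s -X POST localhost:8000/v2/models/ensemble/infer \
  -H "Content-Type: application/json" \
  -d '{
        "inputs": [
          {"name": "text_input", "shape": [1, 1], "datatype": "BYTES", "data": ["Hello, how are you?"]},
          {"name": "max_tokens", "shape": [1, 1], "datatype": "INT32", "data": [64]}
        ]
      }'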
OK, I will try. Also, I found that when I build the engine with float16, the segfault doesn't happen anymore.
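For context, a float16 engine build with TensorRT-LLM's Baichuan example script would look roughly like the command below. This is a sketch, not the exact command used in this issue: the checkpoint path, output path, model-version flag, and plugin flags are assumptions and vary with the TensorRT-LLM release.

# Hypothetical float16 build of a Baichuan2 engine with TensorRT-LLM's example script.
# Paths, --model_version, and plugin flags are assumptions; check examples/baichuan for the installed release.
python3 examples/baichuan/build.py \
    --model_version v2_13b \
    --model_dir /path/to/Baichuan2-13B-Chat \
    --dtype float16 \
    --use_gemm_plugin float16 \
    --use_gpt_attention_plugin float16 \
    --output_dir /path/to/baichuan2_trt_engines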
Description: I used Triton Inference Server with the TensorRT-LLM backend to deploy Baichuan2, but got errors when sending requests.
Triton Information: 23.10-trtllm-python-py3
Are you using the Triton container or did you build it yourself? I used the official Triton container.
To Reproduce: My device is a single A800. I used the following command to build the Baichuan2 engine.
I started the server with the following commands:
sudo docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus="device=7" -v /home/administrator/mnt/data/trt-llm/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash
python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo_non_streaming
When sending requests, the server crashed with the following error: