triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Triton Server crashes when using baichuan2-13B with bf16 precision for inference #198

Open Luis-xu opened 11 months ago

Luis-xu commented 11 months ago

I'm trying to use Triton to deploy baichuan2-13B for inference at bf16 precision. The tritonserver starts successfully, but it crashes when processing a client request.

The following is the command I used to build the engine:

```
python build.py --model_version v2_13b \
    --model_dir /mnt/Baichuan2-13B-Chat/ \
    --dtype bfloat16 \
    --use_gemm_plugin bfloat16 \
    --use_gpt_attention_plugin bfloat16 \
    --output_dir /mnt/trt_engine/baichuan2-13B/1-gpu/
```

When I send a curl request, the server crashes with the following error:

[screenshot: crash error output]

I hope to get some suggestions for solving this problem.
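For context, a request that exercises this path typically looks like the sketch below. The `ensemble` model name, port 8000, and payload fields follow the tensorrtllm_backend README defaults and are assumptions here; the reporter's exact request is not shown in the issue.

```
# Hypothetical request shape; adjust the model name, port, and fields
# to match your deployment.
curl -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'
```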

byshiue commented 11 months ago

Please try the latest main branch or v0.6.1.
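For readers hitting the same crash, a minimal sketch of switching to the v0.6.1 release follows; the submodule and LFS steps mirror the repository's usual clone instructions, so verify them against the README for your version, then rebuild the engine with the same build.py options as above.

```
# Check out the v0.6.1 tag of the backend (steps assumed from the repo README).
git clone -b v0.6.1 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git submodule update --init --recursive
git lfs install
git lfs pull
```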