I'm trying to use Triton to deploy baichuan2-13B inference at bf16 precision. The tritonserver starts successfully, but it crashes when processing a client request.
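For context, I launch the server following the standard tensorrtllm_backend flow, roughly like the line below (the model repository path and world size are placeholders, not my exact values):

python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/path/to/triton_model_repo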
I used TensorRT-LLM v0.5.0 to build the engine with the following configuration:
python build.py --model_version v2_13b --model_dir /mnt/Baichuan2-13B-Chat/ --dtype bfloat16 --use_gemm_plugin bfloat16 --use_gpt_attention_plugin bfloat16 --output_dir /mnt/trt_engine/baichuan2-13B/1-gpu/
When I send a curl request, the server crashes with the following error:
I hope to get some suggestions for solving this problem.
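For reference, the request follows the generate endpoint example from the tensorrtllm_backend README (the prompt and parameter values below are illustrative, not my exact request):

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'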