triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Inference server stalling #573

Open siddhatiwari opened 3 months ago

siddhatiwari commented 3 months ago

System Info

After roughly 30 seconds of inference requests, the inference server stalls and stops responding to any requests. There are no errors or crashes visible in the logs. The server is using decoupled mode with dynamic_batching.
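
For reference, decoupled mode and dynamic batching are set in the model's config.pbtxt along these lines (an illustrative excerpt, not the exact config in use; the queue-delay value is just a placeholder):

model_transaction_policy {
  decoupled: true
}

dynamic_batching {
  max_queue_delay_microseconds: 100
}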

These are the commands used to build the engine:

python3 ../quantization/quantize.py --model_dir ./llama2-70b \
                                   --dtype float16 \
                                   --qformat fp8 \
                                   --kv_cache_dtype fp8 \
                                   --output_dir ./llama2-70b-out \
                                   --calib_size 512 \
                                   --tp_size 2

CUDA_VISIBLE_DEVICES=6,7 trtllm-build --checkpoint_dir ./llama2-70b-out  \
             --output_dir ./llama2-70b-eng \
             --gemm_plugin float16 \
             --max_batch_size 160 \
             --max_input_len 2048 \
             --max_seq_len 2560 \
             --context_fmha enable \
             --gpt_attention_plugin float16 \
             --paged_kv_cache enable \
             --remove_input_padding enable \
             --max_num_tokens 65536 \
             --enable_xqa enable \
             --bert_context_fmha_fp32_acc enable \
             --workers 2 \
             --multiple_profiles enable \
             --use_fp8_context_fmha enable

Who can help?

No response

Reproduction

  1. Build the inference server Docker image
  2. Build the Llama 2 70B engine
  3. Start the server serving the engine (example commands below)
  4. Send requests at a high rate with ~2k-token context lengths
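
For steps 1 and 3, something along these lines works (a sketch; the image tag, model repository path, and launch method are illustrative, not necessarily the exact commands used here):

# Step 1: build the backend image from the tensorrtllm_backend repo root
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .

# Step 3: serve the TP=2 engine, e.g. via the repo's launch script
# (it generates the equivalent mpirun command with one tritonserver rank per GPU)
python3 scripts/launch_triton_server.py --world_size 2 --model_repo /path/to/triton_model_repo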

Expected behavior

Inference server doesn't stall

Actual behavior

Inference server stalls

Additional notes

Initial requests complete successfully, so I'm not sure why the server stalls afterwards.

dhruvmullick commented 3 months ago

How are you serving the requests? Are you launching with the official launch_triton_server.py script?

michaelroyzen commented 3 months ago

I am experiencing this too, @dhruvmullick. It started with the 0.13 dev version; all 0.12 versions work just fine. This is an issue with the server itself, not with the way it is launched. It hangs intermittently with 100% GPU utilization even though no requests are in flight.

siddhatiwari commented 2 months ago

Are there any updates on this? I'm still experiencing it on the latest TensorRT-LLM and backend versions.

@dhruvmullick I'm launching with the tritonserver CLI and sending requests with .stream_infer() from the gRPC client library: https://github.com/triton-inference-server/client/blob/cb9ba08b3f88dff802485f0577b008cdbf41c529/src/python/library/tritonclient/grpc/aio/__init__.py#L688
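
Roughly, the client side looks like this (a trimmed-down sketch assuming the stock ensemble model with text_input / max_tokens / stream tensors; model name, tensor shapes, URL, and concurrency are illustrative):

import asyncio

import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.grpc.aio as grpcclient_aio


async def infer_one(client, prompt, request_id):
    # Inputs for the stock ensemble model; tensor names and shapes come from
    # the default all_models config and may differ in other model repos.
    text = grpcclient.InferInput("text_input", [1, 1], "BYTES")
    text.set_data_from_numpy(np.array([[prompt]], dtype=object))
    max_tokens = grpcclient.InferInput("max_tokens", [1, 1], "INT32")
    max_tokens.set_data_from_numpy(np.array([[256]], dtype=np.int32))
    stream_flag = grpcclient.InferInput("stream", [1, 1], "BOOL")
    stream_flag.set_data_from_numpy(np.array([[True]], dtype=bool))

    async def request_iterator():
        # One request per stream; stream_infer takes an async iterator of
        # request dicts (model_name, inputs, request_id, ...).
        yield {
            "model_name": "ensemble",
            "inputs": [text, max_tokens, stream_flag],
            "request_id": request_id,
        }

    # Each streamed response arrives as a (result, error) pair.
    async for result, error in client.stream_infer(request_iterator()):
        if error is not None:
            raise error


async def main():
    client = grpcclient_aio.InferenceServerClient("localhost:8001")
    try:
        prompt = "hello " * 2000  # roughly 2k tokens of context
        await asyncio.gather(*(infer_one(client, prompt, str(i)) for i in range(64)))
    finally:
        await client.close()


asyncio.run(main())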

siddhatiwari commented 2 months ago

These might be the same underlying issue: https://github.com/triton-inference-server/tensorrtllm_backend/issues/574 https://github.com/triton-inference-server/tensorrtllm_backend/issues/596

michaelroyzen commented 2 weeks ago

This now seems to be fixed as of the November 5th update to the main branch.