Closed: bprus closed this issue 1 month ago.
I also have the same issue. Any update?
In the code at https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/tools/utils/utils.py#L384, the value of `avg_in_flight_requests` is never updated.
Hey, that value is not implemented today in the code, and is hard-coded to 0. This does not mean IFB is not active; we'll try to get it implemented. In the meantime, I would suggest removing dynamic batching and `preferred_batch_size` from the Triton config. If you'd like, you can inspect per-iteration statistics (https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#triton-metrics is probably easiest), which will tell you how many prompt and generation requests are in each iteration. Having > 0 of both in any iteration is conclusive evidence of IFB working.
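For example, something along these lines can poll the metrics endpoint while your benchmark runs (a minimal sketch, assuming Triton's default metrics port 8002; it just prints every TensorRT-LLM counter, and the exact metric names may differ between versions):

```python
# Minimal sketch: dump the TensorRT-LLM metrics that Triton exposes on its
# Prometheus endpoint. Assumes the default metrics port (8002); the
# nv_trt_llm_ prefix follows the backend README, but check your version's output.
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"

def dump_trtllm_metrics():
    with urllib.request.urlopen(METRICS_URL) as resp:
        body = resp.read().decode("utf-8")
    for line in body.splitlines():
        # Keep only the TensorRT-LLM backend counters; this also skips
        # Prometheus comment lines (# HELP / # TYPE).
        if line.startswith("nv_trt_llm_"):
            print(line)

if __name__ == "__main__":
    dump_trtllm_metrics()
```

Running it a few times while requests are in flight and watching the prompt/generation counters is enough to confirm the behavior described above.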
System Info
- Architecture: x86_64
- TensorRT-LLM: v0.8.0 (docker build via `make -C docker release_build CUDA_ARCHS="86-real"`)
- Triton Server: r24.02 (docker from NGC)

Who can help?
No response
Information

Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
I followed the official examples for the Llama model (https://github.com/NVIDIA/TensorRT-LLM/tree/v0.8.0/examples/llama). I'm able to set everything up, and everything runs smoothly. I have inflight batching turned on in the model, as sketched below.
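For reference, "turned on" here means the inflight-batching mode in the `tensorrt_llm` model's `config.pbtxt`. A minimal sketch of the relevant parameter, assuming the standard `inflight_batcher_llm` template from tensorrtllm_backend (this is not my full config, just the part that enables IFB):

```
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
```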
However, when I run `benchmark_core_model.py`, I get:

I wonder why `Avg. InFlight requests` is 0.0. Do I need to set anything to use inflight batching?

I build the model with:
Triton Server logs:
The logs suggest that there is some kind of batching: `"Active Request Count":64` and `"Scheduled Requests":64`.

Please help me and recommend how I can correctly verify that inflight batching is enabled and working as expected.
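In case it helps, a quick way to pull those two counters out of a saved server log looks like this (a rough sketch; it only matches the two fields quoted above, since I'm not sure of the other field names):

```python
# Rough sketch: scan a saved Triton server log for the TensorRT-LLM
# iteration-statistics lines and print the two counters quoted above.
# Only "Active Request Count" and "Scheduled Requests" are assumed to
# exist; other field names may vary by version.
import re
import sys

STAT_RE = re.compile(r'"Active Request Count":(\d+).*?"Scheduled Requests":(\d+)')

def scan(log_path: str) -> None:
    with open(log_path, "r", errors="replace") as log:
        for line in log:
            match = STAT_RE.search(line)
            if match:
                active, scheduled = match.groups()
                print(f"active={active} scheduled={scheduled}")

if __name__ == "__main__":
    scan(sys.argv[1])
```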
Expected behavior
Working inflight batching.
Actual behavior
I'm not sure if inflight batching is working as expected.
Additional notes
My model config: