CN-COTER opened this issue 11 months ago
Hi @CN-COTER, so you are looking to pass a `batch_size` number of requests to the TRT-LLM backend at once, correct? Right now you can have BS > 1, but we assume one request arrives at a time; the inflight batcher will handle batching individual requests together.
We are looking into relaxing this constraint as well.
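Concretely, each request should keep a batch dimension of 1 (e.g. `input_ids` shaped `[1, seq_len]`); the inflight batcher then forms the batch on the server side from whatever requests are in flight. A minimal sketch with the Python HTTP client; the model name `tensorrt_llm` and the `input_ids`/`input_lengths`/`request_output_len`/`output_ids` tensor names are assumptions based on the backend's usual model config, so adjust them to your deployment:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# One prompt per request: batch dimension stays 1, shape [1, seq_len].
token_ids = np.array([[1, 15043, 29892, 3186]], dtype=np.int32)

inputs = [
    httpclient.InferInput("input_ids", list(token_ids.shape), "INT32"),
    httpclient.InferInput("input_lengths", [1, 1], "INT32"),
    httpclient.InferInput("request_output_len", [1, 1], "INT32"),
]
inputs[0].set_data_from_numpy(token_ids)
inputs[1].set_data_from_numpy(np.array([[token_ids.shape[1]]], dtype=np.int32))
inputs[2].set_data_from_numpy(np.array([[64]], dtype=np.int32))

# A single request; concurrent requests like this get batched in flight by the server.
result = client.infer(model_name="tensorrt_llm", inputs=inputs)
print(result.as_numpy("output_ids").shape)
client.close()
```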
Thanks for the reply. I have a question: if I send BS requests to the backend concurrently, does the Triton backend execute inference BS times or just once? If it executes BS times, it does not make full use of GPU memory. If it executes just once, with some feature like dynamic batching, that is acceptable.
When you pass requests to the Triton backend it will execute them concurrently, up to the max concurrency specified. This could be up to BS concurrent requests, but each individual request is only executed once.
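For illustration, here is a sketch of keeping BS single requests in flight at once from the Python HTTP client (the `concurrency` argument sets how many requests the client will keep outstanding). Same naming assumptions as in the sketch above, and the `build_inputs` helper is hypothetical:

```python
import numpy as np
import tritonclient.http as httpclient

def build_inputs(token_ids, output_len=64):
    # hypothetical helper: one single-batch request, as in the sketch above
    inputs = [
        httpclient.InferInput("input_ids", list(token_ids.shape), "INT32"),
        httpclient.InferInput("input_lengths", [1, 1], "INT32"),
        httpclient.InferInput("request_output_len", [1, 1], "INT32"),
    ]
    inputs[0].set_data_from_numpy(token_ids)
    inputs[1].set_data_from_numpy(np.array([[token_ids.shape[1]]], dtype=np.int32))
    inputs[2].set_data_from_numpy(np.array([[output_len]], dtype=np.int32))
    return inputs

prompts = [np.array([[1, 15043]], dtype=np.int32) for _ in range(4)]  # BS = 4 single requests

# concurrency=4 keeps up to 4 requests in flight; each request is still executed once
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=4)
pending = [client.async_infer("tensorrt_llm", build_inputs(p)) for p in prompts]
outputs = [req.get_result().as_numpy("output_ids") for req in pending]  # one response per request
client.close()
```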
Can @ncomly-nvidia @CN-COTER share how to send concurrent requests? It appears to me that only httpclient supports it and not grpcclient. Is httpclient the only option for concurrent requests?
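For what it's worth, grpcclient is not limited to one request at a time either; `tritonclient.grpc` exposes `async_infer` with a completion callback. A sketch under the same naming assumptions as above, not an official recipe:

```python
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient

def build_inputs(token_ids, output_len=64):
    # hypothetical helper, mirroring the HTTP sketch above
    inputs = [
        grpcclient.InferInput("input_ids", list(token_ids.shape), "INT32"),
        grpcclient.InferInput("input_lengths", [1, 1], "INT32"),
        grpcclient.InferInput("request_output_len", [1, 1], "INT32"),
    ]
    inputs[0].set_data_from_numpy(token_ids)
    inputs[1].set_data_from_numpy(np.array([[token_ids.shape[1]]], dtype=np.int32))
    inputs[2].set_data_from_numpy(np.array([[output_len]], dtype=np.int32))
    return inputs

completions = queue.Queue()

def on_done(index, result, error):
    # called from a client-internal thread when the request finishes
    completions.put((index, error if error is not None else result.as_numpy("output_ids")))

prompts = [np.array([[1, 15043]], dtype=np.int32) for _ in range(4)]

client = grpcclient.InferenceServerClient(url="localhost:8001")
for i, token_ids in enumerate(prompts):
    client.async_infer(
        model_name="tensorrt_llm",
        inputs=build_inputs(token_ids),
        callback=partial(on_done, i),
    )
answers = [completions.get() for _ in prompts]  # all requests were in flight concurrently
client.close()
```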
Any update on this feature request? @ncomly-nvidia
First of all, thanks for the amazing work!
I want to get several answers via only one request.
When using the Triton-fastertransformer-backend, I simply duplicate the input batch_size times and then run batch inference; it works well and I get batch_size different answers. However, when I run the same inference on the triton-tensorrtllm-backend, I get the following error:
Cannot process new request: Expected batch dimension to be 1 for each request for input tensor input_ids
Model: llama2-13B
Engine build script:
python build.py --model_dir=model_path --use_inflight_batching --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --enable_context_fmha --output_dir output_path --use_weight_only --max_batch_size 4 --weight_only_precision int4
Will batch inference on the trt-llm-backend be supported in the future?
What's more, can I get several answers via only one request with any other method (not batch inference with low efficiency)?
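Not an official answer, but following the earlier replies in this thread, one workaround for getting several answers to one prompt is to send the same prompt as several concurrent single requests with different sampling seeds and let the inflight batcher batch them on the server. The `random_seed` and `temperature` inputs (and the model/tensor names) are assumptions taken from the backend's usual `tensorrt_llm` model config, so please check your own config.pbtxt:

```python
import numpy as np
import tritonclient.http as httpclient

def build_sampled_inputs(token_ids, seed, output_len=64):
    # hypothetical helper; also set runtime_top_k / runtime_top_p per your config
    # so decoding is not greedy, otherwise every answer will be identical
    inputs = [
        httpclient.InferInput("input_ids", list(token_ids.shape), "INT32"),
        httpclient.InferInput("input_lengths", [1, 1], "INT32"),
        httpclient.InferInput("request_output_len", [1, 1], "INT32"),
        httpclient.InferInput("random_seed", [1, 1], "UINT64"),
        httpclient.InferInput("temperature", [1, 1], "FP32"),
    ]
    inputs[0].set_data_from_numpy(token_ids)
    inputs[1].set_data_from_numpy(np.array([[token_ids.shape[1]]], dtype=np.int32))
    inputs[2].set_data_from_numpy(np.array([[output_len]], dtype=np.int32))
    inputs[3].set_data_from_numpy(np.array([[seed]], dtype=np.uint64))
    inputs[4].set_data_from_numpy(np.array([[0.8]], dtype=np.float32))
    return inputs

prompt = np.array([[1, 15043, 29892, 3186]], dtype=np.int32)  # one prompt, batch dim 1
n_answers = 4

client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=n_answers)
pending = [
    client.async_infer("tensorrt_llm", build_sampled_inputs(prompt, seed))
    for seed in range(n_answers)
]
answers = [req.get_result().as_numpy("output_ids") for req in pending]  # n_answers completions
client.close()
```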