triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

[Feature request] Batch inference encounter error: Expected batch dimension to be 1 for each request for input tensor input_ids. #163

Open CN-COTER opened 11 months ago

CN-COTER commented 11 months ago

First of all, thanks for the amazing work!

I want to get several answers via only 1 request.

With the Triton FasterTransformer backend, I simply duplicate the input batch_size times and then run batch inference; it works well and I get batch_size different answers. However, when I make the same inference request against the triton-tensorrtllm-backend, I get the following error:

Cannot process new request: Expected batch dimension to be 1 for each request for input tensor input_ids

Model: llama2-13B

Engine build script:

python build.py --model_dir=model_path --use_inflight_batching --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --enable_context_fmha --output_dir output_path --use_weight_only --max_batch_size 4 --weight_only_precision int4

Will batch inference on the trt-llm-backend be supported in the future?

What's more, can I get several answers from a single request by some other method (not low-efficiency batch inference)?
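For reference, here is a rough sketch of the kind of single batched request that triggers the error, using Triton's Python HTTP client. The tensor names follow the error message and the usual tensorrt_llm inputs; the model name, token values, and output length are placeholders, not my real setup.

```python
# Illustrative only: one request whose input_ids carries batch dimension 4.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

batch_size, seq_len = 4, 16
input_ids = np.ones((batch_size, seq_len), dtype=np.int32)            # shape [4, 16]
input_lengths = np.full((batch_size, 1), seq_len, dtype=np.int32)
request_output_len = np.full((batch_size, 1), 64, dtype=np.int32)

inputs = []
for name, arr in [("input_ids", input_ids),
                  ("input_lengths", input_lengths),
                  ("request_output_len", request_output_len)]:
    tensor = httpclient.InferInput(name, list(arr.shape), "INT32")
    tensor.set_data_from_numpy(arr)
    inputs.append(tensor)

# The backend rejects this single request because input_ids has batch dim 4:
#   "Expected batch dimension to be 1 for each request for input tensor input_ids"
result = client.infer("tensorrt_llm", inputs)
```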

ncomly-nvidia commented 11 months ago

Hi @CN-COTER, so you are looking to pass batch_size requests to the TRT-LLM backend at once, correct? Right now you can have BS > 1, but we assume requests arrive one at a time; the in-flight batcher will handle batching individual requests together.

We are looking into relaxing this constraint as well.

CN-COTER commented 11 months ago

> Hi @CN-COTER, so you are looking to pass batch_size requests to the TRT-LLM backend at once, correct? Right now you can have BS > 1, but we assume requests arrive one at a time; the in-flight batcher will handle batching individual requests together.
>
> We are looking into relaxing this constraint as well.

Thanks for the reply. I have a question: if I send BS requests to the backend concurrently, does the Triton backend execute inference BS times or just once? If it executes BS times, it does not make full use of GPU memory. If it executes just once with some feature like dynamic batching, that is acceptable.

ncomly-nvidia commented 11 months ago

When you pass requests to the Triton backend, it will execute them concurrently up to the maximum concurrency specified. This could be up to BS concurrent requests, but each individual request is only executed once.
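A minimal sketch of that pattern, assuming the Python HTTP client and the same illustrative tensorrt_llm model and tensor names as above: send BS requests, each with batch dimension 1, without waiting for each one to finish, and let the in-flight batcher group them.

```python
# Sketch: BS independent single-batch requests issued concurrently.
import numpy as np
import tritonclient.http as httpclient

BS = 4
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=BS)

def make_inputs(token_ids, output_len=64):
    """Build the three tensors for one request, each with batch dimension 1."""
    ids = np.asarray(token_ids, dtype=np.int32)[None, :]               # [1, seq_len]
    arrays = {
        "input_ids": ids,
        "input_lengths": np.array([[ids.shape[1]]], dtype=np.int32),
        "request_output_len": np.array([[output_len]], dtype=np.int32),
    }
    tensors = []
    for name, arr in arrays.items():
        t = httpclient.InferInput(name, list(arr.shape), "INT32")
        t.set_data_from_numpy(arr)
        tensors.append(t)
    return tensors

# Fire all BS requests up front; each is executed exactly once on the server.
pending = [client.async_infer("tensorrt_llm", make_inputs([1, 2, 3, 4]))
           for _ in range(BS)]
results = [p.get_result() for p in pending]
```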

ekagra-ranjan commented 11 months ago

Can @ncomly-nvidia @CN-COTER share how to send concurrent requests? It appears to me that only httpclient supports it and not grpcclient. Is httpclient the only option for sending concurrent requests?
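For what it's worth, tritonclient.grpc also exposes async_infer with a completion callback, which can keep several requests in flight at once. A minimal, untested sketch, reusing the same assumed model and tensor names as above:

```python
# Sketch: concurrent requests with the gRPC client via async_infer + callback.
import queue
import numpy as np
import tritonclient.grpc as grpcclient

results = queue.Queue()

def on_complete(result, error):
    # Invoked once per finished request from the client's worker thread.
    results.put(error if error is not None else result)

client = grpcclient.InferenceServerClient(url="localhost:8001")

def make_inputs(token_ids, output_len=64):
    ids = np.asarray(token_ids, dtype=np.int32)[None, :]               # batch dim 1
    arrays = {
        "input_ids": ids,
        "input_lengths": np.array([[ids.shape[1]]], dtype=np.int32),
        "request_output_len": np.array([[output_len]], dtype=np.int32),
    }
    tensors = []
    for name, arr in arrays.items():
        t = grpcclient.InferInput(name, list(arr.shape), "INT32")
        t.set_data_from_numpy(arr)
        tensors.append(t)
    return tensors

BS = 4
for i in range(BS):
    client.async_infer("tensorrt_llm", make_inputs([1, 2, 3, 4]),
                       callback=on_complete, request_id=str(i))

completed = [results.get() for _ in range(BS)]   # blocks until all BS finish
```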

CN-COTER commented 10 months ago

Any update on this feature request? @ncomly-nvidia