Open calvinh99 opened 9 months ago
I'm running into the same exact issue. Any timeline for the fix?
Currently, the C++ Triton backend only accepts batch size 1 requests. We use in-flight batching to create larger batches from those batch size 1 requests. We don't have a timeline for supporting batch size > 1 requests with in-flight batching.
So you mean I cannot split a paragraph of text into a batch of sentences? A request like that would fail, right?
You could just send multiple requests, each request containing a single sentence.
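As a rough illustration of that workaround, here is a minimal sketch using the Triton Python HTTP client, assuming the standard ensemble model from this repo with text_input, max_tokens and text_output tensors (the model name, tensor names, and shapes are assumptions — adjust them to match your deployed config.pbtxt):

import numpy as np
import tritonclient.http as httpclient

sentences = ["First sentence.", "Second sentence.", "Third sentence."]
client = httpclient.InferenceServerClient(url="localhost:8000")

results = []
for sentence in sentences:
    # Each request carries a single sentence, so its batch dimension stays 1.
    text_input = httpclient.InferInput("text_input", [1, 1], "BYTES")
    text_input.set_data_from_numpy(np.array([[sentence.encode("utf-8")]], dtype=object))

    max_tokens = httpclient.InferInput("max_tokens", [1, 1], "INT32")
    max_tokens.set_data_from_numpy(np.full((1, 1), 128, dtype=np.int32))

    response = client.infer("ensemble", [text_input, max_tokens])
    results.append(response.as_numpy("text_output"))

Since the server uses in-flight batching, these per-sentence requests are still batched together on the GPU.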
System Info
Hardware:
Libraries:
Latest commit:
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
Who can help?
@juney-nvidia @byshiue
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
The problem is encountered when running the Triton Inference Server in the docker container.

Built the engine with tensorrtllm_backend/tensorrt_llm/examples/llama/build.py (used Mistral 7B Instruct weights) and used it for the tensorrt_llm model.

The triton_model_repo/tensorrt_llm/1/config.pbtxt:

My script:
def test_batched_request():
    max_input_len = 3096
    max_output_len = 512
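The full script is not shown above; a rough sketch of the kind of batched call it makes, assuming the default tensorrt_llm model tensors (input_ids, input_lengths, request_output_len) and dummy token ids purely for illustration, would look like this:

import numpy as np
import tritonclient.http as httpclient

max_input_len = 3096
max_output_len = 512

# A batch of 6 pre-tokenized sequences (dummy token ids, illustration only).
input_ids = np.random.randint(1, 32000, size=(6, max_input_len), dtype=np.int32)
input_lengths = np.full((6, 1), max_input_len, dtype=np.int32)
request_output_len = np.full((6, 1), max_output_len, dtype=np.int32)

inputs = []
for name, data in [("input_ids", input_ids),
                   ("input_lengths", input_lengths),
                   ("request_output_len", request_output_len)]:
    tensor = httpclient.InferInput(name, list(data.shape), "INT32")
    tensor.set_data_from_numpy(data)
    inputs.append(tensor)

client = httpclient.InferenceServerClient(url="localhost:8000")
# Fails on the C++ backend with "Expected batch dimension to be 1 for each request
# for input_ids"; sending six separate requests of shape [1, seq_len] avoids it.
response = client.infer("tensorrt_llm", inputs)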
I've searched again and again, but couldn't find any info on this error anywhere:
[TensorRT-LLM][ERROR] Assertion failed: Expected batch dimension to be 1 for each request for input_ids (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:234)
Expected behavior
I expect it to take inputs with a batch size greater than 1 (it was 6 in this case) and perform batched inference.
When I tested batched inference inside the docker container using tensorrtllm_backend/tensorrt_llm/examples/run.py with the same TensorRT engine, it worked. But the Triton Inference Server doesn't? For reference, this is how I ran it inside the docker container (not via the Triton Inference Server), from the tensorrtllm_backend/tensorrt_llm/examples directory. It worked fine there.
I'm really not sure why this error happens; I couldn't find anything about it anywhere.
actual behavior
The Triton server throws an error expecting the batch dimension to be 1.
additional notes
Some additional issues that I'm not sure why they occur (I fixed these by just changing the config.pbtxt, but I don't have a real understanding of why).
When I set the "max_tokens_in_paged_kv_cache" parameter in my config to 8192, the server started treating my inputs as incorrectly shaped (even though I made no other changes). When I change it back to 4096, everything works fine again.

Also, the documentation for batched requests points to the example script tensorrtllm_backend/inflight_batcher_llm/client/inflight_batcher_llm_client.py. But the default code seems to force batch size to 1, check here.
The other parts of this example script were very helpful; I just couldn't find any clues about inference beyond a batch size of 1.
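For reference, the max_tokens_in_paged_kv_cache setting mentioned above is a string-valued entry in the parameters block of the tensorrt_llm model's config.pbtxt; a sketch of the relevant fragment, using standard Triton parameter syntax and the value that worked here:

parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "4096"
  }
}

Raising this value to 8192 was the only change that coincided with the shape errors described above.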