LanceB57 opened this issue 3 months ago
I'm testing now, and I'm able to get some decent results with 10 concurrent prompts (~700 tokens per second). However, when I add an 11th, I get the following error:
```
[TensorRT-LLM][ERROR] 3: [executionContext.cpp::validateInputBindings::1753] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::validateInputBindings::1753, condition: profileMinDims.d[i] <= dimensions.d[i] Supplied binding dimension [20] for bindings[45] exceed min ~ max range at index 0, maximum dimension in profile is 512, minimum dimension in profile is 256, but supplied dimension is 20.)
```
I'm guessing this is an issue with the number of tokens involved, as changing `max_tokens` from 1024 to 128 allows this to run error-free. However, the error eventually pops up again as I add more concurrent requests. How might I resolve it?
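For reference, the check that is failing can be sketched like this (a toy reconstruction of TensorRT's profile validation, not its actual code):

```python
def validate_binding(supplied_dim: int, profile_min: int, profile_max: int) -> bool:
    # TensorRT rejects any input binding whose dimension falls outside
    # the [min, max] range of the selected optimization profile.
    return profile_min <= supplied_dim <= profile_max

# The failing case from the error above: a dimension of 20 was supplied
# to a profile built for the range [256, 512].
print(validate_binding(20, profile_min=256, profile_max=512))  # → False
```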
It looks like your requests exceed the limits you set when building the engine. Could you share the scripts you used to build the engine?
Sure:
```shell
python3 examples/llama/convert_checkpoint.py --model_dir meta-llama/Meta-Llama-3-8B-Instruct \
    --output_dir ./llama_ckpt --dtype float16 --tp_size 1
trtllm-build --checkpoint_dir llama_ckpt --model_config llama_config.json --strongly_typed \
    --output_dir ./llama_engine --max_batch_size 2048 --max_input_len 2048 --max_output_len 4096 \
    --workers 8 --max_num_tokens 2048 --use_paged_context_fmha enable --multiple_profiles enable
```
I was actually able to resolve this problem by changing `max_attention_window_size` on the Triton Inference Server from 2560 to 4096. Could you give some insight into why this is, though?
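For anyone hitting the same thing, the stanza I changed lives in the `tensorrt_llm` model's `config.pbtxt` (key name as in the tensorrtllm_backend template; double-check against your version):

```
parameters: {
  key: "max_attention_window_size"
  value: {
    string_value: "4096"
  }
}
```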
Also, in the above `trtllm-build` command, what's the difference between `max_output_len` and `max_num_tokens`?
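My rough mental model of the distinction, which I'd love to have confirmed: `max_output_len` is a per-request cap on generated tokens, while `max_num_tokens` is a per-iteration budget shared by every in-flight request (a request in the context phase contributes its whole prompt; a request in the generation phase contributes one token per step). A toy sketch of that reading:

```python
# Assumed semantics, not confirmed by the thread:
MAX_OUTPUT_LEN = 4096   # per request: maximum generated tokens
MAX_NUM_TOKENS = 2048   # per engine iteration: token budget across the batch

def fits_in_iteration(active_token_counts: list[int]) -> bool:
    """Would scheduling these requests exceed the iteration budget?"""
    return sum(active_token_counts) <= MAX_NUM_TOKENS

# Ten requests in generation phase (1 token each) plus one new request
# whose 700-token prompt must be processed in the context phase:
print(fits_in_iteration([1] * 10 + [700]))    # → True
# A 2100-token prompt alone blows the per-iteration budget:
print(fits_in_iteration([1] * 10 + [2100]))   # → False
```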
System Info

Who can help?
@kaiyux

Information

Tasks
- `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Follow this benchmarking article to create the Llama3 8B Instruct model engine (but also using a converted checkpoint for `trtllm-build`). Then, finish setup similar to this blog.

Expected behavior
After following the benchmarking article, I get that throughput is ~10,000 tokens per second. I hope to achieve this on a Triton Inference Server.
Actual behavior
Instead, a single request has throughput of around 80 tokens per second, and I'm struggling to figure out how to efficiently manage concurrent requests. I'm hoping to achieve the ~10,000 tokens per second that the benchmark showed the GPU/engine is capable of reaching.
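To measure this, I've been firing concurrent requests along these lines (sketch only: `generate` here is a hypothetical stand-in for the real client call to the server, simulated with a fixed latency so the script is self-contained and the numbers are illustrative):

```python
# Sketch: aggregate token throughput under N concurrent requests.
import asyncio
import time

async def generate(prompt: str, max_tokens: int = 128) -> int:
    # HYPOTHETICAL stub for a real Triton client call; pretends the
    # server took 50 ms and produced max_tokens tokens.
    await asyncio.sleep(0.05)
    return max_tokens

async def benchmark(concurrency: int) -> float:
    start = time.perf_counter()
    tokens = await asyncio.gather(
        *(generate(f"prompt {i}") for i in range(concurrency))
    )
    return sum(tokens) / (time.perf_counter() - start)  # tokens per second

if __name__ == "__main__":
    for c in (1, 10):
        print(f"concurrency={c}: {asyncio.run(benchmark(c)):.0f} tok/s")
```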
Additional notes
Ultimately, I think this comes down to needing a better understanding of Triton and TensorRT-LLM (e.g., parameters, model configs, etc.), which I admittedly don't have. Suggestions for how to create an environment similar to the benchmark would be greatly appreciated.
I am aware of Triton's `instance_group` model parameter, which would allow me to spawn multiple instances of the same model, but it seems like I'm limited to only three on one GPU due to memory constraints. That would only achieve 3 * 80 = 240 tokens per second, which is, again, far from what I'm hoping for.
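For completeness, this is the kind of stanza I mean (standard Triton `config.pbtxt` syntax; the count of 3 is the most my GPU's memory allows):

```
instance_group [
  {
    count: 3
    kind: KIND_GPU
  }
]
```

Though my impression is that with in-flight batching, a single `tensorrt_llm` instance is meant to absorb the concurrency itself, rather than relying on instance replication.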