LanceB57 opened this issue 3 months ago
I'm testing now, and I'm able to get some decent results with 10 concurrent prompts (~700 tokens per second). However, when I add an 11th, I get the following error:
```
[TensorRT-LLM][ERROR] 3: [executionContext.cpp::validateInputBindings::1753] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::validateInputBindings::1753, condition: profileMinDims.d[i] <= dimensions.d[i] Supplied binding dimension [20] for bindings[45] exceed min ~ max range at index 0, maximum dimension in profile is 512, minimum dimension in profile is 256, but supplied dimension is 20.)
```
I'm guessing this is an issue with the number of tokens involved, as changing `max_tokens` from 1024 to 128 allows this to run error-free. However, the error eventually pops up again as I add more concurrent requests. How might I resolve it?
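For reference, the check that is failing can be sketched like this (a toy reconstruction of TensorRT's profile validation, not its actual code):

```python
def validate_binding(supplied_dim: int, profile_min: int, profile_max: int) -> bool:
    # TensorRT rejects any input binding whose dimension falls outside
    # the [min, max] range of the selected optimization profile.
    return profile_min <= supplied_dim <= profile_max

# The failing case from the error above: a dimension of 20 was supplied
# to a profile built for the range [256, 512].
print(validate_binding(20, profile_min=256, profile_max=512))  # → False
```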
It looks like your requests exceed the limits you set when building the engine. Could you share the scripts you used to build the engine?
Sure:
```shell
python3 examples/llama/convert_checkpoint.py --model_dir meta-llama/Meta-Llama-3-8B-Instruct \
    --output_dir ./llama_ckpt --dtype float16 --tp_size 1
trtllm-build --checkpoint_dir llama_ckpt --model_config llama_config.json --strongly_typed \
    --output_dir ./llama_engine --max_batch_size 2048 --max_input_len 2048 --max_output_len 4096 \
    --workers 8 --max_num_tokens 2048 --use_paged_context_fmha enable --multiple_profiles enable
```
I was actually able to resolve this problem by changing `max_attention_window_size` on the Triton Inference Server from 2560 to 4096. Could you give some insight into why this is, though?
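For anyone hitting the same thing, the stanza I changed lives in the `tensorrt_llm` model's `config.pbtxt` (key name as in the tensorrtllm_backend template; double-check against your version):

```
parameters: {
  key: "max_attention_window_size"
  value: {
    string_value: "4096"
  }
}
```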
Also, in the above `trtllm-build` command, what's the difference between `max_output_len` and `max_num_tokens`?
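My rough mental model of the distinction, which I'd love to have confirmed: `max_output_len` is a per-request cap on generated tokens, while `max_num_tokens` is a per-iteration budget shared by every in-flight request (a request in the context phase contributes its whole prompt; a request in the generation phase contributes one token per step). A toy sketch of that reading:

```python
# Assumed semantics, not confirmed by the thread:
MAX_OUTPUT_LEN = 4096   # per request: maximum generated tokens
MAX_NUM_TOKENS = 2048   # per engine iteration: token budget across the batch

def fits_in_iteration(active_token_counts: list[int]) -> bool:
    """Would scheduling these requests exceed the iteration budget?"""
    return sum(active_token_counts) <= MAX_NUM_TOKENS

# Ten requests in generation phase (1 token each) plus one new request
# whose 700-token prompt must be processed in the context phase:
print(fits_in_iteration([1] * 10 + [700]))    # → True
# A 2100-token prompt alone blows the per-iteration budget:
print(fits_in_iteration([1] * 10 + [2100]))   # → False
```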
System Info

Who can help?
@kaiyux

Information

Tasks
- `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Follow this benchmarking article to create the Llama3 8B Instruct model engine (but also using a converted checkpoint for `trtllm-build`). Then, finish setup similar to this blog.

Expected behavior
After following the benchmarking article, I get that throughput is ~10,000 tokens per second. I hope to achieve this on a Triton Inference Server.
Actual behavior
Instead, a single request has throughput of around 80 tokens per second, and I'm struggling to figure out how to efficiently manage concurrent requests. I'm hoping to achieve the ~10,000 tokens per second that the benchmark showed the GPU/engine is capable of reaching.
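To measure this, I've been firing concurrent requests along these lines (sketch only: `generate` here is a hypothetical stand-in for the real client call to the server, simulated with a fixed latency so the script is self-contained and the numbers are illustrative):

```python
# Sketch: aggregate token throughput under N concurrent requests.
import asyncio
import time

async def generate(prompt: str, max_tokens: int = 128) -> int:
    # HYPOTHETICAL stub for a real Triton client call; pretends the
    # server took 50 ms and produced max_tokens tokens.
    await asyncio.sleep(0.05)
    return max_tokens

async def benchmark(concurrency: int) -> float:
    start = time.perf_counter()
    tokens = await asyncio.gather(
        *(generate(f"prompt {i}") for i in range(concurrency))
    )
    return sum(tokens) / (time.perf_counter() - start)  # tokens per second

if __name__ == "__main__":
    for c in (1, 10):
        print(f"concurrency={c}: {asyncio.run(benchmark(c)):.0f} tok/s")
```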
Additional notes
Ultimately, I think this comes down to needing a better understanding of Triton and TensorRT-LLM (e.g., parameters, model configs, etc.), which I admittedly don't have. Suggestions for how to create an environment similar to the benchmark would be greatly appreciated.
I am aware of Triton's `instance_group` model parameter, which would allow me to spawn multiple instances of the same model, but it seems like I'm limited to only three on one GPU due to memory constraints. That would only achieve 3 * 80 = 240 tokens per second, which is, again, far from what I'm hoping for.
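For completeness, this is the kind of stanza I mean (standard Triton `config.pbtxt` syntax; the count of 3 is the most my GPU's memory allows):

```
instance_group [
  {
    count: 3
    kind: KIND_GPU
  }
]
```

Though my impression is that with in-flight batching, a single `tensorrt_llm` instance is meant to absorb the concurrency itself, rather than relying on instance replication.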