Closed activezhao closed 10 months ago
Can you share how to reproduce the performance numbers you get? Which file and which script do you use? Clear reproduction steps would help us reproduce the issue.
@byshiue OK, I will try to give more details soon, but one key point is that the "batch_size" I use is 1.
In fact, I also tested with the TensorRT-LLM benchmark, but the result is even worse.
https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/benchmarks/python/README.md
And I use the following command for the test:
mpirun --allow-run-as-root -n 4 python benchmark.py \
-m llama_7b \
--mode plugin \
--batch_size "1" \
--input_output_len "500,200"
The tokenPerSecond is only 91. Is something wrong?
I am not sure whether a tokenPerSecond of 91 is reasonable, because I don't know how you compute tokenPerSecond (does it include input + output tokens, or only output tokens?). I also don't know your network topology and hardware settings, which are important for tensor parallelism.
@byshiue I tested on A10 GPUs, and the model is llama_7b.
The "tokens_per_sec" value is reported by the TensorRT-LLM benchmark, which is:
https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/benchmarks/python/benchmark.py
and tokens_per_sec is computed as follows:
https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/benchmarks/python/gpt_benchmark.py#L459
The command and result are as follows:
mpirun --allow-run-as-root -n 4 python benchmark.py -m llama_7b --mode plugin --batch_size "1" --input_output_len "500,200"
[BENCHMARK] model_name llama_7b world_size 4 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 32000 precision float16 batch_size 1 input_length 500 output_length 200 gpu_peak_mem(gb) 6.78 build_time(s) 21.03 tokens_per_sec 95.27 percentile95(ms) 2108.397 percentile99(ms) 2111.005 latency(ms) 2099.3 compute_cap sm70
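For readers who want to compare several runs, the `[BENCHMARK]` line above can be split into fields. This is a minimal sketch that assumes the line is always `[BENCHMARK]` followed by alternating key/value tokens separated by whitespace, as in the output shown here; `parse_benchmark_line` is a hypothetical helper, not part of TensorRT-LLM, and the sample line is an abbreviated copy of the output above.

```python
def parse_benchmark_line(line: str) -> dict:
    """Split a '[BENCHMARK] key value key value ...' line into a dict of strings."""
    tokens = line.split()
    if not tokens or tokens[0] != "[BENCHMARK]":
        raise ValueError("not a [BENCHMARK] line")
    fields = tokens[1:]
    if len(fields) % 2 != 0:
        raise ValueError("expected alternating key/value tokens")
    # Even positions are keys, odd positions are the matching values.
    return dict(zip(fields[0::2], fields[1::2]))


# Abbreviated copy of the benchmark output above (assumption: same token layout).
line = (
    "[BENCHMARK] model_name llama_7b world_size 4 batch_size 1 "
    "input_length 500 output_length 200 tokens_per_sec 95.27 "
    "latency(ms) 2099.3"
)
result = parse_benchmark_line(line)
print(result["tokens_per_sec"], result["latency(ms)"])
```

All values stay as strings; convert them with `float()` or `int()` as needed.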
Thank you for the response. The tokens_per_sec in benchmark.py is computed as bs * output_len / second. In that scenario, a tokens_per_sec of 95 looks reasonable to me at batch size 1.
You can try a larger batch size to increase the throughput.
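As a quick sanity check, the formula bs * output_len / second can be applied to the numbers in the benchmark output (batch_size 1, output_length 200, latency 2099.3 ms). This is only a sketch of the arithmetic; it assumes "second" is the reported latency, and the function below is illustrative rather than the code in gpt_benchmark.py.

```python
def tokens_per_sec(batch_size: int, output_len: int, latency_ms: float) -> float:
    """Output tokens generated per second: bs * output_len / seconds."""
    return batch_size * output_len / (latency_ms / 1000.0)


# Numbers taken from the [BENCHMARK] line in this thread.
value = tokens_per_sec(batch_size=1, output_len=200, latency_ms=2099.3)
print(round(value, 2))  # 95.27
```

Only output tokens are counted, so the 500 input tokens do not contribute to the figure.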
@byshiue Thanks for your quick reply. In fact, I have tested with larger batch sizes, and the highest tokens_per_sec I got is 1129.
But I have a question: when I send requests to the endpoint in inflight-batching mode, like this
curl -X POST localhost:8000/v2/models/${MODEL_NAME}/generate -d '{"{PARAM1_KEY}": "{PARAM1_VALUE}", ... }'
and the max_batch_size is 128 and tp is 4, will the throughput increase as expected?
I expect so. Do you have any concerns?
@byshiue I will do more tests then. But I found that there is a parameter named "max_batch_size" when building engines. It seems weird to me: why do we have to set a value at build time?
And if I set "max_batch_size" to 8 when building, but the value in config.pbtxt is 128, what will happen? Will the real "max_batch_size" be 8?
What's more, if I set "max_batch_size" to 128, but the value in config.pbtxt is 64, what will happen? Will the real "max_batch_size" be 64, or will there be an error?
https://github.com/NVIDIA/TensorRT-LLM/blob/11e14500f35dd781b535ba009c906f55ecfee3b5/examples/llama/build.py#L158C6-L158C6
I also found a file named config.json among the built engine files, and "max_batch_size" appears there.
So is setting "max_batch_size" when building engines only used to write the value into this file, with no other effect on the build? And does the batch size in config.json take priority over the one in config.pbtxt?
{
  "builder_config": {
    "fp8": false,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "int8": false,
    "max_batch_size": 64,
    "max_input_len": 2048,
    "max_num_tokens": null,
    "max_output_len": 512,
    "max_position_embeddings": 16384,
    "name": "llama",
    "num_heads": 32,
    "num_kv_heads": 32,
    "num_layers": 32,
    "parallel_build": false,
    "pipeline_parallel": 1,
    "precision": "float16",
    "quant_mode": 0,
    "tensor_parallel": 4,
    "use_refit": false,
    "vocab_size": 32016
  },
  "plugin_config": {
    "attention_qk_half_accumulation": false,
    "bert_attention_plugin": false,
    "context_fmha_type": 1,
    "gemm_plugin": "float16",
    "gpt_attention_plugin": "float16",
    "identity_plugin": false,
    "layernorm_plugin": false,
    "layernorm_quantization_plugin": false,
    "lookup_plugin": false,
    "nccl_plugin": "float16",
    "paged_kv_cache": true,
    "quantize_per_token_plugin": false,
    "quantize_tensor_plugin": false,
    "remove_input_padding": true,
    "rmsnorm_plugin": false,
    "rmsnorm_quantization_plugin": false,
    "smooth_quant_gemm_plugin": false,
    "tokens_per_block": 64,
    "use_custom_all_reduce": false,
    "weight_only_groupwise_quant_matmul_plugin": false,
    "weight_only_quant_matmul_plugin": false
  }
}
The max_batch_size in config.json is a hyper-parameter of the engine: the maximum batch size the engine supports. It is used to compute the workspace needed during inference.
The max_batch_size in the backend's config.pbtxt is the maximum batch size of the requests you will receive and send to the server.
So, when the batch_size you send to the server is larger than the max_batch_size in the engine's config.json, it is invalid and should throw an error.
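The relationship between the two limits can be illustrated with a small check. This is a sketch under stated assumptions, not how Triton or TensorRT-LLM actually enforce the limits: `check_batch_size` is a hypothetical helper, the engine limit is read from the `builder_config.max_batch_size` field shown in the config.json above, and the config.pbtxt value is passed in as a plain integer.

```python
import json


def check_batch_size(request_batch_size: int, engine_config: dict,
                     triton_max_batch_size: int) -> None:
    """Raise ValueError if a request exceeds either configured limit."""
    engine_max = engine_config["builder_config"]["max_batch_size"]
    if request_batch_size > triton_max_batch_size:
        raise ValueError(
            f"batch size {request_batch_size} exceeds config.pbtxt "
            f"max_batch_size {triton_max_batch_size}"
        )
    if request_batch_size > engine_max:
        raise ValueError(
            f"batch size {request_batch_size} exceeds engine "
            f"max_batch_size {engine_max}"
        )


# Engine built with max_batch_size 64, config.pbtxt set to 128 (illustrative values).
engine_config = json.loads('{"builder_config": {"max_batch_size": 64}}')
check_batch_size(32, engine_config, triton_max_batch_size=128)  # within both limits

try:
    check_batch_size(100, engine_config, triton_max_batch_size=128)
except ValueError as err:
    print(err)
```

In this sketch a request of 100 passes the config.pbtxt limit but fails the engine limit, which is the invalid case described above.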
@byshiue Thank you so much for your detailed answer, I learned a lot.
I set max_batch_size to 64 in config.json and measured tokens_per_sec in our own way. The result is about 1,000, which seems to be on par with vLLM and which I think is within expectations.
Thanks again for your help.
@activezhao Thank you for the update. It looks like your issue is solved, so I am closing this bug. Feel free to reopen it if needed.
I have used Triton Server + FT in the past, and now I use Triton Server + TensorRT-LLM with inflight-batching, but there is a big gap in outputTokensPerSecond between them.
The Triton Server of TensorRT-LLM is:
The configuration of TensorRT-LLM is:
The command that builds the TensorRT engine(s) from HF is:
and the outputTokensPerSecond is as follows:
This difference is so weird; I don't know what the problem is.