Why the outputTokensPerSecond is much smaller than Fastertransformer?

activezhao commented 10 months ago

I used Triton Server + FT in the past and now use Triton Server + TensorRT-LLM with in-flight batching, but there is a big gap in outputTokensPerSecond between the two.

max_new_tokens: 256
tp: 4
model: codeLlama-7b

The Triton Server image for TensorRT-LLM is:

nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3

The TensorRT-LLM configuration is:

https://github.com/triton-inference-server/tensorrtllm_backend/tree/release/0.5.0/all_models/inflight_batcher_llm

The command that builds the TensorRT engine(s) from the HF model is:

python build.py --model_dir ./META-CodeLlama-7b-hf/  \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --paged_kv_cache \
                --use_inflight_batching \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir /tensorrtllm_backend/trt_llama_7b_fp16/4-gpu/  \
                --world_size 4 \
                --tp_size 4

and the outputTokensPerSecond results are as follows:

FT: 1800
TensorRT-LLM: 470

This difference is surprising, and I don't know what the problem is.

byshiue commented 10 months ago

Can you share how to reproduce the performance numbers you get? Which files and scripts do you use? Clear reproduction steps would help us reproduce the issue.

activezhao commented 10 months ago

@byshiue OK, I will try to give more details soon, but one key point is that the "batch_size" I use is 1.

In fact, I also tested with the TensorRT-LLM benchmark, but the result is even worse.

https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/benchmarks/python/README.md

And I use the following command for the test:

mpirun --allow-run-as-root -n 4 python benchmark.py \
    -m llama_7b \
    --mode plugin \
    --batch_size "1" \
    --input_output_len "500,200"

The tokens per second is only 91. Is there something wrong?

byshiue commented 10 months ago

I am not sure whether a tokenPerSecond of 91 is reasonable, because I do not know how you compute it (does it include input + output tokens, or only output tokens?), and I don't know your network topology and hardware settings, which are important for tensor parallelism.

activezhao commented 10 months ago

@byshiue I use A10 GPUs for the test, and the model is llama_7b.

The "tokens_per_sec" value is reported by the TensorRT-LLM benchmark, which is:

https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/benchmarks/python/benchmark.py

and tokens_per_sec is computed as follows:

https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/benchmarks/python/gpt_benchmark.py#L459

The command and result are as follows:

mpirun --allow-run-as-root -n 4 python benchmark.py     -m llama_7b     --mode plugin     --batch_size "1"     --input_output_len "500,200"

[BENCHMARK] model_name llama_7b world_size 4 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 32000 precision float16 batch_size 1 input_length 500 output_length 200 gpu_peak_mem(gb) 6.78 build_time(s) 21.03 tokens_per_sec 95.27 percentile95(ms) 2108.397 percentile99(ms) 2111.005 latency(ms) 2099.3 compute_cap sm70

byshiue commented 10 months ago

Thank you for the response. The tokens_per_sec in benchmark.py is computed as bs * output_len / second. In that case, a tokens_per_sec of 95 looks reasonable to me for batch size 1.

You can try a larger batch size to increase the throughput.
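
As a quick sanity check of the formula above, the reported figure can be reproduced from the numbers in the [BENCHMARK] line (batch_size 1, output_length 200, latency 2099.3 ms). This is only a sketch: the function name `tokens_per_sec` is illustrative and is not taken from benchmark.py. Note that the 500 input tokens do not enter the calculation.

```python
def tokens_per_sec(batch_size: int, output_len: int, latency_ms: float) -> float:
    """Output-token throughput: batch_size * output_len / latency in seconds."""
    return batch_size * output_len / (latency_ms / 1000.0)


# Values copied from the [BENCHMARK] log line in this thread.
print(round(tokens_per_sec(batch_size=1, output_len=200, latency_ms=2099.3), 2))
# prints 95.27
```

With batch size 1 the throughput is bounded by the single-request latency, which is why a larger batch size is the suggested way to raise it.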

activezhao commented 10 months ago

@byshiue Thanks for your quick reply. In fact, I have tested with larger batch sizes, and the highest tokens_per_sec I got is 1129.

But I have a question. When I send requests to the endpoint in in-flight batching mode, like this:

curl -X POST localhost:8000/v2/models/${MODEL_NAME}/generate -d '{"{PARAM1_KEY}": "{PARAM1_VALUE}", ... }'

with max_batch_size set to 128 and tp set to 4, will the throughput increase as expected?

byshiue commented 10 months ago

I expect so. Do you have any concerns?
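
One way to check this from the client side is to send many requests concurrently and divide the total number of output tokens by the wall-clock time. The sketch below is a hypothetical measurement harness, not part of Triton or TensorRT-LLM: `measure_output_tokens_per_sec` is an illustrative name, and `fake_send_request` is a stand-in that only sleeps. In a real test it would be replaced by a function that posts to the generate endpoint and returns the number of tokens generated.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def measure_output_tokens_per_sec(
    send_request: Callable[[str], int],
    prompts: Sequence[str],
    concurrency: int,
) -> float:
    """Send all prompts with `concurrency` workers; return output tokens per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(send_request, prompts))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed


def fake_send_request(prompt: str) -> int:
    """Stand-in for a real HTTP call: pretends one request takes 50 ms and yields 256 tokens."""
    time.sleep(0.05)
    return 256


prompts = ["example prompt"] * 8
sequential = measure_output_tokens_per_sec(fake_send_request, prompts, concurrency=1)
concurrent = measure_output_tokens_per_sec(fake_send_request, prompts, concurrency=8)
print(f"concurrency=1: {sequential:.0f} tokens/s")
print(f"concurrency=8: {concurrent:.0f} tokens/s")
```

A single client sending one request at a time measures per-request latency, so in-flight batching can only show its benefit when several requests are in flight at once.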

activezhao commented 10 months ago

@byshiue I will just do more tests then. But I found that when building engines there is a parameter named "max_batch_size". This seems weird to me: why do we have to set a value at build time?

And if I set "max_batch_size" to 8 at build time but the value in config.pbtxt is 128, what will happen? Will the real "max_batch_size" be 8?

What's more, if I set "max_batch_size" to 128 at build time but the value in config.pbtxt is 64, what will happen? Will the real "max_batch_size" be 64? Or will it be an error?

https://github.com/NVIDIA/TensorRT-LLM/blob/11e14500f35dd781b535ba009c906f55ecfee3b5/examples/llama/build.py#L158C6-L158C6

I also found that among the files produced by the engine build there is a file named config.json, and "max_batch_size" appears there.

So is the "max_batch_size" value set at build time only written to this file, with no other effect on the build? And does the batch value in config.json take priority over the one in config.pbtxt?

{
  "builder_config": {
    "fp8": false,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "int8": false,
    "max_batch_size": 64,
    "max_input_len": 2048,
    "max_num_tokens": null,
    "max_output_len": 512,
    "max_position_embeddings": 16384,
    "name": "llama",
    "num_heads": 32,
    "num_kv_heads": 32,
    "num_layers": 32,
    "parallel_build": false,
    "pipeline_parallel": 1,
    "precision": "float16",
    "quant_mode": 0,
    "tensor_parallel": 4,
    "use_refit": false,
    "vocab_size": 32016
  },
  "plugin_config": {
    "attention_qk_half_accumulation": false,
    "bert_attention_plugin": false,
    "context_fmha_type": 1,
    "gemm_plugin": "float16",
    "gpt_attention_plugin": "float16",
    "identity_plugin": false,
    "layernorm_plugin": false,
    "layernorm_quantization_plugin": false,
    "lookup_plugin": false,
    "nccl_plugin": "float16",
    "paged_kv_cache": true,
    "quantize_per_token_plugin": false,
    "quantize_tensor_plugin": false,
    "remove_input_padding": true,
    "rmsnorm_plugin": false,
    "rmsnorm_quantization_plugin": false,
    "smooth_quant_gemm_plugin": false,
    "tokens_per_block": 64,
    "use_custom_all_reduce": false,
    "weight_only_groupwise_quant_matmul_plugin": false,
    "weight_only_quant_matmul_plugin": false
  }
}

byshiue commented 10 months ago

The max_batch_size in config.json is a hyper-parameter of the engine: it is the maximum batch size this engine supports. It is used to compute the workspace needed during inference.

The max_batch_size in the backend's config.pbtxt is the maximum batch size of the requests you will receive and send to the server.

So, when the batch size you send to the server is larger than the max_batch_size in the engine's config.json, the request is invalid and should throw an error.
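
The relationship described above can be checked before starting the server. The following is a minimal sketch, not something the backend provides: `check_batch_limits` is a hypothetical helper, and it assumes config.pbtxt contains a top-level line of the form `max_batch_size: N` and that the engine's config.json has the `builder_config.max_batch_size` field shown earlier in this thread.

```python
import json
import re


def check_batch_limits(engine_config_json: str, config_pbtxt: str) -> tuple:
    """Return (engine_max, triton_max); raise if Triton could exceed the engine limit."""
    engine_max = json.loads(engine_config_json)["builder_config"]["max_batch_size"]
    # Assumption: config.pbtxt has a line such as "max_batch_size: 128".
    match = re.search(r"^\s*max_batch_size\s*:\s*(\d+)", config_pbtxt, re.MULTILINE)
    if match is None:
        raise ValueError("max_batch_size not found in config.pbtxt")
    triton_max = int(match.group(1))
    if triton_max > engine_max:
        raise ValueError(
            f"config.pbtxt allows batch size {triton_max}, "
            f"but the engine was built for at most {engine_max}"
        )
    return engine_max, triton_max


engine_json = '{"builder_config": {"max_batch_size": 64}}'
print(check_batch_limits(engine_json, "max_batch_size: 64"))
try:
    check_batch_limits(engine_json, "max_batch_size: 128")
except ValueError as err:
    print(f"mismatch: {err}")
```

Keeping the value in config.pbtxt at or below the value the engine was built with avoids sending the engine a batch it cannot hold.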

activezhao commented 10 months ago

@byshiue Thank you so much for your detailed answer, I learnt a lot.

I set max_batch_size to 64 in config.json and measured tokens_per_sec in our own way. The current figure seems to be on par with vLLM at about 1,000, which I think is within expectations.

Thank you again for your help.

byshiue commented 10 months ago

@activezhao Thank you for the update. It looks like your issue is solved, so I am closing this bug. Feel free to reopen it if needed.

triton-inference-server / tensorrtllm_backend

Why the outputTokensPerSecond is much smaller than Fastertransformer? #72