triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

`max_batch_size` seems to have no impact on model performance #429

Open VitalyPetrov opened 2 months ago

VitalyPetrov commented 2 months ago

System Info

Who can help?

@kaiyux

Information

Tasks

Reproduction

Given the following official guide for serializing and serving Llama-based LLMs: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md

I am using the Vikhr 7B model, which is built on the LlamaForCausalLM architecture.

The commands for building the model are as follows:

python3 convert_checkpoint.py \
        --model_dir=/hf/models/vikhr7b \
        --output_dir=/vikhr/xxx/checkpoint/ \
        --dtype float16 \
        --tp_size 1

 trtllm-build \
        --checkpoint_dir=/vikhr/xxx/checkpoint/ \
        --output_dir=/vikhr/xxx/trt_engine/ \
        --gemm_plugin float16 \
        --max_batch_size 64 \
        --max_input_len 4096 \
        --workers=1

After that, I fill in the Triton config files and launch the Triton server:

cp -r /vikhr/xxx/trt_engine/ /models/vikhr_model/tensorrt_llm/1/

python3 utils/fill_template.py -i ./models/vikhr_model/preprocessing/config.pbtxt tokenizer_dir:/models/vikhr_tokenizer/,tokenizer_type:llama,triton_max_batch_size:64,preprocessing_instance_count:1
python3 utils/fill_template.py -i ./models/vikhr_model/postprocessing/config.pbtxt tokenizer_dir:/models/vikhr_tokenizer/,tokenizer_type:llama,triton_max_batch_size:64,postprocessing_instance_count:1
python3 utils/fill_template.py -i ./models/vikhr_model/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 utils/fill_template.py -i /models/vikhr_model/ensemble/config.pbtxt triton_max_batch_size:64
python3 utils/fill_template.py -i ./models/vikhr_model/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:/models/vikhr_model/tensorrt_llm/1/,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_batching,max_queue_delay_microseconds:600

Launch the server

python3 launch_triton_server.py --world_size 1 --model_repo=/models/vikhr_model/

The problem is that there is no difference between two different batch sizes in terms of model performance (the number of seconds it takes the LLM to return a response on /v2/models/ensemble/generate).

Consider two different max_batch_size values: 64 and 128. The second requires significantly more VRAM but offers no performance gain over the smaller batch size.
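
A larger max_batch_size should mainly pay off under concurrent load, when the in-flight batcher actually has several requests to fuse. A minimal way to put such load on the endpoint is sketched below; it assumes Triton's default HTTP port 8000 and the same ensemble generate endpoint, and the prompt and max_tokens values are placeholders.

# Sketch: fire 64 requests in parallel against the ensemble generate endpoint
# so that the in-flight batcher actually has something to batch.
# Assumes the default HTTP port 8000; prompt and max_tokens are placeholders.
for i in $(seq 1 64); do
  curl -s -X POST localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "What is machine learning?", "max_tokens": 128, "bad_words": "", "stop_words": ""}' \
    > /dev/null &
done
wait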

Expected behavior

Greater values of max_batch_size require more VRAM but should yield better inference performance (higher throughput).

Actual behavior

No difference in performance between the two max_batch_size values.

Additional notes

I expected the difference to show up with wider context windows, but the trend is the same there as well.

VitalyPetrov commented 2 months ago

@byshiue it seems like you can also help with this issue

aptmess commented 2 months ago

Exactly the same behavior on Mixtral 8x7B.

+1, need help!

byshiue commented 2 months ago

What are the input length and output length of your requests?

VitalyPetrov commented 2 months ago

@byshiue max_input_len = 4096 and max_output_len = 512

wanzhenchn commented 2 months ago

@byshiue I have also observed a similar phenomenon. Going from batch_size = 4 to batch_size = 8, 16, and 32, the Token QPS and Service QPS show little variation, while the average latency approximately doubles at each step.

What could be causing this?

[screenshot: benchmark results table]

The specifics are as follows:

# The commands for building the TRT-LLM engine
python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /path/to/vicuna-13b-v1.5 \
    --tp_size 2 \
    --dtype float16 \
    --output_dir vicuna-13b-converted

trtllm-build --checkpoint_dir vicuna-13b-converted \
  --gemm_plugin float16 \
  --gpt_attention_plugin float16 \
  --max_batch_size  64 \
  --max_input_len 2048 \
  --max_output_len 1024 \
  --output_dir vicuna-13b-engine 

# After that, fill in the triton-model-repo config files and launch the triton server:
triton_model_dir=$1
tokenizer_dir=$2

cp -r all_models/inflight_batcher_llm ${triton_model_dir}

triton_max_batch_size=64
kv_cache_free_gpu_mem_fraction=0.9
max_beam_width=1
max_queue_delay_microseconds=0

engine_path=${triton_model_dir%/}/tensorrt_llm/1
engine_config_path=${triton_model_dir%/}/tensorrt_llm/config.pbtxt
preprocess_config_path=${triton_model_dir%/}/preprocessing/config.pbtxt
postprocess_config_path=${triton_model_dir%/}/postprocessing/config.pbtxt
ensemble_config_path=${triton_model_dir%/}/ensemble/config.pbtxt
bls_config_path=${triton_model_dir%/}/tensorrt_llm_bls/config.pbtxt

python fill_template.py --in_place ${engine_config_path} \
  triton_max_batch_size:${triton_max_batch_size},batching_strategy:inflight_fused_batching,engine_dir:${engine_path},batch_scheduler_policy:max_utilization,decoupled_mode:True,kv_cache_free_gpu_mem_fraction:${kv_cache_free_gpu_mem_fraction},max_beam_width:${max_beam_width},max_queue_delay_microseconds:${max_queue_delay_microseconds}

python fill_template.py --in_place ${preprocess_config_path} \
  tokenizer_dir:${tokenizer_dir},triton_max_batch_size:${triton_max_batch_size},preprocessing_instance_count:1

python fill_template.py --in_place ${postprocess_config_path} \
  tokenizer_dir:${tokenizer_dir},triton_max_batch_size:${triton_max_batch_size},postprocessing_instance_count:1

python fill_template.py --in_place ${ensemble_config_path} \
  triton_max_batch_size:${triton_max_batch_size}

python fill_template.py --in_place ${bls_config_path} \
  triton_max_batch_size:${triton_max_batch_size},decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False

# Launch the server
python scripts/launch_triton_server.py --model_repo=${triton_model_dir} --world_size 2

Then I run a benchmark against the /v2/models/ensemble/generate_stream API using the coroutine function from vLLM: https://github.com/vllm-project/vllm/blob/eefeb16464af5f3a61e3052d1a4128480bff7f47/benchmarks/backend_request_func.py#L102
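
For reference, a single streaming request against that endpoint can be issued with curl roughly as follows; this is a sketch that assumes the default HTTP port 8000 and that streaming is enabled for the served model, with placeholder prompt and max_tokens values.

# Sketch: one streaming (SSE) request to the ensemble model.
# Assumes the default HTTP port 8000; prompt and max_tokens are placeholders.
curl -s -X POST localhost:8000/v2/models/ensemble/generate_stream \
  -d '{"text_input": "What is machine learning?", "max_tokens": 128, "bad_words": "", "stop_words": "", "stream": true}'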

wanzhenchn commented 1 month ago

@byshiue Is there any update on this issue?

kaiyux commented 1 month ago

@VitalyPetrov Thanks for providing the details, I'll try to reproduce the issue. Are you using a 40GB or an 80GB A100? Did you observe the actual runtime batch size?

A potential reason could be that the actual batch size is limited by GPU memory, so it never reaches 64; in that case, increasing max_batch_size has no effect.
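
One way to sanity-check the actual runtime batch size is to scrape Triton's Prometheus metrics endpoint (port 8002 by default) while the server is under load. The exact metric names vary across Triton and backend versions, so the sketch below greps broadly rather than for a specific counter.

# Sketch: inspect batching / in-flight request statistics during a load test.
# Metric names depend on the Triton and TensorRT-LLM backend versions.
curl -s localhost:8002/metrics | grep -iE 'batch|inflight'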

VitalyPetrov commented 1 month ago

@kaiyux we use the 80 GB version of the A100.

If that assumption were correct, there should be some difference between two relatively low values of max_batch_size (say, 2 and 8). However, even there we see no noticeable impact on LLM performance.