triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

`max_batch_size` seems to have no impact on model performance #429

Open VitalyPetrov opened 2 months ago

VitalyPetrov commented 2 months ago

System Info

Who can help?

@kaiyux

Information

Tasks

Reproduction

Given the following official guide for serializing and serving Llama-based LLMs: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md

I am using the Vikhr 7B model, which is built on the LlamaForCausalLM architecture.

The commands for building the model are as follows:

python3 convert_checkpoint.py \
        --model_dir=/hf/models/vikhr7b \
        --output_dir=/vikhr/xxx/checkpoint/ \
        --dtype float16 \
        --tp_size 1

 trtllm-build \
        --checkpoint_dir=/vikhr/xxx/checkpoint/ \
        --output_dir=/vikhr/xxx/trt_engine/ \
        --gemm_plugin float16 \
        --max_batch_size 64 \
        --max_input_len 4096 \
        --workers=1

After that, I fill in the Triton config files and launch the Triton server:

cp -r /vikhr/xxx/trt_engine/ /models/vikhr_model/tensorrt_llm/1/

python3 utils/fill_template.py -i ./models/vikhr_model/preprocessing/config.pbtxt tokenizer_dir:/models/vikhr_tokenizer/,tokenizer_type:llama,triton_max_batch_size:64,preprocessing_instance_count:1
python3 utils/fill_template.py -i ./models/vikhr_model/postprocessing/config.pbtxt tokenizer_dir:/models/vikhr_tokenizer/,tokenizer_type:llama,triton_max_batch_size:64,postprocessing_instance_count:1
python3 utils/fill_template.py -i ./models/vikhr_model/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 utils/fill_template.py -i /models/vikhr_model/ensemble/config.pbtxt triton_max_batch_size:64
python3 utils/fill_template.py -i ./models/vikhr_model/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:/models/vikhr_model/tensorrt_llm/1/,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_batching,max_queue_delay_microseconds:600

Launch the server

python3 launch_triton_server.py --world_size 1 --model_repo=/models/vikhr_model/

The problem is that there is no difference between two different batch sizes in terms of model performance (the number of seconds it takes the LLM to return a response on /v2/models/ensemble/generate).

Consider two different max_batch_size values: 64 and 128. The second requires significantly more VRAM but offers no performance gain over the smaller batch size.
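
A larger max_batch_size should mainly pay off under concurrent load, when the in-flight batcher actually has several requests to fuse. A minimal way to put such load on the endpoint is sketched below; it assumes Triton's default HTTP port 8000 and the same ensemble generate endpoint, and the prompt and max_tokens values are placeholders.

# Sketch: fire 64 requests in parallel against the ensemble generate endpoint
# so that the in-flight batcher actually has something to batch.
# Assumes the default HTTP port 8000; prompt and max_tokens are placeholders.
for i in $(seq 1 64); do
  curl -s -X POST localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "What is machine learning?", "max_tokens": 128, "bad_words": "", "stop_words": ""}' \
    > /dev/null &
done
wait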

Expected behavior

Greater values of max_batch_size require more VRAM but should yield better inference performance (higher throughput).

Actual behavior

No difference in performance between the two max_batch_size values.

Additional notes

I expected the difference to show up with wider context windows, but the trend is the same there as well.

VitalyPetrov commented 2 months ago

@byshiue it seems like you can also help with this issue

aptmess commented 2 months ago

Exactly the same behavior on Mixtral 8x7B.

+1, need help!

byshiue commented 2 months ago

What are the input length and output length of your requests?

VitalyPetrov commented 2 months ago

@byshiue max_input_len = 4096 and max_output_len = 512

wanzhenchn commented 2 months ago

@byshiue I have also observed a similar phenomenon. Going from batch_size = 4 to batch_size = 8, 16, and 32, the Token QPS and Service QPS show little variation, while the average latency approximately doubles at each step.

What could be causing this?

[screenshot: benchmark results table]

The specifics are as follows:

# The commands for building the TRT-LLM engine
python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /path/to/vicuna-13b-v1.5 \
    --tp_size 2 \
    --dtype float16 \
    --output_dir vicuna-13b-converted

trtllm-build --checkpoint_dir vicuna-13b-converted \
  --gemm_plugin float16 \
  --gpt_attention_plugin float16 \
  --max_batch_size  64 \
  --max_input_len 2048 \
  --max_output_len 1024 \
  --output_dir vicuna-13b-engine 

# After that, fill in the triton-model-repo config files and launch the triton server:
triton_model_dir=$1
tokenizer_dir=$2

cp -r all_models/inflight_batcher_llm ${triton_model_dir}

triton_max_batch_size=64
kv_cache_free_gpu_mem_fraction=0.9
max_beam_width=1
max_queue_delay_microseconds=0

engine_path=${triton_model_dir%/}/tensorrt_llm/1
engine_config_path=${triton_model_dir%/}/tensorrt_llm/config.pbtxt
preprocess_config_path=${triton_model_dir%/}/preprocessing/config.pbtxt
postprocess_config_path=${triton_model_dir%/}/postprocessing/config.pbtxt
ensemble_config_path=${triton_model_dir%/}/ensemble/config.pbtxt
bls_config_path=${triton_model_dir%/}/tensorrt_llm_bls/config.pbtxt

python fill_template.py --in_place ${engine_config_path} \
  triton_max_batch_size:${triton_max_batch_size},batching_strategy:inflight_fused_batching,engine_dir:${engine_path},batch_scheduler_policy:max_utilization,decoupled_mode:True,kv_cache_free_gpu_mem_fraction:${kv_cache_free_gpu_mem_fraction},max_beam_width:${max_beam_width},max_queue_delay_microseconds:${max_queue_delay_microseconds}

python fill_template.py --in_place ${preprocess_config_path} \
  tokenizer_dir:${tokenizer_dir},triton_max_batch_size:${triton_max_batch_size},preprocessing_instance_count:1

python fill_template.py --in_place ${postprocess_config_path} \
  tokenizer_dir:${tokenizer_dir},triton_max_batch_size:${triton_max_batch_size},postprocessing_instance_count:1

python fill_template.py --in_place ${ensemble_config_path} \
  triton_max_batch_size:${triton_max_batch_size}

python fill_template.py --in_place ${bls_config_path} \
  triton_max_batch_size:${triton_max_batch_size},decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False

# Launch the server
python scripts/launch_triton_server.py --model_repo=${triton_model_dir} --world_size 2

Then I run a benchmark against the /v2/models/ensemble/generate_stream API using the coroutine function from vLLM: https://github.com/vllm-project/vllm/blob/eefeb16464af5f3a61e3052d1a4128480bff7f47/benchmarks/backend_request_func.py#L102
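
For reference, a single streaming request against that endpoint can be issued with curl roughly as follows; this is a sketch that assumes the default HTTP port 8000 and that streaming is enabled for the served model, with placeholder prompt and max_tokens values.

# Sketch: one streaming (SSE) request to the ensemble model.
# Assumes the default HTTP port 8000; prompt and max_tokens are placeholders.
curl -s -X POST localhost:8000/v2/models/ensemble/generate_stream \
  -d '{"text_input": "What is machine learning?", "max_tokens": 128, "bad_words": "", "stop_words": "", "stream": true}'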

wanzhenchn commented 1 month ago

@byshiue Is there any update on this issue?

kaiyux commented 1 month ago

@VitalyPetrov Thanks for providing the details, I'll try to reproduce the issue. Are you using a 40GB or an 80GB A100? Did you observe the actual runtime batch size?

A potential reason could be that the actual batch size is limited by GPU memory, so it never reaches 64; in that case, increasing max_batch_size has no effect.
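
One way to sanity-check the actual runtime batch size is to scrape Triton's Prometheus metrics endpoint (port 8002 by default) while the server is under load. The exact metric names vary across Triton and backend versions, so the sketch below greps broadly rather than for a specific counter.

# Sketch: inspect batching / in-flight request statistics during a load test.
# Metric names depend on the Triton and TensorRT-LLM backend versions.
curl -s localhost:8002/metrics | grep -iE 'batch|inflight'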

VitalyPetrov commented 1 month ago

@kaiyux we use the 80 GB version of the A100.

If that assumption were correct, there should be some difference between two relatively low values of max_batch_size (say, 2 and 8). However, even there we see no noticeable impact on LLM performance.