Open · VitalyPetrov opened this issue 2 months ago
@byshiue it seems like you can also help with this issue
Exactly the same behavior on Mixtral 8x7B.
+1, need help!
What are the input length and output length of your requests?
@byshiue max_input_len = 4096
and max_output_len = 512
@byshiue I have also observed similar behavior. Going from batch_size = 4 to batch_size = 8, 16, and 32, the token QPS and service QPS show little variation, while the average latency roughly doubles at each step. What could be the cause?
The specifics are as follows:
# the cmd for building trt-llm engine
python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
--model_dir /path/to/vicuna-13b-v1.5 \
--tp_size 2 \
--dtype float16 \
--output_dir vicuna-13b-converted
trtllm-build --checkpoint_dir vicuna-13b-converted \
--gemm_plugin float16 \
--gpt_attention_plugin float16 \
--max_batch_size 64 \
--max_input_len 2048 \
--max_output_len 1024 \
--output_dir vicuna-13b-engine
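As an aside, once the build finishes it can help to confirm which limits were actually compiled into the engine. A minimal sketch, assuming the engine directory contains a config.json whose key names ("build_config" etc.) match recent TensorRT-LLM versions; adjust to whatever your version actually writes:
import json

with open("vicuna-13b-engine/config.json") as f:
    cfg = json.load(f)

# Older versions keep these fields at the top level instead of under "build_config"
build_cfg = cfg.get("build_config", cfg)
for key in ("max_batch_size", "max_input_len", "max_output_len"):
    print(key, "=", build_cfg.get(key))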
# After that I fill in the triton-model-repo config files and run the triton-server:
triton_model_dir=$1
tokenizer_dir=$2
cp -r all_models/inflight_batcher_llm ${triton_model_dir}
triton_max_batch_size=64
kv_cache_free_gpu_mem_fraction=0.9
max_beam_width=1
max_queue_delay_microseconds=0
engine_path=${triton_model_dir%/}/tensorrt_llm/1
engine_config_path=${triton_model_dir%/}/tensorrt_llm/config.pbtxt
preprocess_config_path=${triton_model_dir%/}/preprocessing/config.pbtxt
postprocess_config_path=${triton_model_dir%/}/postprocessing/config.pbtxt
ensemble_config_path=${triton_model_dir%/}/ensemble/config.pbtxt
bls_config_path=${triton_model_dir%/}/tensorrt_llm_bls/config.pbtxt
python fill_template.py --in_place ${engine_config_path} \
triton_max_batch_size:${triton_max_batch_size},batching_strategy:inflight_fused_batching,engine_dir:${engine_path},batch_scheduler_policy:max_utilization,decoupled_mode:True,kv_cache_free_gpu_mem_fraction:${kv_cache_free_gpu_mem_fraction},max_beam_width:${max_beam_width},max_queue_delay_microseconds:${max_queue_delay_microseconds}
python fill_template.py --in_place ${preprocess_config_path} \
tokenizer_dir:${tokenizer_dir},triton_max_batch_size:${triton_max_batch_size},preprocessing_instance_count:1
python fill_template.py --in_place ${postprocess_config_path} \
tokenizer_dir:${tokenizer_dir},triton_max_batch_size:${triton_max_batch_size},postprocessing_instance_count:1
python fill_template.py --in_place ${ensemble_config_path} \
triton_max_batch_size:${triton_max_batch_size}
python fill_template.py --in_place ${bls_config_path} \
triton_max_batch_size:${triton_max_batch_size},decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
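Before launching, a quick sanity check can catch placeholders that fill_template.py did not substitute (assuming the templates use the usual ${...} placeholder syntax); an unfilled value in config.pbtxt tends to fail in non-obvious ways:
import pathlib
import re

repo = pathlib.Path("triton_model_dir")  # point this at your ${triton_model_dir}
for cfg in repo.rglob("config.pbtxt"):
    leftovers = re.findall(r"\$\{\w+\}", cfg.read_text())
    if leftovers:
        print(f"{cfg}: unfilled placeholders {sorted(set(leftovers))}")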
# Launch the server
python scripts/launch_triton_server.py --model_repo=${triton_model_dir} --world_size 2
Then I run a benchmark against the /v2/models/ensemble/generate_stream API
using the coroutine function:
https://github.com/vllm-project/vllm/blob/eefeb16464af5f3a61e3052d1a4128480bff7f47/benchmarks/backend_request_func.py#L102
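For reference, a minimal concurrency-sweep sketch in the same spirit (not the exact vLLM coroutine linked above). The payload fields text_input, max_tokens and stream are assumptions based on the ensemble schema in tensorrtllm_backend; adjust them to your config:
import asyncio
import time

import aiohttp

URL = "http://localhost:8000/v2/models/ensemble/generate_stream"

async def one_request(session, prompt):
    start = time.perf_counter()
    payload = {"text_input": prompt, "max_tokens": 512, "stream": True}
    async with session.post(URL, json=payload) as resp:
        async for _ in resp.content:  # drain the SSE stream
            pass
    return time.perf_counter() - start

async def sweep(concurrency, prompt="Hello"):
    async with aiohttp.ClientSession() as session:
        t0 = time.perf_counter()
        latencies = await asyncio.gather(
            *(one_request(session, prompt) for _ in range(concurrency)))
        wall = time.perf_counter() - t0
        print(f"concurrency={concurrency:3d}  "
              f"avg latency={sum(latencies) / len(latencies):.2f}s  "
              f"requests/s={concurrency / wall:.2f}")

# If requests/s stops scaling while latency keeps growing, the server is already
# saturated below the configured max_batch_size.
for c in (2, 8, 32, 64):
    asyncio.run(sweep(c))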
@byshiue Is there any update on this issue?
@VitalyPetrov Thanks for providing the details, I'll try to reproduce the issue. Are you using a 40GB or 80GB A100? Did you observe the actual runtime batch size?
A potential reason could be that the actual batch size is limited by GPU memory, so the batch size never reaches 64, and increasing max_batch_size therefore has no effect.
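One way to observe the actual runtime batch size is to scrape Triton's Prometheus metrics endpoint (port 8002 by default) while the benchmark is running. A rough sketch: nv_inference_count and nv_inference_exec_count are standard Triton metrics, while the model label "tensorrt_llm" is an assumption about the repo layout, and with inflight batching the ratio is only an approximation of the effective batch size:
import re
import urllib.request

text = urllib.request.urlopen("http://localhost:8002/metrics").read().decode()

def metric(name, model="tensorrt_llm"):
    m = re.search(rf'{name}{{[^}}]*model="{model}"[^}}]*}} ([0-9.e+]+)', text)
    return float(m.group(1)) if m else 0.0

count = metric("nv_inference_count")       # requests counted individually
execs = metric("nv_inference_exec_count")  # model executions (batched)
if execs:
    print(f"average runtime batch size ~ {count / execs:.1f}")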
@kaiyux we use the 80 GB version of the A100.
If your assumption were correct, there should already be a noticeable difference between two relatively low batch sizes (say, 2 and 8). However, even there we see no meaningful impact on LLM performance.
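For what it's worth, a back-of-envelope sketch of the memory-limit hypothesis for the vicuna-13b setup described above. Every number below (layer count, head count, head size, weight footprint) is an assumption for a 13B Llama-family model in fp16 with TP=2 on 80 GB A100s, not something measured on this system:
num_layers = 40          # assumed
num_kv_heads = 40        # assumed (no GQA)
head_dim = 128           # assumed
bytes_per_el = 2         # fp16
seq_len = 2048 + 1024    # max_input_len + max_output_len from the build

# K and V per token across all layers, split over 2 GPUs (tensor parallel)
kv_per_token_per_gpu = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el / 2
kv_gib_per_seq_per_gpu = kv_per_token_per_gpu * seq_len / 2**30

# Very rough free memory per GPU: 80 GB minus roughly half the fp16 weights
# (~13 GB for a 13B model), times kv_cache_free_gpu_mem_fraction = 0.9
free_gib = (80 - 13) * 0.9

print(f"KV cache per full-length sequence per GPU: {kv_gib_per_seq_per_gpu:.2f} GiB")
print(f"Full-length sequences that fit: {free_gib / kv_gib_per_seq_per_gpu:.0f}")
Under these assumptions only around 50 full-length sequences fit in the KV cache, which would be consistent with the memory-limit hypothesis; real sequences are usually shorter, so the runtime batch size reported by the metrics sketch above is the number to trust.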
System Info
Who can help?
@kaiyux
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I am following the official guide for serializing and serving Llama-based LLMs: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md
I am using a Vikhr 7B model, which is built on the LlamaForCausalLM architecture. The command for building the model is as follows:
After that I fill in the Triton config files and run the Triton server:
Launch the server
The problem is that there is no difference between two different batch sizes in terms of model performance (the number of seconds the LLM takes to respond on /v2/models/ensemble/generate). Comparing two values of max_batch_size, 64 and 128: the second requires significantly more VRAM but offers no performance gain over the smaller one.
Expected behavior
Greater values of max_batch_size require more VRAM but should lead to better inference performance.
Actual behavior
No difference in performance between the two values of max_batch_size.
Additional notes
I expected the difference to appear with wide context windows, but the tendency is the same.