triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

How to enable batching correctly? #360

Closed lwbmowgli closed 7 months ago

lwbmowgli commented 7 months ago

When I run inference with xverse-13b (HTTP POST requests) at a concurrency of 4, I think I have enabled dynamic batching, so nv_inference_exec_count for the model should be 1. However, the value returned is 4, equal to nv_inference_count, and the inference time also increases linearly with concurrency. Below are my configuration and files.

python3 build_xverse.py --model_dir stack-xverse-13b \
    --dtype float16 \
    --paged_kv_cache \
    --max_batch_size 16 \
    --max_input_len 1024 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --enable_context_fmha \
    --output_dir VERSE-13B_triton_inflight_1024_batch16 \
    --world_size 8 \
    --tp_size 8

build_xverse.py is https://github.com/NVIDIA/TensorRT-LLM/blob/v0.7.1/examples/llama/build.py
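To double-check what the build produced, a minimal sketch (it assumes the build writes a config.json into the output directory, as the TensorRT-LLM 0.7.x example build scripts do; the JSON layout varies by version, so this just searches for batch- and length-related keys):

```python
# Sketch: print the batch-related settings recorded in the engine's config.json.
# The path and JSON layout are assumptions; adjust to your output_dir.
import json

with open("VERSE-13B_triton_inflight_1024_batch16/config.json") as f:
    cfg = json.load(f)

def find_keys(obj, needle, prefix=""):
    """Recursively print every key whose name contains `needle`."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            path = f"{prefix}.{k}" if prefix else k
            if needle in k:
                print(path, "=", v)
            find_keys(v, needle, path)
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            find_keys(v, needle, f"{prefix}[{i}]")

find_keys(cfg, "batch")      # e.g. max_batch_size
find_keys(cfg, "input_len")  # e.g. max_input_len
```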

cp -r all_models/inflight_batcher_llm/ xverse13b_test_inflight_bs16

python3 tools/fill_template.py -i xverse13b_test_inflight_bs16/preprocessing/config.pbtxt tokenizer_dir:/mnt/liwenbo_workspace/trt_engine/XVERSE-7B-Chat,tokenizer_type:auto,triton_max_batch_size:16,preprocessing_instance_count:1
python3 tools/fill_template.py -i xverse13b_test_inflight_bs16/postprocessing/config.pbtxt tokenizer_dir:/mnt/liwenbo_workspace/trt_engine/XVERSE-7B-Chat,tokenizer_type:auto,triton_max_batch_size:16,postprocessing_instance_count:1
python3 tools/fill_template.py -i xverse13b_test_inflight_bs16/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:16,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i xverse13b_test_inflight_bs16/ensemble/config.pbtxt triton_max_batch_size:16
python3 tools/fill_template.py -i xverse13b_test_inflight_bs16/tensorrt_llm/config.pbtxt triton_max_batch_size:16,decoupled_mode:False,max_beam_width:1,engine_dir:XVERSE13b_triton_inflight_batch8,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:inflight_batching,max_queue_delay_microseconds:600
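A minimal sketch to confirm the rendered tensorrt_llm/config.pbtxt actually picked up the batching-related values above (the exact parameter names, e.g. gpt_model_type, depend on the backend version, so this just greps for likely substrings):

```python
# Sketch: grep the rendered Triton config for batching-related settings.
# The path assumes the model repository created above; key names vary by version.
from pathlib import Path

cfg = Path("xverse13b_test_inflight_bs16/tensorrt_llm/config.pbtxt").read_text()
for needle in ("max_batch_size", "gpt_model_type", "batching", "queue_delay"):
    hits = [line.strip() for line in cfg.splitlines() if needle in line]
    print(f"{needle}: {hits if hits else 'NOT FOUND'}")
```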

[image: Triton metrics output] I repeat the request 10 times at a concurrency of 4. According to https://github.com/triton-inference-server/server/blob/5630efe6e0d8b74b72a793678722f89112240f76/docs/user_guide/metrics.md, the two counters shown in the image above should not be equal when dynamic batching is working. So my question is: how do I enable batching correctly? I think I have enabled it through the configuration above.
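For reference, one minimal way to reproduce this measurement: fire concurrent POSTs and then scrape the two counters from the metrics endpoint. This is only a sketch; the ports (8000 for HTTP, 8002 for metrics), the ensemble model name, and the text_input/max_tokens/bad_words/stop_words request fields are assumptions based on the default tensorrtllm_backend setup and should be adjusted to the actual deployment.

```python
# Sketch: send concurrent requests, then compare nv_inference_count and
# nv_inference_exec_count from Triton's Prometheus metrics endpoint.
# Ports, model name, and request fields are assumptions; adjust as needed.
import concurrent.futures
import json
import urllib.request

GENERATE_URL = "http://localhost:8000/v2/models/ensemble/generate"
METRICS_URL = "http://localhost:8002/metrics"
CONCURRENCY = 4

def one_request(prompt):
    body = json.dumps({
        "text_input": prompt,
        "max_tokens": 64,
        "bad_words": "",
        "stop_words": "",
    }).encode()
    req = urllib.request.Request(
        GENERATE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Fire CONCURRENCY requests at once so they can actually be queued together.
with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    list(pool.map(one_request, ["Hello, who are you?"] * CONCURRENCY))

# If requests were batched, nv_inference_exec_count should grow more slowly
# than nv_inference_count for the batched model.
metrics = urllib.request.urlopen(METRICS_URL).read().decode()
for line in metrics.splitlines():
    if line.startswith(("nv_inference_count", "nv_inference_exec_count")):
        print(line)
```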

Environment information: Docker image nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3, pip install tensorrt_llm==0.7.1, tensorrtllm_backend==0.7.1

lwbmowgli commented 7 months ago

I think we have the same problem as https://github.com/triton-inference-server/tensorrtllm_backend/issues/333

youzhedian commented 7 months ago

I have the same problem.