triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

low performance at large concurrent requests #7548

Open seyunchoi opened 3 weeks ago

seyunchoi commented 3 weeks ago

Description

Low throughput under large numbers of concurrent requests.

concurrent requests    1        50        100
TensorRT-LLM           73.36    193.30    193.81
vLLM                   64.13    984.55    1246.50

Values are TPS (tokens per second). The results at 50 and 100 concurrent requests are nearly identical.

Triton Information

To Reproduce

Build the model:

python3 examples/llama/convert_checkpoint.py --model_dir /data/Meta-Llama-3-8B-Instruct \
            --output_dir ./tllm_checkpoint_1gpu_bf16 \
            --dtype bfloat16
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
            --output_dir /data/trt-Meta-Llama-3-8B-Instruct \
            --gpt_attention_plugin bfloat16 \
            --gemm_plugin bfloat16 \
            --max_batch_size 2048 \
            --max_input_len 4096 \
            --max_num_tokens 4096 \
            --multiple_profiles enable \
            --paged_kv_cache enable \
            --use_paged_context_fmha enable 

Run tritonserver:

git clone -b v0.11.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
mkdir -p repo/llama3
cp -r tensorrtllm_backend/all_models/inflight_batcher_llm/* repo/llama3/
cp ./trt-Meta-Llama-3-8B-Instruct/* repo/llama3/tensorrt_llm/1/

HF_LLAMA_MODEL="/data/Meta-Llama-3-8B-Instruct"
ENGINE_PATH="/data/repo/llama3/tensorrt_llm/1"
python3 tensorrtllm_backend/tools/fill_template.py -i repo/llama3/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:2048,preprocessing_instance_count:1
python3 tensorrtllm_backend/tools/fill_template.py -i repo/llama3/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:2048,postprocessing_instance_count:8
python3 tensorrtllm_backend/tools/fill_template.py -i repo/llama3/ensemble/config.pbtxt triton_max_batch_size:2048
python3 tensorrtllm_backend/tools/fill_template.py -i repo/llama3/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:2048,decoupled_mode:True,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.9,exclude_input_in_output:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:10000,enable_chunked_context:True,max_num_sequences:256
rm -r repo/llama3/tensorrt_llm_bls
docker run --rm -it --net host --gpus all \
  --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 \
  -v $(pwd):/data \
  nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 \
  tritonserver --model-repository=/data/repo/llama3 --backend-config=default-max-batch-size=2048
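
Before sending load, the server's readiness can be confirmed first (not part of the original report; this assumes Triton's default HTTP port 8000 and its standard health endpoint):

curl -sf localhost:8000/v2/health/ready && echo "server ready"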

Send requests to /v2/models/ensemble/generate.
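
For reference, a rough sketch of how such a load can be generated (not the exact client behind the numbers above; the JSON field names text_input and max_tokens are assumed from the inflight_batcher_llm ensemble templates and may need adjusting):

# Fires CONCURRENCY requests in parallel and reports wall time; dividing the
# total number of generated tokens by this wall time gives a TPS figure.
# Counting the generated tokens exactly would need the model tokenizer.
CONCURRENCY=100
URL=http://localhost:8000/v2/models/ensemble/generate
PAYLOAD='{"text_input": "Explain paged KV cache in one paragraph.", "max_tokens": 256}'

time seq ${CONCURRENCY} | xargs -P ${CONCURRENCY} -I{} \
  curl -s -X POST ${URL} -d "${PAYLOAD}" -o /dev/null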

Expected behavior

The TensorRT-LLM results are expected to be faster, and throughput at 100 concurrent requests should be higher than at 50.

janpetrov commented 3 weeks ago

Would you try increasing max_num_tokens? https://nvidia.github.io/TensorRT-LLM/performance/perf-best-practices.html It would also be kind of you to later describe how performance depends on the value of max_num_tokens (it has an optimal value, which is likely well above 4096, but it is definitely possible to overshoot).
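
For what it's worth, such a sweep could look roughly like the following, reusing the trtllm-build invocation from the reproduction above (the candidate values and the per-value output directories are only illustrative):

# Build one engine per candidate max_num_tokens value so each can be benchmarked separately.
for MNT in 4096 8192 16384 32768; do
  trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
              --output_dir /data/trt-Meta-Llama-3-8B-Instruct-mnt${MNT} \
              --gpt_attention_plugin bfloat16 \
              --gemm_plugin bfloat16 \
              --max_batch_size 2048 \
              --max_input_len 4096 \
              --max_num_tokens ${MNT} \
              --multiple_profiles enable \
              --paged_kv_cache enable \
              --use_paged_context_fmha enable
done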

seyunchoi commented 2 weeks ago

Tested again after increasing max_num_tokens (4096 → 409600).

Build the model:

python3 examples/llama/convert_checkpoint.py --model_dir /data/Meta-Llama-3-8B-Instruct \
            --output_dir ./tllm_checkpoint_1gpu_bf16 \
            --dtype bfloat16
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
            --output_dir /data/trt-Meta-Llama-3-8B-Instruct \
            --gpt_attention_plugin bfloat16 \
            --gemm_plugin bfloat16 \
            --max_batch_size 2048 \
            --max_input_len 4096 \
            --max_num_tokens 409600 \
            --multiple_profiles enable \
            --paged_kv_cache enable \
            --use_paged_context_fmha enable 

When increasing max_num_tokens, we observed that GPU memory usage increased (more KV cache is used):

max_num_tokens    GPU Memory Usage
4096              25590 MiB
409600            64784 MiB
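
One way such numbers can be sampled while the server is under load, assuming nvidia-smi is available on the host:

# Print used GPU memory once per second while the server is running.
nvidia-smi --query-gpu=memory.used --format=csv -l 1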

This is the new TensorRT-LLM result with max_num_tokens increased to 409600, tested only at 100 concurrent requests by sending requests to /v2/models/ensemble/generate:

                          concurrent requests 100
TensorRT-LLM (4096)       193.81
TensorRT-LLM (409600)     184.27
vLLM                      1246.50

Looking at the results, there seems to be some other problem.

manickavela29 commented 1 week ago

Facing a similar issue comparing Triton Server with the vLLM and TRT-LLM backends, with 24.07.

One observation made with --log-verbose=1 while running Triton Server at 100 concurrency: Generation Requests / Scheduled Requests is only 5. Is that alright?

I0831 18:57:08.685678 1 model_instance_state.cc:969] "{\"Active Request Count\":99,\"Iteration Counter\":392,\"Max Request Count\":256,\"Runtime CPU Memory Usage\":90260,\"Runtime GPU Memory Usage\":2045966240,\"Runtime Pinned Memory Usage\":562149636,\"Timestamp\":\"08-31-2024 18:57:08\",\"Context Requests\":0,\"Generation Requests\":5,\"MicroBatch ID\":0,\"Paused Requests\":0,\"Scheduled Requests\":5,\"Total Context Tokens\":0,\"Free KV cache blocks\":9,\"Max KV cache blocks\":40,\"Tokens per KV cache block\":64,\"Used KV cache blocks\":31}"

Also observing that the client receives responses in groups of ~5, so inference is happening with only 5 requests at a time; this suggests that Triton Server is not handling the concurrency correctly.

Also tried the tensorrt_llm Triton config file with different queue-delay parameters and max-batch-size values, and built the TRT engine with a similar max_batch_size, but it didn't help:

name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 100
model_transaction_policy {
  decoupled: False
}
dynamic_batching {
    max_queue_delay_microseconds: 1000000
}
parameters {
  key: "max_batch_size"
  value: {string_value: "100"}
}

Experimented with various dynamic_batching strategies and parameters, but it doesn't help.

If the issue I am facing is totally different, I will create a new issue.

manickavela29 commented 1 week ago

Could the tokenizer or another component of the stack be a bottleneck? Similar to https://github.com/triton-inference-server/server/issues/6894?
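
One rough way to check that, as a sketch rather than something run in this thread: time the HF tokenizer on its own, using the model directory from the reproduction above. If a single encode/decode round trip is in the low-millisecond range, pre/postprocessing is unlikely to account for a ~6x throughput gap.

python3 - <<'EOF'
import time
from transformers import AutoTokenizer

# Model directory taken from HF_LLAMA_MODEL in the reproduction steps above.
tok = AutoTokenizer.from_pretrained("/data/Meta-Llama-3-8B-Instruct")
prompt = "Explain paged KV cache in one paragraph. " * 8

start = time.perf_counter()
for _ in range(100):
    ids = tok(prompt)["input_ids"]
    text = tok.decode(ids)
elapsed = time.perf_counter() - start
print(f"{len(ids)} tokens, {elapsed / 100 * 1000:.2f} ms per encode+decode")
EOF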

seyunchoi commented 1 week ago

It looks like the same problem, @manickavela29:

I0903 07:34:54.755164 1 model_instance_state.cc:969] "{\"Active Request Count\":80,\"Iteration Counter\":14522,\"Max Request Count\":2048,\"Runtime CPU Memory Usage\":721044,\"Runtime GPU Memory Usage\":50039058456,\"Runtime Pinned Memory Usage\":739100676,\"Timestamp\":\"09-03-2024 07:34:54\",\"Context Requests\":0,\"Generation Requests\":3,\"MicroBatch ID\":0,\"Paused Requests\":0,\"Scheduled Requests\":3,\"Total Context Tokens\":0,\"Free KV cache blocks\":24,\"Max KV cache blocks\":40,\"Tokens per KV cache block\":64,\"Used KV cache blocks\":16}"