seyunchoi opened this issue 3 weeks ago
Would you try increasing max_num_tokens? https://nvidia.github.io/TensorRT-LLM/performance/perf-best-practices.html It would also be kind of you if you later described the performance as a function of max_num_tokens (it has some optimal value, which is likely well above 4096, but it is definitely possible to overshoot).
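If it helps, such a sweep could be scripted roughly like this (an untested sketch; the candidate values are arbitrary examples and the flags simply mirror the build command in the reply below):

import subprocess

# Sketch: rebuild the engine for each candidate max_num_tokens, then benchmark
# each engine separately. Candidate values are arbitrary; the other flags
# mirror the trtllm-build command shown in the reply below.
for max_num_tokens in (4096, 8192, 16384, 32768):
    out_dir = f"/data/trt-Meta-Llama-3-8B-Instruct-{max_num_tokens}"
    subprocess.run(
        [
            "trtllm-build",
            "--checkpoint_dir", "./tllm_checkpoint_1gpu_bf16",
            "--output_dir", out_dir,
            "--gpt_attention_plugin", "bfloat16",
            "--gemm_plugin", "bfloat16",
            "--max_batch_size", "2048",
            "--max_input_len", "4096",
            "--max_num_tokens", str(max_num_tokens),
            "--paged_kv_cache", "enable",
        ],
        check=True,
    )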
Tested again after increasing max_num_tokens (4096 to 409600).
build model
python3 examples/llama/convert_checkpoint.py --model_dir /data/Meta-Llama-3-8B-Instruct \
--output_dir ./tllm_checkpoint_1gpu_bf16 \
--dtype bfloat16
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
--output_dir /data/trt-Meta-Llama-3-8B-Instruct \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16 \
--max_batch_size 2048 \
--max_input_len 4096 \
--max_num_tokens 409600 \
--multiple_profiles enable \
--paged_kv_cache enable \
--use_paged_context_fmha enable
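To double-check what limits the engine was actually built with, the config.json written next to the engine can be inspected. A small sketch follows; the path reuses the --output_dir above, and the key layout is an assumption based on recent TensorRT-LLM releases, so it may differ in other versions.

import json

# Sketch: print the build limits recorded alongside the generated engine.
# Path reuses --output_dir above; key names are assumptions and may vary
# across TensorRT-LLM versions.
with open("/data/trt-Meta-Llama-3-8B-Instruct/config.json") as f:
    cfg = json.load(f)
build = cfg.get("build_config", cfg)  # older versions may store keys at top level
for key in ("max_batch_size", "max_input_len", "max_num_tokens"):
    print(key, build.get(key))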
When increasing max_num_tokens, we confirmed that VRAM usage increased (more KV cache is allocated):
| max_num_tokens | GPU Memory Usage |
| --- | --- |
| 4096 | 25590 MiB |
| 409600 | 64784 MiB |
This is the new TensorRT-LLM result with max_num_tokens increased to 409600, tested only at 100 concurrent requests by sending requests to /v2/models/ensemble/generate:
| Configuration | TPS at 100 concurrent requests |
| --- | --- |
| TensorRT-LLM (max_num_tokens 4096) | 193.81 |
| TensorRT-LLM (max_num_tokens 409600) | 184.27 |
| vLLM | 1246.50 |
Looking at these results, it seems there is some other problem.
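For reference, this is roughly the kind of load-test client used to produce the TPS numbers above (a sketch: the URL/port and the text_input/max_tokens/bad_words/stop_words fields are assumptions based on the default tensorrtllm_backend ensemble, so adjust them to your deployment):

import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Load-test sketch for the generate endpoint. URL, port, and payload fields
# are assumptions based on the default tensorrtllm_backend ensemble.
URL = "http://localhost:8000/v2/models/ensemble/generate"
CONCURRENCY = 100
NUM_REQUESTS = 500
MAX_TOKENS = 128

def one_request(_):
    payload = {"text_input": "What is machine learning?",
               "max_tokens": MAX_TOKENS, "bad_words": "", "stop_words": ""}
    r = requests.post(URL, json=payload)
    r.raise_for_status()
    return r.json()

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    list(pool.map(one_request, range(NUM_REQUESTS)))
elapsed = time.time() - start

# Rough throughput estimate: assumes every request generates about MAX_TOKENS tokens.
print(f"{NUM_REQUESTS / elapsed:.2f} req/s, ~{NUM_REQUESTS * MAX_TOKENS / elapsed:.2f} tok/s")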
Facing a similar issue comparing triton-server with the vLLM and TRT-LLM backends, using 24.07.
One observation made with --log-verbose=1 while triton-server runs at 100 concurrency: the Generation Requests / Scheduled Requests count is only 5. Is that alright?
I0831 18:57:08.685678 1 model_instance_state.cc:969] "{\"Active Request Count\":99,\"Iteration Counter\":392,\"Max Request Count\":256,\"Runtime CPU Memory Usage\":90260,\"Runtime GPU Memory Usage\":2045966240,\"Runtime Pinned Memory Usage\":562149636,\"Timestamp\":\"08-31-2024 18:57:08\",\"Context Requests\":0,\"Generation Requests\":5,\"MicroBatch ID\":0,\"Paused Requests\":0,\"Scheduled Requests\":5,\"Total Context Tokens\":0,\"Free KV cache blocks\":9,\"Max KV cache blocks\":40,\"Tokens per KV cache block\":64,\"Used KV cache blocks\":31}"
Also observing that the client receives responses in groups of about 5, so inference is happening with only ~5 requests at a time; from this I conclude that the Triton server is not handling concurrency correctly.
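To watch how many requests the in-flight batcher actually schedules over time, the stats lines from --log-verbose=1 can be parsed with a small helper (a sketch, assuming the stats JSON appears as an escaped string on the model_instance_state.cc lines, as in the snippet above):

import json
import re
import sys

# Sketch: extract scheduler stats from `--log-verbose=1` output piped on stdin.
# Assumes the stats JSON is printed as an escaped string on
# model_instance_state.cc log lines, as in the snippet above.
STATS_LINE = re.compile(r'model_instance_state\.cc:\d+\] "(.*)"')

for line in sys.stdin:
    m = STATS_LINE.search(line)
    if not m:
        continue
    stats = json.loads(m.group(1).replace('\\"', '"'))
    print("iter", stats["Iteration Counter"],
          "active", stats["Active Request Count"],
          "scheduled", stats["Scheduled Requests"],
          "kv blocks", f'{stats["Used KV cache blocks"]}/{stats["Max KV cache blocks"]}')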
Also tried this with the tensorrt_llm Triton config file, using different queue delay parameters and max_batch_size values, and built the TRT engine with a matching max_batch_size, but it didn't help:
name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 100
model_transaction_policy {
decoupled: False
}
dynamic_batching {
max_queue_delay_microseconds: 1000000
}
parameters {
key: "max_batch_size"
value: {string_value: "100"}
}
Experimented with various dynamic_batching strategy parameters, but it doesn't help.
If the issue I am facing is totally different, I will create a new issue.
Could the tokenizer or another component of the stack be a bottleneck? Similar to https://github.com/triton-inference-server/server/issues/6894?
It looks like the same problem, @manickavela29:
I0903 07:34:54.755164 1 model_instance_state.cc:969] "{\"Active Request Count\":80,\"Iteration Counter\":14522,\"Max Request Count\":2048,\"Runtime CPU Memory Usage\":721044,\"Runtime GPU Memory Usage\":50039058456,\"Runtime Pinned Memory Usage\":739100676,\"Timestamp\":\"09-03-2024 07:34:54\",\"Context Requests\":0,\"Generation Requests\":3,\"MicroBatch ID\":0,\"Paused Requests\":0,\"Scheduled Requests\":3,\"Total Context Tokens\":0,\"Free KV cache blocks\":24,\"Max KV cache blocks\":40,\"Tokens per KV cache block\":64,\"Used KV cache blocks\":16}"
Description
Low speed with a large number of concurrent requests. The values reported are TPS (tokens per second); the results for 50 and 100 concurrent requests are similar.
Triton Information
To Reproduce
Build the model, run tritonserver, and send requests to /v2/models/ensemble/generate.
Expected behavior
The TensorRT-LLM results were expected to be faster, and 100 concurrent requests should give higher throughput than 50 concurrent requests.