Pernekhan opened this issue 3 months ago
Hey! You're right, it is expected to queue the requests. Can you share the engine build command please? Also, the test script or command if possible.
Here is the engine build command:
trtllm-build --checkpoint_dir /data/tgi-data/trtllm/mixtral-8x7b-tp-4-converted/ --remove_input_padding enable --gpt_attention_plugin float16 --context_fmha enable --gemm_plugin float16 --output_dir /data/tgi-data/trtllm/mixtral-fp16-tp4-engine --paged_kv_cache enable --max_batch_size 64 --max_input_len 32768 --max_output_len 4096 --workers 4 --max_num_tokens 327680
This is just a simple script we used to make it crash:
echo; time curl -Z --parallel-max 64 http://localhost:8000/v2/models/ensemble/generate?[1-64] -d @8k-context-req.txt --output -; echo
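For anyone trying to reproduce this without the attached file: a payload of the same shape can be generated with the sketch below (the field names follow the usual generate-endpoint schema for this backend, but the filler prompt and token count are stand-ins, not the contents of the original 8k-context-req.txt).

import json

# Hypothetical stand-in for 8k-context-req.txt: one generate request with a long prompt.
# The exact token count depends on the tokenizer; ~800 repetitions of this sentence
# lands in the neighborhood of 8k tokens.
prompt = "All work and no play makes Jack a dull boy. " * 800
payload = {
    "text_input": prompt,
    "max_tokens": 256,
    "bad_words": "",
    "stop_words": "",
}

with open("8k-context-req.txt", "w") as f:
    json.dump(payload, f)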
Any updates on this, @schetlur-nv?
I will take a look at this.
I have encountered a similar problem. My backend server crashes when the request concurrency is high. I posted the scripts I used in this issue:
https://github.com/triton-inference-server/tensorrtllm_backend/issues/392
@Pernekhan Can you post the script that you use to make it crash? The link you provided is local to your machine.
@thorjohnsen Here is the script, with the request file attached.
echo; time curl -Z --parallel-max 64 http://localhost:8000/v2/models/ensemble/generate?[1-64] -d @8k-context-req.txt --output -; echo
Here is the file 8k-context-req.txt (attached). You can also try any of your own 8k-context requests.
Thank you @Pernekhan, can you provide models/ensemble/config.pbtxt? Also, I am not too familiar with curl; is models/ensemble/generate a script? If so, could you provide it?
I used the configs from all_models/inflight_batcher_llm with batch_size 64.
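For reference, the settings in that template that seem most relevant here look roughly like the sketch below (reconstructed from memory of the tensorrt_llm model's config.pbtxt in all_models/inflight_batcher_llm; the values shown are assumptions, not the configuration actually used in this report).

max_batch_size: 64
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "guaranteed_no_evict"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.9"
  }
}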
Here is the script that does what curl is trying to do.
import requests
import concurrent.futures

# Define the URL
url = "http://localhost:8000/v2/models/ensemble/generate"

# Define the payload data file
payload_data_file = "8k-context-req.txt"

# Define the number of parallel requests
num_requests = 64

# Define a function to make the request
def make_request(url, data):
    response = requests.post(url, data=data)
    return response.text

# Load the payload data
with open(payload_data_file, 'rb') as file:
    data = file.read()

# Function to make parallel requests
def make_parallel_requests(url, data, num_requests):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Submit the requests
        futures = [executor.submit(make_request, url, data) for _ in range(num_requests)]
        # Wait for all requests to complete
        for future in concurrent.futures.as_completed(futures):
            try:
                response = future.result()
                print(response)
            except Exception as e:
                print(f"An error occurred: {e}")

# Make parallel requests
make_parallel_requests(url, data, num_requests)
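One caveat with the script above: on Python 3.8+ the default ThreadPoolExecutor worker count is min(32, os.cpu_count() + 4), so with the defaults it may keep only about 32 requests in flight at a time rather than 64. To match the 64-way parallel curl command more closely, the worker count can be passed explicitly, for example:

with concurrent.futures.ThreadPoolExecutor(max_workers=num_requests) as executor:
    futures = [executor.submit(make_request, url, data) for _ in range(num_requests)]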
Hi @thorjohnsen, were you able to reproduce the issue?
I am sorry, I was OOTO for a few days. I will resume work on this issue now.
I can confirm that I am able to reproduce the issue. Now to find the cause.
I agree with @Pernekhan that the crash likely happens when the total pending requests reach the max-num-tokens limit. The server runs fine as long as the number of parallel requests is low enough not to exceed that limit.
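For a rough sanity check against the build settings above: 64 parallel requests at roughly 8,192 input tokens each is about 64 × 8,192 = 524,288 context tokens in flight, well above max_num_tokens = 327,680, which is consistent with the crash only appearing once concurrency is high enough (the 8,192-token figure is an estimate based on the 8k-context payload, not a measured count).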
I don't see a crash with Llama-v2-7b, so this issue might only affect MoE models.
A similar issue was reported internally by somebody at NVIDIA, and a fix is on the way. Daniel Stokes from our side will revisit this issue once that fix has been merged.
Any updates on this?
We are using v0.10.0 with the default BLS config from: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm
Same issue with 8K context requests.
8x A100 80G; the model is Llama 2 13B.
It is much more stable with KV cache reuse disabled (see the config snippet after the stack trace below), but significantly slower.
This is the error we are getting with 8K requests when KV cache reuse is enabled:
[TensorRT-LLM][ERROR] Encountered an error in forwardSync function: [TensorRT-LLM][ERROR] Assertion failed: blockedTokens.size() <= blockIds.size() (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:572)
1 0x7f84f02692b5 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2 0x7f83fa32a9a0 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x6b89a0) [0x7f83fa32a9a0]
3 0x7f83fc27f0ee tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::releaseBlocks(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 574
4 0x7f83fc27f678 tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::removeSequence(int, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 296
5 0x7f83fc2abfa6 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::terminateRequest(std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&, bool) + 502
6 0x7f83fc2ae3cc tensorrt_llm::batch_manager::TrtGptModelInflightBatching::decoderSync(tensorrt_llm::batch_manager::ScheduledRequests const&, std::unique_ptr<tensorrt_llm::runtime::decoder_batch::Token const, std::default_delete<tensorrt_llm::runtime::decoder_batch::Token const> > const&) + 1724
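Regarding the KV cache reuse workaround mentioned above: in the tensorrt_llm model's config.pbtxt from all_models/inflight_batcher_llm, reuse is controlled by the enable_kv_cache_reuse parameter; the snippet below is a sketch from memory of what the disabled setting looks like and should be checked against the template in the repo.

parameters: {
  key: "enable_kv_cache_reuse"
  value: {
    string_value: "false"
  }
}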
trtllm crashes when I give long context requests within the max-input-length limit. I believe it happens when total pending requests reach the max-num-tokens limit. But why is it not queuing requests instead of crashing? Here is the crash log:
cc: @kaiyux