triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Crashes for long context requests #381

Open Pernekhan opened 3 months ago

Pernekhan commented 3 months ago

trtllm crashes when I send long-context requests that are within the max-input-length limit.

I believe it happens when the tokens of the total pending requests reach the max-num-tokens limit. But why isn't it queuing the requests instead of crashing?

Here is the crash log:

terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: elts_per_rank % elts_per_thread == 0 (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu:324)
1       0x7fe15c26354a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fe15c265741 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x77a741) [0x7fe15c265741]
3       0x7fe15c3b1955 void tensorrt_llm::kernels::invokeOneOrTwoShotAllReduceKernel<__half>(tensorrt_llm::kernels::AllReduceParams&, tensorrt_llm::kernels::AllReduceStrategyType, CUstream_st*) + 117
4       0x7fe284521b8d tensorrt_llm::plugins::AllreducePlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 973
5       0x7fe117705ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7fe117705ba9]
6       0x7fe1176db6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7fe1176db6af]
7       0x7fe1176dd320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7fe1176dd320]
8       0x7fe15e147024 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
9       0x7fe15e14ad61 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1025
10      0x7fe15e14e598 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3736
11      0x7fe15e11d2f4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
12      0x7fe15e12452f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
13      0x7fe4a944f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fe4a944f253]
14      0x7fe4a91dfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fe4a91dfac3]
15      0x7fe4a9271660 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7fe4a9271660]
  what():  [TensorRT-LLM][ERROR] Assertion failed: elts_per_rank % elts_per_thread == 0 (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu:324)
1       0x7fb8a826354a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fb8a8265741 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x77a741) [0x7fb8a8265741]
3       0x7fb8a83b1955 void tensorrt_llm::kernels::invokeOneOrTwoShotAllReduceKernel<__half>(tensorrt_llm::kernels::AllReduceParams&, tensorrt_llm::kernels::AllReduceStrategyType, CUstream_st*) + 117
4       0x7fb9e80dcb8d tensorrt_llm::plugins::AllreducePlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 973
5       0x7fb863705ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7fb863705ba9]
6       0x7fb8636db6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7fb8636db6af]
7       0x7fb8636dd320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7fb8636dd320]
8       0x7fb8aa147024 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
9       0x7fb8aa14ad61 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1025
10      0x7fb8aa14e598 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3736
11      0x7fb8aa11d2f4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
12      0x7fb8aa12452f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
13      0x7fbbf424f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fbbf424f253]
14      0x7fbbf3fdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fbbf3fdfac3]
15      0x7fbbf4071660 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7fbbf4071660]
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: elts_per_rank % elts_per_thread == 0 (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu:324)
1       0x7f1e7426354a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7f1e74265741 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x77a741) [0x7f1e74265741]
3       0x7f1e743b1955 void tensorrt_llm::kernels::invokeOneOrTwoShotAllReduceKernel<__half>(tensorrt_llm::kernels::AllReduceParams&, tensorrt_llm::kernels::AllReduceStrategyType, CUstream_st*) + 117
4       0x7f1fb0280b8d tensorrt_llm::plugins::AllreducePlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 973
5       0x7f1e2f705ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7f1e2f705ba9]
6       0x7f1e2f6db6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7f1e2f6db6af]
7       0x7f1e2f6dd320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7f1e2f6dd320]
8       0x7f1e76147024 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
9       0x7f1e7614ad61 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1025
10      0x7f1e7614e598 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3736
11      0x7f1e7611d2f4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
12      0x7f1e7612452f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
13      0x7f21c024f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f21c024f253]
14      0x7f21bffdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f21bffdfac3]
15      0x7f21c0071660 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7f21c0071660]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[trt-mixtral-chat-0:3614237] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
Signal (15) received.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: elts_per_rank % elts_per_thread == 0 (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu:324)
1       0x7efc6426354a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7efc64265741 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x77a741) [0x7efc64265741]
3       0x7efc643b1955 void tensorrt_llm::kernels::invokeOneOrTwoShotAllReduceKernel<__half>(tensorrt_llm::kernels::AllReduceParams&, tensorrt_llm::kernels::AllReduceStrategyType, CUstream_st*) + 117
4       0x7efdb045fb8d tensorrt_llm::plugins::AllreducePlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 973
5       0x7efc1f705ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7efc1f705ba9]
6       0x7efc1f6db6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7efc1f6db6af]
7       0x7efc1f6dd320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7efc1f6dd320]
8       0x7efc66147024 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
9       0x7efc6614ad61 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1025
10      0x7efc6614e598 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3736
11      0x7efc6611d2f4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
12      0x7efc6612452f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
13      0x7effc0e4f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7effc0e4f253]
14      0x7effc0bdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7effc0bdfac3]
15      0x7effc0c71660 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7effc0c71660]
Signal (6) received.
[trt-mixtral-chat-0:3614237] 2 more processes have sent help message help-mpi-api.txt / mpi-abort
[trt-mixtral-chat-0:3614237] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Traceback (most recent call last):
  File "/app/scripts/launch_triton_server.py", line 89, in run_cmd
    subprocess.run(cmd, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['mpirun', '--allow-run-as-root', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--model-repository=/data/tgi-data/trt/mixtral-fp16-tp4-triton', '--log-verbose=3', '--log-file=triton_log.txt', '--grpc-port=8001', '--http-port=80', '--metrics-port=8002', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix0_', ':', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--model-repository=/data/tgi-data/trt/mixtral-fp16-tp4-triton', '--model-control-mode=explicit', '--load-model=tensorrt_llm', '--grpc-port=8001', '--http-port=80', '--metrics-port=8002', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix1_', ':', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--model-repository=/data/tgi-data/trt/mixtral-fp16-tp4-triton', '--model-control-mode=explicit', '--load-model=tensorrt_llm', '--grpc-port=8001', '--http-port=80', '--metrics-port=8002', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix2_', ':', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--model-repository=/data/tgi-data/trt/mixtral-fp16-tp4-triton', '--model-control-mode=explicit', '--load-model=tensorrt_llm', '--grpc-port=8001', '--http-port=80', '--metrics-port=8002', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix3_', ':']' returned non-zero exit status 1.
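
For background on what that assertion is checking (a rough sketch, not the actual kernel source): the custom all-reduce plugin splits the tensor being reduced across the tensor-parallel ranks, and each CUDA thread handles a fixed number of elements per vectorized load, so the per-rank slice has to divide evenly by that per-thread count. The Python sketch below only mirrors the shape of that check; the concrete numbers (8 fp16 elements per 16-byte load, an even split across ranks) are assumptions for illustration. When the check fails, the rank throws a TllmException and MPI_ABORT then kills the remaining ranks, which matches the log above.

# Illustrative sketch of the failing check; values are assumptions, not kernel code.
def check_custom_all_reduce(num_tokens: int, hidden_size: int, tp_size: int,
                            elts_per_thread: int = 8) -> None:
    total_elts = num_tokens * hidden_size   # fp16 elements in the all-reduce message
    elts_per_rank = total_elts // tp_size   # slice handled by each tensor-parallel rank
    if elts_per_rank % elts_per_thread != 0:
        # The real kernel throws tensorrt_llm::common::TllmException here,
        # terminating the rank and triggering MPI_ABORT across the group.
        raise RuntimeError(
            f"Assertion failed: elts_per_rank % elts_per_thread == 0 "
            f"({elts_per_rank} % {elts_per_thread} != 0)")

check_custom_all_reduce(num_tokens=8192, hidden_size=4096, tp_size=4)  # a "round" case passes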

cc: @kaiyux

schetlur-nv commented 3 months ago

Hey! You're right, it is expected to queue the requests. Can you please share the engine build command? Also the test script or command, if possible.

Pernekhan commented 3 months ago

Here is the engine build command:

trtllm-build --checkpoint_dir /data/tgi-data/trtllm/mixtral-8x7b-tp-4-converted/ --remove_input_padding enable --gpt_attention_plugin float16 --context_fmha enable --gemm_plugin float16 --output_dir /data/tgi-data/trtllm/mixtral-fp16-tp4-engine --paged_kv_cache enable --max_batch_size 64 --max_input_len 32768 --max_output_len 4096 --workers 4 --max_num_tokens 327680

This is just a simple script we used to make it crash:

echo; time curl -Z --parallel-max 64 http://localhost:8000/v2/models/ensemble/generate?[1-64] -d @8k-context-req.txt --output -; echo
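
As a rough sanity check on the max-num-tokens theory (assuming each request in 8k-context-req.txt carries about 8192 context tokens; that figure is inferred from the file name, not measured):

# Back-of-the-envelope token budget; tokens_per_request is an assumption.
max_num_tokens = 327680      # from the trtllm-build command above
tokens_per_request = 8192    # assumed ~8K context tokens per request
parallel_requests = 64       # --parallel-max 64 in the curl command

total_context_tokens = parallel_requests * tokens_per_request
print(total_context_tokens, total_context_tokens > max_num_tokens)  # 524288 True

So the 64 parallel requests cannot all be scheduled at once, and the excess should simply be queued, which is the behavior this issue is asking for.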

Pernekhan commented 3 months ago

Any updates on this, @schetlur-nv?

thorjohnsen commented 3 months ago

I will take a look at this.

silverriver commented 2 months ago

I have encountered a similar problem. My backend server crashes when the request concurrency is high. I posted the scripts I used in this issue:

https://github.com/triton-inference-server/tensorrtllm_backend/issues/392

thorjohnsen commented 2 months ago

@Pernekhan Can you post the script that you use to make it crash? The link you provided is local to your machine.

Pernekhan commented 2 months ago

@thorjohnsen here is the script and the request file attached.

echo; time curl -Z --parallel-max 64 http://localhost:8000/v2/models/ensemble/generate?[1-64] -d @8k-context-req.txt --output -; echo

Here is the file: 8k-context-req.txt. You can also try any of your own 8K-context requests.

thorjohnsen commented 2 months ago

Thank you @Pernekhan. Can you provide models/ensemble/config.pbtxt? Also, I am not too familiar with curl; is models/ensemble/generate a script? If so, could you please provide it as well?

Pernekhan commented 2 months ago

I used the configs from all_models/inflight_batcher_llm with batch_size 64.

Here is the script that does what curl is trying to do.

import requests
import concurrent.futures

# Define the URL
url = "http://localhost:8000/v2/models/ensemble/generate"

# Define the payload data file
payload_data_file = "8k-context-req.txt"

# Define the number of parallel requests
num_requests = 64

# Define a function to make the request
def make_request(url, data):
    response = requests.post(url, data=data)
    return response.text

# Load the payload data
with open(payload_data_file, 'rb') as file:
    data = file.read()

# Function to make parallel requests
def make_parallel_requests(url, data, num_requests):
    # Use one worker per request so all 64 are actually in flight at once;
    # the default ThreadPoolExecutor worker cap is lower than 64 on most machines.
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_requests) as executor:
        # Submit the requests
        futures = [executor.submit(make_request, url, data) for _ in range(num_requests)]
        # Wait for all requests to complete
        for future in concurrent.futures.as_completed(futures):
            try:
                response = future.result()
                print(response)
            except Exception as e:
                print(f"An error occurred: {e}")

# Make parallel requests
make_parallel_requests(url, data, num_requests)

Pernekhan commented 2 months ago

Hi @thorjohnsen were you able to reproduce the issue?

thorjohnsen commented 2 months ago

I am sorry, I was OOTO for a few days. I will resume work on this issue now.

thorjohnsen commented 2 months ago

I can confirm that I am able to reproduce the issue. Now to find the cause.

thorjohnsen commented 2 months ago

I agree with @Pernekhan that the crash likely happens when the total pending requests reach the max-num-tokens limit. The server runs fine as long as the number of parallel requests is low enough not to exceed the limit.
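
Following that reasoning, a rough estimate of how many ~8K-context requests fit under this engine's token budget (the per-request token count is an assumption, as above):

max_num_tokens = 327680
tokens_per_request = 8192   # assumed ~8K context tokens per request
print(max_num_tokens // tokens_per_request)  # 40, so 64 parallel requests exceed the budget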

thorjohnsen commented 2 months ago

I don't see a crash with Llama-v2-7b, so this issue might only affect MoE models.

thorjohnsen commented 2 months ago

A similar issue was reported internally by somebody at NVIDIA, and a fix is on the way. Daniel Stokes from our side will revisit this issue once that fix has been merged.

ekarmazin commented 1 hour ago

Any updates on this?

We are using v0.10.0 with the default BLS config from: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm

Same issue with 8K context requests.

Hardware: 8x A100 80GB; model: Llama 2 13B.

It is much more stable with KV cache reuse disabled, but significantly slower.

ekarmazin commented 41 minutes ago

This is the error we are getting for an 8K request with KV cache reuse enabled:

[TensorRT-LLM][ERROR] Encountered an error in forwardSync function: [TensorRT-LLM][ERROR] Assertion failed: blockedTokens.size() <= blockIds.size() (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:572)
1       0x7f84f02692b5 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x7f83fa32a9a0 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x6b89a0) [0x7f83fa32a9a0]
3       0x7f83fc27f0ee tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::releaseBlocks(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 574
4       0x7f83fc27f678 tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::removeSequence(int, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 296
5       0x7f83fc2abfa6 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::terminateRequest(std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&, bool) + 502
6       0x7f83fc2ae3cc tensorrt_llm::batch_manager::TrtGptModelInflightBatching::decoderSync(tensorrt_llm::batch_manager::ScheduledRequests const&, std::unique_ptr<tensorrt_llm::runtime::decoder_batch::Token const, std::default_delete<tensorrt_llm::runtime::decoder_batch::Token const> > const&) + 1724