triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Server failure when running several model instances on one GPU #351

Open hawkeoni opened 6 months ago

hawkeoni commented 6 months ago

Who can help?

@byshiue @schetlur-nv

Reproduction

The problem is that when running several model instances on one GPU (in one container or in different containers), one of the instances fails with a CUDA error. I've found a setup that allows me to reliably reproduce it with an open-source model and the scripts from this repository.

  1. Take the model https://huggingface.co/NousResearch/Nous-Hermes-Llama2-13b and save it locally:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Download the tokenizer and weights and save them to a local directory
    # that build.py reads in the next step.
    tok = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-Llama2-13b")
    tok.save_pretrained("llama-13b")
    model = AutoModelForCausalLM.from_pretrained("NousResearch/Nous-Hermes-Llama2-13b")
    model.save_pretrained("llama-13b")
  2. Convert the model inside the v0.7.1 container using examples/llama/build.py:

    MODEL_PATH=llama-13b
    ENGINE_INNER_PATH=llama-13b-engine
    CUDA_VISIBLE_DEVICES=1 python3 /app/tensorrt_llm/examples/llama/build.py \
    --model_dir ${MODEL_PATH} \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --max_batch_size 4 \
    --output_dir ${ENGINE_INNER_PATH} \
    --world_size 1 \
    --tp_size 1
  3. Launch the server twice on the same GPU. You can do it twice in the same container or in two different containers; I've reproduced it both ways.

    # First tmux session / container
    CUDA_VISIBLE_DEVICES=1 mpirun --allow-run-as-root -n 1 /opt/tritonserver/bin/tritonserver --model-repository=/app/triton-pipeline --disable-auto-complete-config --allow-metrics 0 --allow-grpc 0 --http-port 8543 --backend-config=python,shm-region-prefix-name=prefix8543_ : 2>&1 | tee log1.txt
    # Second session
    CUDA_VISIBLE_DEVICES=1 mpirun --allow-run-as-root -n 1 /opt/tritonserver/bin/tritonserver --model-repository=/app/triton-pipeline --disable-auto-complete-config --allow-metrics 0 --allow-grpc 0 --http-port 8544 --backend-config=python,shm-region-prefix-name=prefix8544_ : 2>&1 | tee log2.txt

    The directory /app/triton-pipeline is attached as the archive triton-pipeline.tar.gz. You now have two models running on ports 8543 and 8544.

  4. Launch the load tests from https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/tools/gpt/benchmark_core_model.py. They are currently broken because some parameters are now uint32 instead of int32, so I made a patch named utils_diff.patch, which should be applied to https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/tools/utils/utils.py (the kind of dtype change involved is sketched right after this list).
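
To give a feel for the dtype issue without opening the patch file: the snippet below is only my illustration, not the contents of utils_diff.patch; the helper name prepare_tensor and the input name request_output_len are assumptions based on the usual client pattern in this repo's tools.

    # Illustrative sketch only: scalar inputs must be sent as uint32, not int32.
    import numpy as np
    import tritonclient.http as httpclient
    from tritonclient.utils import np_to_triton_dtype

    def prepare_tensor(name, value):
        # np_to_triton_dtype maps np.uint32 -> "UINT32", np.int32 -> "INT32", etc.
        t = httpclient.InferInput(name, list(value.shape), np_to_triton_dtype(value.dtype))
        t.set_data_from_numpy(value)
        return t

    # Previously dtype=np.int32; newer backends expect UINT32 here.
    output_len = np.array([[100]], dtype=np.uint32)
    inputs = [prepare_tensor("request_output_len", output_len)]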

Since the scripts are not made for load testing, I had to use a small workaround:

# instance 1
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14
do
        python3 benchmark_core_model.py -c 16 -topk 40 -topp 1 -o 100 -u localhost:8543 -b 1 -n 10 &
done
# instance 2
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14
do
        python3 benchmark_core_model.py -c 16 -topk 40 -topp 1 -o 100 -u localhost:8544 -b 1 -n 10 &
done
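
As a quick sanity check (not part of the original repro), you can confirm that both servers are actually serving before firing the load; the model name "ensemble" below is an assumption about the pipeline in triton-pipeline.tar.gz.

    # Sanity check: verify both Triton instances on this GPU respond.
    import tritonclient.http as httpclient

    for port in (8543, 8544):
        client = httpclient.InferenceServerClient(url=f"localhost:{port}")
        print(port, "server ready:", client.is_server_ready())
        # "ensemble" is assumed to be the entry-point model of the pipeline.
        print(port, "ensemble ready:", client.is_model_ready("ensemble"))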

After 1 to 5 minutes, one of the servers generally crashes with an error:

[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaFreeAsync(ptr, mCudaStream->get()): unspecified launch failure (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:117)
....
[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaEventSynchronize(get()): unspecified launch failure (/app/tensorrt_llm/cpp/include/tensorrt_llm/runtime/cudaEvent.h:66)
...
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaFreeHost(ptr): unspecified launch failure (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:140)

I've attached the full log as log.txt; it was gathered with the --log-verbose 3 option.

Once again I'm listing the files that I've attached:

  1. triton-pipeline.tar.gz - a folder with the model configuration I use to reproduce this issue
  2. utils_diff.patch - a patch for utils.py that fixes the outdated datatypes so benchmark_core_model.py can be used
  3. log.txt - the full failure log for your convenience. It was written with --log-verbose 3, so there is a lot of information; you can find the failure at the end of the file or by searching for the first ERROR on line 11130.

Expected behavior

I expected that running separate instances of the model would be independent of each other and would not lead to any runtime failures.

Actual behavior

In this setup, one of the instances randomly fails after a couple of minutes once the request load reaches a certain intensity.

Additional notes

I've checked this both within a single Docker container and across two separate containers. I initially found the issue with fp8 inference, but decided to reproduce it in fp16 to simplify the investigation. The failure also reproduces on a version of the code from November as well as on v0.7.1; I haven't tested other versions.

ekarmazin commented 5 months ago

We found the same issue on v0.8.0. Our solution was to dedicate a GPU per container.

hawkeoni commented 5 months ago

@byshiue @schetlur-nv any chance you'll take a look at the issue anytime soon?

hawkeoni commented 2 months ago

Maybe this will help resolve the issue, or help anyone else who runs into it: the failure happens on drivers 535.54.03 and 535.129.03, on both SXM and PCIe setups. It also occurs on various TensorRT-LLM versions, v0.8.0 and v0.10.0 (the latest at the time of writing).

Updating the driver to 550.90.07 fixed it with both TensorRT-LLM versions.
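
In case it helps anyone checking whether they are on an affected driver, here is a quick way to query it (my own snippet, assuming nvidia-smi is available in the container):

    # Print the installed NVIDIA driver version, e.g. "550.90.07".
    import subprocess

    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout.strip())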