We found the same issue on v0.8.0. The solution was to dedicate a GPU per container.
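For anyone who wants to try that workaround, here is a minimal sketch, assuming two tritonserver processes pinned to separate GPUs via CUDA_VISIBLE_DEVICES; the model-repository path and port numbers are assumptions, not the exact setup from this issue:

```python
# Hedged sketch of the workaround: one dedicated GPU per server process.
# The model-repository path and port numbers are assumptions.
import os
import subprocess

for gpu_id, (http_port, grpc_port, metrics_port) in [("0", (8000, 8001, 8002)),
                                                     ("1", (9000, 9001, 9002))]:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_id)  # pin this process to one GPU
    subprocess.Popen(
        [
            "tritonserver",
            "--model-repository=/app/triton-pipeline",
            f"--http-port={http_port}",
            f"--grpc-port={grpc_port}",
            f"--metrics-port={metrics_port}",
        ],
        env=env,
    )
```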
@byshiue @schetlur-nv any chance you'll take a look at the issue anytime soon?
Maybe this is going to help resolve the issue, or maybe it helps anyone else who hits it: this happens on drivers 535.54.03 and 535.129.03, on both SXM and PCIe setups. It also fails on various trtllm versions, v0.8.0 and v0.10.0 (the latest as of the moment of writing). Updating to driver 550.90.07 helped with both trtllm versions.
System Info
/proc/meminfo
Who can help?
@byshiue @schetlur-nv
Information

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
The problem is that when running several model instances on one GPU (in one container or in different containers), one of the instances fails with a CUDA error. I've found a setup which allows me to reliably reproduce it using an open-source model and scripts from this repository.
1. Take the model https://huggingface.co/NousResearch/Nous-Hermes-Llama2-13b and save it locally.
2. Convert the model inside the v0.7.1 container using examples/llama/build.py (a hedged example invocation is sketched below).
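For reference, a minimal sketch of the conversion step; flags like these existed in the v0.7.x llama example, but the exact paths and values here are assumptions, not my exact command:

```python
# Hedged sketch of the FP16 engine build (paths and flag values are assumptions).
import subprocess

subprocess.run(
    [
        "python", "examples/llama/build.py",
        "--model_dir", "/models/Nous-Hermes-Llama2-13b",  # local HF checkpoint
        "--dtype", "float16",
        "--output_dir", "/app/triton-pipeline/engines",   # assumed engine location
        "--max_batch_size", "8",
    ],
    check=True,
)
```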
3. Launch the server twice on the same GPU. You can do it twice in the same container or in two different containers; I've reproduced it both ways. A minimal launch sketch follows below.
The directory /app/triton-pipeline is attached as an archive: triton-pipeline.tar.gz.
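Here is a minimal sketch, assuming both instances share one GPU and differ only in their ports; only 8543/8544 come from the actual setup, the GRPC and metrics port numbers are my assumptions:

```python
# Minimal sketch: two tritonserver instances sharing one GPU.
# Only ports 8543/8544 come from the issue; the other ports are assumptions.
import subprocess

for http_port, grpc_port, metrics_port in [(8543, 8545, 8547), (8544, 8546, 8548)]:
    subprocess.Popen(
        [
            "tritonserver",
            "--model-repository=/app/triton-pipeline",
            f"--http-port={http_port}",
            f"--grpc-port={grpc_port}",
            f"--metrics-port={metrics_port}",
        ]
    )
```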
Now you have 2 models running on ports 8543 and 8544.
4. Launch load tests: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/tools/gpt/benchmark_core_model.py. They are actually broken because some parameters are now uint32 instead of int32, so I made a patch named utils_diff.patch, which should be applied to this file: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/tools/utils/utils.py. Since the scripts are not made for load testing, I had to add a small workaround; a sketch of the load loop is shown below.
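The workaround boils down to keeping constant pressure on both servers. Here is a hedged approximation using the Triton HTTP client; the tensor names follow the tensorrt_llm core-model convention that benchmark_core_model.py targets, but the shapes, dummy values, and model name are assumptions:

```python
# Hedged load-loop sketch. Tensor names follow the tensorrt_llm core-model
# convention; shapes and dummy values are illustrative assumptions.
import threading
import numpy as np
import tritonclient.http as httpclient

def hammer(url: str) -> None:
    client = httpclient.InferenceServerClient(url=url)
    input_ids = np.ones((1, 128), dtype=np.int32)     # dummy prompt tokens
    input_lengths = np.array([[128]], dtype=np.int32)
    output_len = np.array([[64]], dtype=np.uint32)    # uint32 is what the patch fixes
    inputs = [
        httpclient.InferInput("input_ids", list(input_ids.shape), "INT32"),
        httpclient.InferInput("input_lengths", list(input_lengths.shape), "INT32"),
        httpclient.InferInput("request_output_len", list(output_len.shape), "UINT32"),
    ]
    for infer_input, arr in zip(inputs, (input_ids, input_lengths, output_len)):
        infer_input.set_data_from_numpy(arr)
    while True:  # keep constant pressure on the server
        client.infer("tensorrt_llm", inputs)

# Hit both servers concurrently to trigger the crash.
threads = [threading.Thread(target=hammer, args=(f"localhost:{port}",))
           for port in (8543, 8544)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```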
After 1 to 5 minutes, one of the servers generally crashes with an error:
I've attached the full log as log.txt; I gathered it with the --log-verbose 3 option. Once again, I'm listing the files that I've attached:
triton-pipeline.tar.gz - a folder with the model configuration which I use to reproduce this issue
utils_diff.patch - a patch for utils.py which fixes outdated datatypes and allows us to use benchmark_core_model.py
log.txt - the full failure log for your convenience. It was written with --log-verbose 3, so there is a lot of information. You can find the failure at the end of the file or by searching for the first ERROR on line 11130.

Expected behavior
I expected that running separate instances of the model would be independent of each other and would not lead to any runtime failures.
Actual behavior
In this setup, one of the instances randomly fails after a couple of minutes once the request rate is high enough.
Additional notes
I've checked it both within a single Docker container and across two separate containers. Initially I found this issue with fp8 inference, but I decided to reproduce it in fp16 to simplify the investigation. The issue also reproduces on a version of the code from November as well as on v0.7.1; I haven't tested other versions.