Describe the issue
In production I run a long-t5 model for data processing, and I tried upgrading to onnxruntime-gpu 1.19.0. I run 3 processes on the same instance, sharing the GPU, but all processes appear to freeze after a gradual increase in GPU memory usage. In nvidia-smi I could still see the processes holding some GPU memory (not all of it), but the application logs just stopped. Rolling back to onnxruntime 1.18.0 works fine. Current dependencies do not allow upgrading to 1.20.0. I know that sharing a GPU between processes may not be best practice, but it is cost efficient and has worked until now.
Any ideas what could be eating up the memory?
To reproduce
The model I use: https://huggingface.co/agemagician/mlong-t5-tglobal-large
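For context, a minimal sketch of how each worker process creates its ONNX Runtime session and runs inference. The file name, CUDA provider option values, and input names are assumptions (based on a typical exported encoder graph), not the exact production code:

```python
import numpy as np
import onnxruntime as ort

# CUDA EP options; values are illustrative, not the exact production settings.
cuda_options = {
    "device_id": 0,
    "arena_extend_strategy": "kNextPowerOfTwo",
    "gpu_mem_limit": 8 * 1024 * 1024 * 1024,  # per-process cap when sharing the GPU
}

# "encoder_model.onnx" is a placeholder for the exported long-t5 graph.
session = ort.InferenceSession(
    "encoder_model.onnx",
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)

# Dummy batch; real inputs come from the tokenizer in the data-processing loop.
input_ids = np.ones((1, 512), dtype=np.int64)
attention_mask = np.ones((1, 512), dtype=np.int64)

# Each of the 3 processes runs a loop roughly like this; under 1.19.0 the
# GPU memory usage grows gradually until the processes stop logging.
for _ in range(1000):
    outputs = session.run(
        None, {"input_ids": input_ids, "attention_mask": attention_mask}
    )
```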
Urgency
No response
Platform
Linux
OS Version
Amazon Linux AMI 2.0.20230606 x86_64
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.19.0
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.8
Model File
No response
Is this a quantized model?
No