microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

CUDA memory increasing and process freeze [Performance] #22872

Open kkluonaitis opened 6 days ago


Describe the issue

In production I run a long-t5 model for data processing, and I tried upgrading to onnxruntime-gpu 1.19.0. I run 3 processes on the same instance, sharing GPU resources, but after a gradual increase in GPU memory usage all processes effectively freeze. In nvidia-smi I could still see the processes holding some GPU memory (not all of it), but the application logs simply stopped. Rolling back to onnxruntime 1.18.0 works fine; current dependencies do not allow upgrading to 1.20.0. I know that sharing a GPU between processes may not be best practice, but it is cost efficient and worked until now.

Any ideas what could be eating up the memory?

To reproduce

The model I use: https://huggingface.co/agemagician/mlong-t5-tglobal-large
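For context, a minimal sketch of how the sessions might be set up so each process keeps a bounded CUDA arena (this is not the reporter's actual code; `gpu_mem_limit` and `arena_extend_strategy` are documented `CUDAExecutionProvider` options, and the model path and 4 GiB cap are placeholders):

```python
# Hedged sketch: capping the per-process CUDA memory arena so several
# processes can share one GPU. The option names below are documented
# CUDAExecutionProvider session options; the path and limit are
# placeholders, not values from this issue.

CUDA_OPTIONS = {
    "device_id": 0,
    "gpu_mem_limit": 4 * 1024 * 1024 * 1024,      # cap the arena at 4 GiB
    "arena_extend_strategy": "kSameAsRequested",  # extend only by what is requested
}

def make_session(model_path: str):
    """Create an inference session that prefers CUDA with a capped arena."""
    import onnxruntime as ort  # requires the onnxruntime-gpu wheel
    providers = [("CUDAExecutionProvider", CUDA_OPTIONS), "CPUExecutionProvider"]
    return ort.InferenceSession(model_path, providers=providers)
```

With a cap like this, an unbounded arena in one process cannot starve the other two; whether the 1.19.0 regression is in the arena itself would still need confirmation from the maintainers.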

Urgency

No response

Platform

Linux

OS Version

Amazon Linux AMI 2.0.20230606 x86_64

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.19.0

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.8

Model File

No response

Is this a quantized model?

No