microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] onnxruntime session uses 5x more system memory if torch is imported #13662

Open darintay opened 1 year ago

darintay commented 1 year ago

Describe the issue

My ONNX session was using far more system memory than expected. I narrowed it down to occurring only when torch is imported.

Loading the same model uses roughly 5G of system memory with torch imported versus roughly 1G without.

To reproduce

Run this script with and without the torch import line. I'm using https://github.com/onnx/models/blob/main/vision/classification/mnist/model/mnist-8.onnx, though the behavior seems to be the same for every model I've tried.

# Try running with this line commented out and with it uncommented.
# import torch
import onnxruntime

session = onnxruntime.InferenceSession("mnist-8.onnx", providers=["CUDAExecutionProvider"])

# Report this process's resident set size (RSS) after session creation.
import psutil
import os
print(f"Using {psutil.Process(os.getpid()).memory_info().rss/1024/1024:.2f}M")

Prints 942.93M if import torch is commented out. Prints 5114.8M if import torch is left in.

(1G still seems like a lot for a dinky mnist model, but much better than 5G!)

Urgency

No response

Platform

Linux

OS Version

Ubuntu 20.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.13.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.4 and CUDA 11.6

Model File

No response

Is this a quantized model?

No

darintay commented 1 year ago

Some of that memory usage can be attributed to torch itself, but not all of it.

If I update the script to print memory usage just before creating the ONNX session as well:


# Without torch
Before: 50.22M
After: Using 932.09M

# With torch
Before: 1025.58M
After: Using 5109.57M
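
For reference, a minimal sketch of that updated script (a reconstruction; the thread doesn't show it verbatim, and the model file and import order are assumptions based on the original repro above):

# Reconstructed measurement script - "mnist-8.onnx" and the import order
# are assumptions carried over from the original repro.
# import torch  # toggle this line to compare
import psutil
import os

proc = psutil.Process(os.getpid())
print(f"Before: {proc.memory_info().rss/1024/1024:.2f}M")

import onnxruntime
session = onnxruntime.InferenceSession("mnist-8.onnx", providers=["CUDAExecutionProvider"])

print(f"After: Using {proc.memory_info().rss/1024/1024:.2f}M")
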
tianleiwu commented 1 year ago

@darintay, try appending the following code to inspect memory:

import os
import psutil
from psutil._compat import get_terminal_size
from psutil._common import bytes2human

def safe_print(s):
    # Truncate to the terminal width and fall back to ASCII on encoding errors.
    s = s[:get_terminal_size()[0]]
    try:
        print(s)
    except UnicodeEncodeError:
        print(s.encode('ascii', 'ignore').decode())

p = psutil.Process(os.getpid())

# Print every memory mapping of this process with its resident set size (RSS).
templ = "%-20s %10s  %-7s %s"
print(templ % ("Address", "RSS", "Mode", "Mapping"))
total_rss = 0
for m in p.memory_maps(grouped=False):
    total_rss += m.rss
    safe_print(templ % (
        m.addr.split('-')[0].zfill(16),
        bytes2human(m.rss),
        m.perms,
        m.path))
print("-" * 31)
print(templ % ("Total", bytes2human(total_rss), '', ''))

On my machine (ORT 1.13.1 with CUDA 11.7, torch 1.12.1+cu116), the results are as follows:

Without torch:
CPU memory usage: before=66.6 MB, peak=1089.3 MB
GPU memory usage: before=396.6 MB, peak=778.3 MB

With torch:
CPU memory usage: before=231.6 MB, peak=2371.0 MB
GPU memory usage: before=396.6 MB, peak=1208.3 MB

Using the above method, I see many memory blocks mapped right after libcudnn_ops_infer.so.8:

[screenshot: memory map listing]

Maybe it is related to cuDNN workspace allocation. However, changing the cuDNN settings in both ORT and Torch does not seem to help.
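
The thread doesn't record which settings were changed; as an illustration, these are the kinds of cuDNN-related knobs available (the provider options below do exist in the ORT CUDA execution provider, and torch.backends.cudnn is standard PyTorch, but treating them as the exact settings tried here would be an assumption):

# Illustrative sketch only - not necessarily the settings tried in this thread.
import onnxruntime

cuda_options = {
    # Limit cuDNN convolution workspace instead of letting it take the maximum.
    "cudnn_conv_use_max_workspace": "0",
    # Avoid the exhaustive convolution algorithm search.
    "cudnn_conv_algo_search": "HEURISTIC",
}
session = onnxruntime.InferenceSession(
    "mnist-8.onnx",
    providers=[("CUDAExecutionProvider", cuda_options)],
)

# On the torch side, cuDNN benchmarking also influences workspace use:
# import torch
# torch.backends.cudnn.benchmark = False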

darintay commented 1 year ago

Thanks for looking into this!

I've used your code with my example to generate 3 files.

Unfortunately it looks like most of the memory is just in 'heap', so I'm not sure how helpful this will be.

mem_notorch_after_onnx_session.txt
mem_torch_after_onnx_session.txt
mem_torch_before_onnx_session.txt
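
(To confirm where the memory sits in a live process, one could aggregate RSS by mapping path; a quick sketch, not from the thread:)

import os
import psutil
from collections import defaultdict

# Sum resident memory per mapping path; '[heap]' and anonymous mappings
# are where the unexplained growth shows up in the attached dumps.
totals = defaultdict(int)
for m in psutil.Process(os.getpid()).memory_maps(grouped=False):
    totals[m.path or "[anonymous]"] += m.rss

# Show the ten largest contributors.
for path, rss in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{rss/1024/1024:10.1f}M  {path}")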

darintay commented 1 year ago

It definitely seems to be environment-related.

Using nvcr.io/nvidia/pytorch:22.10-py3 and installing onnxruntime-gpu in there, I get much more reasonable memory numbers running this script. I should have tried that earlier!

I'll see if I can narrow it down and/or work around it with different package versions.

darintay commented 1 year ago

It seems to depend entirely on the PyTorch version:

torch-1.9.0+cu111 (installed via "pip install torch==1.9.0+cu111 --find-links https://download.pytorch.org/whl/torch_stable.html")
Before ONNX: 1027.98M
After ONNX: 5124.55M

torch-1.10.1+cu113
Before ONNX: 330.80M
After ONNX: 4711.17M

torch-1.11.0+cu115
Before ONNX: 253.25M
After ONNX: 3090.81M

torch-1.13.0+cu116
Before ONNX: 251.27M
After ONNX: 2345.90M

The pytorch:22.10-py3 NGC image reproduces the ~2000M memory usage vs ~1000M without torch imported, which still seems high, if there's any interest in investigating.

It still seems strange to me that whether torch is imported has such a huge impact on the memory usage of my ONNX session, but at least a PyTorch upgrade gets me to more reasonable numbers.

tianleiwu commented 1 year ago

@darintay, in the mem_torch_before_onnx_session.txt (https://github.com/microsoft/onnxruntime/files/10026594/mem_torch_before_onnx_session.txt) you attached:

00007efecea60000         632.9M  r-xp    /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cuda_cpp.so
00007eff4d5ef000         196.3M  rw-p    /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cuda_cpp.so

In mem_torch_after_onnx_session.txt:

00007efecea60000         913.0M  r-xp    /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cuda_cpp.so
00007eff4d5ef000         196.3M  rw-p    /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cuda_cpp.so

That's about 1G in total for the libtorch_cuda_cpp.so mappings alone; note that the read-execute mapping grows from 632.9M to 913.0M once the ONNX session is created.