microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] Find out why the GPU memory allocated with `CUDAExecutionProvider` is much larger than the ONNX size #14526

Open fxmarty opened 1 year ago

fxmarty commented 1 year ago

Describe the issue

I have a model that is 4137 MB as a .onnx file, exported from a PyTorch ScriptModule through torch.onnx.export.

When loading the ONNX model through an InferenceSession using CUDAExecutionProvider, 18081 MB of memory gets allocated on GPU.

When loading the .pt weights into a ScriptModule on GPU, PyTorch allocates only 5386 MB, which is more reasonable given that the .pt file is 4267 MB.

Hence, I am wondering how GPU memory allocation is performed with CUDAExecutionProvider, and why it may be more than 4x the ONNX size. It could very well be that my ONNX model is ill-formed, so I'd like to find out where.

To reproduce

Use this script to find out peak GPU memory allocation (MB):

a=0
while true; do
  # memory.used in MiB for GPU number 1 (second line of the per-GPU nvidia-smi output)
  b=$(nvidia-smi --query-gpu=memory.used --format=csv | grep -v memory | awk '{print $1}' | sed -n 2p)
  # keep the running maximum and print it, converted from MiB to MB
  [ $b -gt $a ] && a=$b && c=$(bc -l <<<"1.048576*$a") && echo $c
  sleep .5
done
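
Alternatively, a small Python sketch of the same peak-memory polling, assuming the pynvml package is installed and that GPU number 1 is the one being measured:

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(1)  # GPU number 1, as in the shell script above

peak_mb = 0.0
while True:
    # memory.used is reported in bytes; convert to MB
    used_mb = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1e6
    if used_mb > peak_mb:
        peak_mb = used_mb
        print(f"{peak_mb:.0f} MB")
    time.sleep(0.5)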

Use this script to load the ScriptModule:

import torch
import time

scripted_pipeline = torch.load("scripted_sd_cuda.pt")
scripted_pipeline = scripted_pipeline.to("cuda")
time.sleep(3)  # keep the process alive long enough for the monitoring loop to record the peak

Use this script to load the ONNX model:

import onnxruntime as ort
import time

providers = ["CUDAExecutionProvider"]

session_options = ort.SessionOptions()
session_options.graph_optimization_level = (
    ort.GraphOptimizationLevel.ORT_ENABLE_BASIC
)

session = ort.InferenceSession("stable_diffusion_pipeline.onnx", providers=providers, sess_options=session_options)
time.sleep(3)
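
As a side note, it is worth checking which execution providers the session actually ended up with, since ORT falls back to CPUExecutionProvider if the CUDA EP fails to load, which would make the GPU measurement meaningless. A minimal check on the session created above:

# CUDAExecutionProvider should be listed first if it was registered successfully
print(session.get_providers())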

Urgency

normal

Platform

Linux

OS Version

Linux 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.13.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.7

Model File

use:

wget https://felix-playground.s3.eu-west-3.amazonaws.com/onnx_model.zip
wget https://felix-playground.s3.eu-west-3.amazonaws.com/stable_diffusion_pipeline.onnx_data
wget https://felix-playground.s3.eu-west-3.amazonaws.com/scripted_sd_cuda.pt

The stable_diffusion_pipeline.onnx_data is a ~4 GB file with external data.

Is this a quantized model?

No

yuslepukhin commented 1 year ago

CUDA allocations are expensive, so ORT caches them in its own arena. It attempts to re-use that memory, but the overall footprint is high.

You can try disabling it with sessionOptions.DisableCpuMemoryArena() and see how much your performance depends on it. "Cpu" is a misnomer here, as the arena was never meant only for CPU memory.

To deal with the memory growth, there is a shrink feature that can periodically attempt to shrink the arena's memory. It is not perfect.

You can read about it here.
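
If I read the Python API right, the equivalent of DisableCpuMemoryArena() there is the enable_cpu_mem_arena flag on SessionOptions; a minimal sketch, reusing the model path from the report above:

import onnxruntime as ort

sess_options = ort.SessionOptions()
# Python counterpart of the DisableCpuMemoryArena() call mentioned above
sess_options.enable_cpu_mem_arena = False

session = ort.InferenceSession(
    "stable_diffusion_pipeline.onnx",
    sess_options=sess_options,
    providers=["CUDAExecutionProvider"],
)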

elephantpanda commented 1 year ago

Same with DirectML. It can use 5-6 GB of GPU VRAM plus lots of RAM for about 2 GB of ONNX files. The DirectML people say they are working on reducing the memory usage, so fingers crossed.

fxmarty commented 1 year ago

@yuslepukhin @pauldog Thank you for your advice!

I tried using the CUDA provider options {"arena_extend_strategy": "kSameAsRequested", "cudnn_conv_algo_search": "HEURISTIC"} with no success.
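
In case it helps others reproduce, this is how I understand such provider options are passed in the Python API (a sketch; the model path is a placeholder):

import onnxruntime as ort

cuda_provider_options = {
    "arena_extend_strategy": "kSameAsRequested",
    "cudnn_conv_algo_search": "HEURISTIC",
}

session = ort.InferenceSession(
    "stable_diffusion_pipeline.onnx",  # placeholder path
    providers=[("CUDAExecutionProvider", cuda_provider_options)],
)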

Enabling memory arena shrinkage through run options did not help either:

import onnxruntime as onnxrt

run_options = onnxrt.RunOptions()
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "cpu:0;gpu:0")
session.run([], {input_name: x}, run_options)

This is a good reference: https://github.com/microsoft/onnxruntime/issues/14038#issuecomment-1368306161

I still don't really understand what the difference with PyTorch is, where the huge memory allocation comes from, and why PyTorch is more memory-efficient out of the box. Digging into provider options to get something to (maybe) work is not very user-friendly. I think I'll give up for now.

yuslepukhin commented 1 year ago

Looking at the title of this issue, I am not sure you are looking at the right thing. Are you comparing the ONNX file size with the memory requirements at runtime?

fxmarty commented 1 year ago

Yes, roughly - I take it as a proxy, since I don't expect activations with skip connections or anything like that to be cached. More precisely, I am comparing PyTorch's memory usage against ONNX Runtime's on CUDAExecutionProvider.

fxmarty commented 1 year ago

I still haven't solved the issue. GPT-J loads fine in 12 GB of memory with PyTorch, while 20 GB is not enough for ORT's ort.InferenceSession.

Edit: even 40 GB is not enough. It does not make for a good user experience if one needs to tweak undocumented parameters and read GitHub issues to get something to work.

fxmarty commented 1 year ago

For reference, the model is https://huggingface.co/fxmarty/gpt-j-6B-onnx/tree/main . cc @tianleiwu @yufenglee

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer
import time

model_id = "gptj_onnx"
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

print("loading model")
start = time.time()
model = ORTModelForCausalLM.from_pretrained(model_id, provider="CUDAExecutionProvider")
print(f"Loading took: {time.time() - start:.2f} s")

pranavsharma commented 1 year ago

@fxmarty have you tried disabling the arena altogether?

tianleiwu commented 1 year ago

@fxmarty, for GPT-J, I saw many Cast nodes from fp16 to fp32:

[image: graph screenshot showing the inserted Cast nodes]

Those parts will need more memory to run in fp32. The general guideline is to run graph optimizations first (with the CUDA EP), then convert the graph to fp16.

For GPT-J, ORT does not have GPT-J-specific optimizations, so probably only partial optimization is applied.

Also, I saw that an If operator is used to combine two subgraphs. It is recommended to export two ONNX models separately (one for the initial run without past state, another with past state). Those two ONNX models can be used as subgraphs of the BeamSearch operator, which will call them in sequential order. This way, you do not need the If operator.
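
For anyone following that guideline, here is a rough sketch using the onnxruntime.transformers optimizer. Treat it as an assumption rather than a recipe: the input/output paths are placeholders, and the model_type, num_heads and hidden_size values are my guesses for GPT-J-6B.

from onnxruntime.transformers import optimizer

# Run ORT graph optimizations on the fp32 export first...
opt_model = optimizer.optimize_model(
    "decoder_model.onnx",   # placeholder: path to the fp32 export
    model_type="gpt2",      # assumption: treat GPT-J as a GPT-2-style decoder
    num_heads=16,           # assumption: GPT-J-6B attention heads
    hidden_size=4096,       # assumption: GPT-J-6B hidden size
    use_gpu=True,
)

# ...then convert the optimized graph to fp16, keeping fp32 graph inputs/outputs.
opt_model.convert_float_to_float16(keep_io_types=True)
opt_model.save_model_to_file("decoder_model_fp16.onnx", use_external_data_format=True)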

fxmarty commented 1 year ago

Thank you for your suggestions @pranavsharma @tianleiwu!

The model was already exported in fp16 with torch.onnx.export. I did not use ORT's fp16 conversion because of the issues I hit with incomplete symbolic shape inference, which we discussed in another issue.

Exporting in fp16 without running ORT optimizations, we don't get these InsertedCast nodes. So it appears the InsertedCast nodes were added by optimize_model, and it seems that optimize_model should not be used when the model is already exported in fp16 from PyTorch. Is that correct?

Also, the issue is not about running the model; it is just about loading it into an InferenceSession:

import onnxruntime as ort

session = ort.InferenceSession("decoder_model_merged.onnx", providers=["CUDAExecutionProvider"])

This alone allocates 66419 MiB and will OOM if less GPU memory than that is available.

Running

from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16).to("cuda")

allocates 12157 MiB.

I would like to avoid using the BeamSearch or GreedySearch operators, as I think they are only supported by CPUExecutionProvider and CUDAExecutionProvider. But you are right that we could support them as an option in the ONNX export - especially if they work with onnxruntime-web.

I will try disabling the arena.

cqray1990 commented 1 year ago

@pranavsharma how to disable the arena?

igormis commented 1 year ago

I also have some issues when doing inference with an ONNX-optimized XLM-RoBERTa model. I have the following settings:

import onnxruntime as rt

cuda_provider_options = {"arena_extend_strategy": "kSameAsRequested", "do_copy_in_default_stream": False, "cudnn_conv_use_max_workspace": "1"}
cpu_provider_options = {"arena_extend_strategy": "kSameAsRequested", "do_copy_in_default_stream": False}
execution_providers = [("CUDAExecutionProvider", cuda_provider_options), ("CPUExecutionProvider", cpu_provider_options)]
sess = rt.InferenceSession("roberta_onnx_model/__MODEL_PROTO.onnx", providers=execution_providers)

For the run options I have:

run_options = rt.RunOptions()
run_options.add_run_config_entry("kOrtRunOptionsConfigEnableMemoryArenaShrinkage", "cpu:0,gpu:0")
run_options.add_run_config_entry("kOrtSessionOptionsUseDeviceAllocatorForInitializers", "1")

and here is the inference part:

logits = sess.run([label_name], {input_name: X_test_sample}, run_options)[0]

The GPU RAM usage starts at 3.5 GB and after a while increases to 7 GB and more. Any suggestions on this?
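
One thing to check, if I'm reading the config-key headers correctly: add_run_config_entry and add_session_config_entry expect the string values of those constants, not the C++ constant names, and the initializer setting is a session config entry rather than a run config entry. A sketch of what I believe the intended calls look like (the shrinkage key matches the one used earlier in this thread; treat the exact strings as assumptions):

import onnxruntime as rt

sess_options = rt.SessionOptions()
# value of kOrtSessionOptionsUseDeviceAllocatorForInitializers (a session config entry)
sess_options.add_session_config_entry("session.use_device_allocator_for_initializers", "1")

run_options = rt.RunOptions()
# value of kOrtRunOptionsConfigEnableMemoryArenaShrinkage; devices are semicolon-separated
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "cpu:0;gpu:0")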

lukas-folle-snkeos commented 1 year ago

Is there a way to associate memory usage with individual layers and identify problematic ones?

tianleiwu commented 1 year ago

@lukas-folle-snkeos, you can build ORT from source with node input/output dumping enabled and apply some debug code (or build this branch): https://github.com/microsoft/onnxruntime/commit/6c0a0aacfdc3b63c99921e7ead69bbeb46dbb3f4. Then you can see the memory allocations.
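
For what it's worth, a rough sketch of driving that dumping from Python, assuming a build compiled with the node I/O dumping cmake define enabled; the environment variable names come from ORT's debug utilities and may differ between versions, so treat them as assumptions:

import os

# Only has an effect in a build configured with
# --cmake_extra_defines onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS=1
os.environ["ORT_DEBUG_NODE_IO_DUMP_SHAPE_DATA"] = "1"   # dump tensor shapes per node
os.environ["ORT_DEBUG_NODE_IO_DUMP_OUTPUT_DATA"] = "0"  # skip full tensor contents

import onnxruntime as ort

session = ort.InferenceSession("decoder_model_merged.onnx", providers=["CUDAExecutionProvider"])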

peterukk commented 6 months ago

Only vaguely related, but in case someone with OOM issues during inference stumbles upon this thread: I encountered strange behaviour (a bug?) with ONNX models that use multiple inputs, i.e. outputs = session.run(None, {inp_name1: x1, inp_name2: x2, ...}). When running inference with CUDAExecutionProvider, I got "Failed to allocate memory" errors at way, way smaller batch sizes than expected. This was fixed by changing my model so that it only has one (larger) array as input, from which x1, x2, ... are extracted inside the model. I'm using onnx 1.12.0, tf2onnx 1.14.0 and opset 17.