microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Massive Memory Leak C# DirectML #14466

Closed: elephantpanda closed this 1 year ago

elephantpanda commented 1 year ago

Describe the issue

Create some big inference sessions (say >2 GB).

This increases memory by 2 GB on the GPU and also in RAM by 2 GB.

Call session.Dispose();

This clears the memory from the GPU but keeps the memory in RAM.

Thus 2 GB of RAM is never freed.

Close the program. The RAM is finally freed.

Why is this a problem? When working in an IDE such as Unity, there needs to be a way to free the RAM without closing the IDE. Each time you run the program without exiting the IDE, RAM usage grows until an out-of-memory error occurs.

What could be happening? Maybe it is loading the ONNX file into RAM but not freeing it after putting it on the GPU. Just a guess.

Possible reason: this seems to happen mostly with big (~2 GB) files, which I've noticed often have their weights in separate files. So perhaps the runtime is freeing the memory for the main file but not for the linked files such as weights.pb.

To reproduce

Using the latest C# ONNX Runtime developer build (1.14; also tried 1.15).

Create some sessions with DirectML (do nothing with them). Dispose of the sessions.

Have a look at the GPU and RAM monitors in Task Manager.

Close the program.
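
A minimal sketch of these steps (model.onnx is a placeholder path for any large model; the Microsoft.ML.OnnxRuntime.DirectML package is assumed):

using System;
using Microsoft.ML.OnnxRuntime;

class Repro
{
    static void Main()
    {
        // Create a few sessions with the DirectML EP, do nothing with them, then dispose.
        for (int i = 0; i < 3; ++i)
        {
            SessionOptions so = new SessionOptions();
            so.AppendExecutionProvider_DML();
            InferenceSession session = new InferenceSession("model.onnx", so);
            session.Dispose(); // GPU memory drops here, but the RAM does not come back
            so.Dispose();
        }

        Console.ReadLine(); // pause here and check the GPU and RAM monitors in Task Manager
    }
}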

Urgency

No response

Platform

Windows

OS Version

Windows 10

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

Microsoft.ML.OnnxRuntime.Managed.1.15.0-dev-20230128-0428-7aecb2150f

ONNX Runtime API

C#

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

No response

elephantpanda commented 1 year ago

I have found a partial fix: make sure the ONNX model is a single file. This fix only works for ONNX files under 2 GB.

uchuusen commented 1 year ago

I've noticed the same issue in Python as well. If I have an app that loads a Stable Diffusion ONNX model and then unloads it repeatedly, for the purpose of switching models or clearing out VRAM, it seems to lose track of a couple of gigabytes of system RAM every time it does so. In my case, I'm using an integrated GPU, an AMD Vega 10 on a Ryzen 3700U.

RyanUnderhill commented 1 year ago

@fdwr Is this a DirectML issue or is this C#?

elephantpanda commented 1 year ago

> @fdwr Is this a DirectML issue or is this C#?

I think it's a general issue of the weights.pb file not getting released from memory when the ONNX file is split into multiple parts.

RyanUnderhill commented 1 year ago

Is it possible to create a minimal repro scenario and paste it here? Just so we're sure we're doing the same thing you are.

elephantpanda commented 1 year ago

The scenario is as follows:

SessionOptions so = new SessionOptions
{
    ExecutionMode = ExecutionMode.ORT_SEQUENTIAL,
    GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_EXTENDED
};
so.AppendExecutionProvider_DML();
InferenceSession session = new InferenceSession("model.onnx", so);
session.Dispose();

If you try this with an ONNX model that is in two parts, model.onnx (100 MB) plus a separate weights.pb file (1.5 GB), it seems not to release the weights.pb data from memory. Whereas if it is a single file, model.onnx (1.6 GB), then it all gets cleared from RAM.

This is not a problem for me anymore, since I now make sure I only use single-file ONNX models and not ones that are in multiple parts. Mind you, this may still be a problem for people running bigger ONNX models, since they must be split if they are over 2 GB.

Seems like a simple bug in which auxiliary files are not getting released.

Here is a conversion script I used.

fdwr commented 1 year ago

> @fdwr Is this a DirectML issue or is this C#?

Ryan: πŸ€” My hunch is that it's a general GC resource lifetime issue (seeing both C# and Python repros in the comments), because the DML EP has no awareness of, or difference in behavior for, external vs internal weights - it just uses whatever was passed to it from ORT. AFAIK, we never did any work on the DML EP to support external weights, so whoever added support must have done it in a way that works with all the EPs generically.

elephantpanda commented 1 year ago

In general, can I just say that it would be beneficial to release as much RAM as possible after loading the values from the files, including releasing file handles and anything else.

Many people are working with 1-2 GB model files on consumer PCs, which in general have about 8 GB of VRAM and 8-12 GB of RAM if they're lucky. So every bit of memory counts.

Thanks!

elephantpanda commented 1 year ago

> > @fdwr Is this a DirectML issue or is this C#?
>
> Ryan: πŸ€” My hunch is that it's a general GC resource lifetime issue [...]

Hello, is anyone working on this? It seems like lots of people have pointed out the same problem, but nobody at Microsoft knows how to fix it? Seems like a 5-minute fix for the right person. Just unload the external weights files once they've been consumed.

pranavsharma commented 1 year ago

The session should have released all memory on destruction. There's no known issue here. Have you tried using our C/C++ API (which has deterministic destruction)?
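
(In C#, deterministic disposal can be expressed with using blocks - a minimal sketch, assuming the same placeholder model path; whether this avoids the leak is exactly what's in question:)

using System;
using Microsoft.ML.OnnxRuntime;

class DeterministicDispose
{
    static void Main()
    {
        using (SessionOptions so = new SessionOptions())
        {
            so.AppendExecutionProvider_DML(0);
            using (InferenceSession session = new InferenceSession("model.onnx", so))
            {
                // run inference here
            } // session.Dispose() runs deterministically at the end of this block
        }
    }
}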

fdwr commented 1 year ago

Sorry Mr @pauldog, but I haven't reproduced it :/. I tried with ORT 1.9 and ORT 1.14.1 using the Stable Diffusion unet with a separate weights.pb file. The memory grows huge (gigabytes) but then falls back down after the Dispose (and before the last WriteLine statement). Here's my code in its entirety:


using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

namespace OrtTestApp
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Begin...");

            SessionOptions sessionOptions = new SessionOptions
            {
                ExecutionMode = ExecutionMode.ORT_SEQUENTIAL,
                GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_EXTENDED,
                EnableMemoryPattern = false,
            };
            sessionOptions.AppendExecutionProvider_DML(0);
            InferenceSession session = new InferenceSession("D:\\stable_diffusion_onnx\\unet\\model.onnx", sessionOptions);
            session.Dispose();

            Console.WriteLine("End...");
        }
    }
}

And to confirm, this does not happen with the CPU provider? (removing AppendExecutionProvider_DML)

elephantpanda commented 1 year ago

Your program seems to be flawed because it closes right after it writes "End", whereas the memory leak occurs while the program stays open. Try adding a Console.ReadLine(); after the Console.WriteLine("End..."); and repeat the experiment.
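
For example, a sketch of that change (same placeholder model path as your program):

using System;
using Microsoft.ML.OnnxRuntime;

class OrtTestApp
{
    static void Main()
    {
        SessionOptions sessionOptions = new SessionOptions { EnableMemoryPattern = false };
        sessionOptions.AppendExecutionProvider_DML(0);
        InferenceSession session = new InferenceSession("model.onnx", sessionOptions);
        session.Dispose();

        Console.WriteLine("End...");
        Console.ReadLine(); // keeps the process alive so RAM can be inspected after Dispose
    }
}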

I am using C# in Unity. Using the same code:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using System;
using System.IO;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

class OrtTestApp : MonoBehaviour
{
    void Reload()
    {
        Debug.Log("Begin...");

        SessionOptions sessionOptions = new SessionOptions
        {
            ExecutionMode = ExecutionMode.ORT_SEQUENTIAL,
            GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_EXTENDED,
            EnableMemoryPattern = false,
        };
        sessionOptions.AppendExecutionProvider_DML(0);
        InferenceSession session = new InferenceSession("model.onnx", sessionOptions);
        session.Dispose();

        Debug.Log("End...");
    }

    private void Update()
    {
        if (Input.GetKeyDown(KeyCode.Space))
        {
            Reload();
        }
    }
}

[screenshot: Task Manager memory graph]

As you can see, there is a memory leak of over 1 GB.

Perhaps it's just a Unity thing, though I don't see how. I am using the latest dev build of ONNX Runtime DirectML and the ONNX Runtime managed library.

Same experiment but without the external weights file (no memory leak here): [screenshot: Task Manager memory graph]. Therefore my only conclusion is that the external weights files are not getting cleared by Dispose.

fdwr commented 1 year ago

@pauldog I repro it when running a longer loop. It's definitely, like you say, related to these separate weight files.
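
A sketch of such a loop (iteration count arbitrary; Environment.WorkingSet reports the process's physical memory in bytes):

using System;
using Microsoft.ML.OnnxRuntime;

class LoopRepro
{
    static void Main()
    {
        for (int i = 0; i < 20; ++i)
        {
            using (SessionOptions so = new SessionOptions())
            {
                so.AppendExecutionProvider_DML(0);
                using (InferenceSession session = new InferenceSession("model.onnx", so))
                {
                    // create and immediately dispose; with external weights, RAM keeps growing
                }
            }
            Console.WriteLine($"Iteration {i}: working set = {Environment.WorkingSet / (1024 * 1024)} MB");
        }
    }
}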

βœ… CPU EP + C#
βœ… CPU EP + C++
βœ… DML EP + C# + single model
βœ… DML EP + C++ + single model
❌ DML EP + C# + separate weights
❌ DML EP + C++ + separate weights (e.g. SD1.5)


ssube commented 1 year ago

FWIW, I don't think this behavior is specific to C# or DirectML. I have a Python program doing repeated inferences, and it reliably runs out of memory after about 95 runs, with a very similar memory graph, on both CUDA and DirectML (attempting to test on ROCm as well). It does seem to be related to models with external weights; I have not seen it on smaller single-file models, but the SD v1.5 UNet hits about ~300 GB of virtual memory on the CPU side within 100 runs, even if I close the sessions/options. Unloading and reloading the model after 10 runs frees up VRAM but does not fix the CPU side. Restarting the worker process entirely after 10 runs does free up all of the memory, but requires multiprocessing and reloading everything.
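
(ssube's program is Python, but in C# that worker-process workaround might look like the sketch below; InferenceWorker.exe is a hypothetical helper that loads the model, runs a batch of inferences, and exits:)

using System;
using System.Diagnostics;

class WorkerHost
{
    static void Main()
    {
        // Hypothetical: delegate each batch of runs to a short-lived child process,
        // so any leaked memory is reclaimed by the OS when the child exits.
        ProcessStartInfo psi = new ProcessStartInfo("InferenceWorker.exe", "model.onnx 10");
        using (Process worker = Process.Start(psi))
        {
            worker.WaitForExit();
        }
        Console.WriteLine("Worker exited; any leaked RAM has been returned to the OS");
    }
}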

elephantpanda commented 1 year ago

Yes, it needs to be fixed so that models >2 GB can run without leaking memory.

pranavsharma commented 1 year ago

I can repro this on CUDA as well. Will try to take a look this week.