microsoft / DirectML

DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.

AMD RX 7900 XTX VRAM leak #412

Open Kademo15 opened 1 year ago

Kademo15 commented 1 year ago

Hello there, I am new to neural networks and AI. I tried Stable Diffusion and noticed that my VRAM sits at 24/24 GB all the time. I then looked at the issues on their GitHub and found a few reporting the same problem; they all say it is because torch-directml has VRAM leaks and poor VRAM management. I searched here and found nothing that actually discusses this issue in the context of Stable Diffusion. I am on Windows 11, by the way.

I am basically asking for help:

- Where could I find more information about this issue?
- Can I maybe fix it myself?
- Where should I keep an eye out to check whether a fix is released?

I also created this issue to let you know that it exists; I tested on two AMD cards and the same memory leak occurred on both.

Thank you very much for the help.

iDeNoh commented 1 year ago

I can confirm this issue happens on my 6700 XT: 12/12 GB is used immediately, plus an extra 6 GB of system memory, and the only way to clear it is to relaunch. I'm running the latest version of Windows 10.

fbz0081 commented 1 year ago

7900 XTX user on Windows 10 here as well. The exact same issue with high VRAM usage (24/24 GB) happens to me in Stable Diffusion too. I've looked around and it happens to a few other 7900 XTX owners as well.

oscartorres9 commented 1 year ago

Exact same issue with my XFX 7900 XTX on Windows 10. The first couple of rounds of testing are very fast, but after a while it runs out of all 24 GB of VRAM, then it becomes really slow and does not release any memory until I exit the whole app manually. Please fix the problem. Thanks.

mvrxxx commented 1 year ago

Same issue with a 7900 XTX and Windows 11: while running Stable Diffusion it sometimes generates images and sometimes says there isn't enough memory available. I constantly see 23.8/24 GB VRAM usage while using torch-directml 0.2.0.dev230426.

NeedsMoar commented 1 year ago

For everyone wondering: the Python torch-directml API has no way of obtaining the total or in-use GPU memory for AMD devices, for some mind-bogglingly stupid reason (it uses DirectX 12 for the allocation, and DirectX 12 can report that), and apparently no way of explicitly moving a model off of the GPU when the program is done with it. The function to get available memory in the Python -> native interface file for torch_directml returns an array of zeros.
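
For reference, a minimal sketch of what that looks like from Python. The memory query is assumed to be exposed as `torch_directml.gpu_memory(device_id, mb_per_tile)` (that's how newer torch-directml builds appear to surface it; the name and defaults may differ in your version), and the CPU copy at the end is just the generic PyTorch workaround, not a DirectML API:

```python
import torch
import torch_directml

dml = torch_directml.device()

# Called with its defaults at program start this reportedly comes back as all
# zeros, so there is no usable "free VRAM" number to drive eviction decisions.
# (gpu_memory(device_id, mb_per_tile) is an assumption about the exposed name.)
print(torch_directml.gpu_memory())

# There is apparently no explicit way to release DirectML VRAM; the only
# generic PyTorch-level option is copying tensors back to the CPU and dropping
# the references so the allocator can eventually reclaim them.
x = torch.randn(1024, 1024, device=dml)
x = x.to("cpu")
del x
```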

In my experience it doesn't respect the VRAM in use by the display either, and will sometimes copy garbage into a section of the display buffer, glitching part of the screen for a frame. It can also cause problems like video driver resets if you accidentally leave the computer running with the UI open but idle, with ComfyUI at least.

At this point the only decent options are:

1) Compile your torch models to optimized ONNX, which will be much faster anyway and doesn't seem to have this issue (a minimal sketch follows this list). One of the node packages for ComfyUI has an ONNX loader if the base nodes can't do it. You'll need to install the ONNX DirectML runtime (and Olive to convert and optimize models), and wait for 4... It looks like TensorFlow can deal with memory better too.
2) Restart whatever UI you're using every so often, and wait for 4...
3) If you're not doing anything more complicated than a model plus a single LoRA, and don't care about resolutions other than 512x512, 768x512, and 512x768 (the only tuned resolutions aside from 768x768, which is for the lower-resolution version of SD 2.1 that it force-loads in place of the real one), or about the model necessarily working correctly (it doesn't seem to load config .yamls or determine clip-skip), run Nod.ai's Shark; it'll crank out images faster than heck and won't OOM on you.
4) ... and of course wait for AMD to finish the MIOpen Windows port, now that Windows has a ROCm- and HIP-compatible driver, so the PyTorch people can enable the ROCm backend on Windows and we don't have to deal with cobbled-together half-frameworks or tons of non-working features in existing Python UIs.
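
For option 1, here is a minimal sketch of running an already-exported ONNX model through onnxruntime's DirectML execution provider. `model.onnx` is a placeholder for a model you've converted and optimized yourself (e.g. with Olive), and the dummy feed only makes sense for a model whose first input is a single float32 tensor:

```python
import numpy as np
import onnxruntime as ort

# "model.onnx" is a placeholder path to a model exported/optimized beforehand.
session = ort.InferenceSession(
    "model.onnx",
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],  # DirectML first, CPU fallback
)

# Build a dummy feed matching the first input; dynamic dimensions are set to 1.
inp = session.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
dummy = np.random.rand(*shape).astype(np.float32)

outputs = session.run(None, {inp.name: dummy})
print([o.shape for o in outputs])
```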

NeedsMoar commented 10 months ago

Contrary to that post, I figured out a while ago that DirectML actually does have a way of getting used memory. The get-memory function in torch_directml, which appears to just return an array of zeroes if you run it at the start of a program, can be called with a tile size of 1 GB; the array elements are then filled in with values in [0.0, 1.0] starting from the beginning, each representing one gigabyte of actually-used memory. Calling it with no tile value is supposed to use some other size that obviously isn't correct, but any user-specified size produces the behavior above. I only suggest 1 GB because it's easier to work with. The problem is that everything else still sees all memory as taken (except for Sysinternals Process Explorer, which somehow manages to show the correct amount used), and there's no way to get the actual total memory without guessing based on the card model, which will break constantly.
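
A minimal sketch of that workaround, again assuming the query is exposed as `torch_directml.gpu_memory(device_id, mb_per_tile)` (the exact name and keyword arguments may differ between versions):

```python
import torch_directml

MB_PER_GIB = 1024  # 1 GiB tiles, as suggested above

# Each returned element is a fraction in [0.0, 1.0] of one 1 GiB tile that is
# actually in use, so summing the array approximates used VRAM in GiB.
tiles = torch_directml.gpu_memory(device_id=0, mb_per_tile=MB_PER_GIB)
used_gib = sum(tiles)
print(f"approx. VRAM in use: {used_gib:.2f} GiB")
```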

Freeing items is just done by deleting them in Python. I posted about this over on the ComfyUI bug tracker a while back, but the inability to get total VRAM (so something useful can be done with the numbers) and the unneeded complexity of converting an array of floats into a representation of bytes make it awkward. Actually, the total may be gettable by abusing this, assuming the function doesn't error out when called with a tile size larger than the GPU can handle and instead maxes out at that point: create a single ~1 GB tensor that fills a 1 GB bin as closely as possible without much spillage into the next one, then set the tile size to progressively higher common GPU memory amounts and check how close the result is to 1/memory_guess. The correct size, or at least the available total, should show up as almost exactly that percentage. From then on the function can be called with that number to get the percentage occupied, and tensors/models can be copied off the GPU and deleted as needed, but it's a huge pain to deal with considering the performance issues.
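
A rough sketch of that probing idea, with the same assumption about the `gpu_memory` signature and the caveat that it's entirely heuristic (the candidate list, probe size, and tolerance are made up for illustration):

```python
import gc
import torch
import torch_directml

dml = torch_directml.device()

# Allocate and touch ~1 GiB of float32 so it should fill one 1 GiB bin.
probe = torch.ones(268_435_456, dtype=torch.float32, device=dml)

# Try common total-VRAM sizes as the tile size, largest first. If the function
# really maxes out at the card's actual total, the ~1 GiB probe should occupy
# roughly 1/total of the single tile only when the guess matches the card.
best_guess = None
for total_gib in sorted([8, 10, 12, 16, 20, 24, 32, 48], reverse=True):
    frac = sum(torch_directml.gpu_memory(device_id=0, mb_per_tile=total_gib * 1024))
    if abs(frac * total_gib - 1.0) < 0.15:  # loose tolerance for allocator overhead
        best_guess = total_gib
        break

print("guessed total VRAM (GiB):", best_guess)

# Freeing really is just dropping the Python references.
del probe
gc.collect()
```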