lllyasviel / Fooocus

Focus on prompting and generating

AMD GPU crashing, halting system, due to high usage #1690

Open Tectract opened 8 months ago

Tectract commented 8 months ago

Read Troubleshoot

[x] I admit that I have read the Troubleshoot before making this issue.

Describe the problem

My AMD GPU can't handle the high memory usage and utilization. The computer crashes when I try to do "inpainting" with instructive details, almost every time. There's no console log because the screen simply turns black when the GPU crashes, and I have to restart. Sometimes I have to reinstall the graphics drivers because they got borked.

Could there be an option to run at like 80% max utilization and limit memory so it doesn't overflow the GPU and crash it?

Tectract commented 8 months ago

One interesting thing here is that when I run an image generation with Fooocus, the GPU memory utilization goes up near 100%. Then it STAYS there until I close Fooocus in the background. To me this indicates memory is leaking or not being released after an image is generated, potentially exacerbating memory-overrun issues...
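A hedged way to tell a genuine leak apart from allocator caching is to compare PyTorch's allocated vs. reserved memory counters after a generation. These counters exist on CUDA/ROCm builds of PyTorch, and as far as I know torch-directml exposes no equivalent, so the sketch below is illustrative rather than something that runs on the DirectML backend used here:

```python
import torch

# Illustrative sketch (CUDA/ROCm builds only; torch-directml has no such counters).
if torch.cuda.is_available():
    print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")  # tensors still referenced
    print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.1f} MiB")   # cached by the allocator

    # If "reserved" stays high while "allocated" drops after a generation,
    # the memory is cached rather than leaked; empty_cache() returns it to the driver.
    torch.cuda.empty_cache()
    print(f"reserved after empty_cache: {torch.cuda.memory_reserved() / 2**20:.1f} MiB")
```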

Tectract commented 8 months ago

--attention-split seems to be helping a little

eddyizm commented 8 months ago

Might be helpful if you post all your hardware specs for those who will be able to look into this a little deeper. I have seen a lot of issues with AMD GPUs.

mashb1t commented 8 months ago

Over the last 3 versions there were also many optimisations and bugfixes concerning AMD GPUs. Please make sure you're using the latest version of Fooocus, and provide the system specs as pointed out by @eddyizm, plus the console log up to the point it crashes (if possible). Thanks!

Tectract commented 8 months ago

My CPU is the AMD Ryzen Threadripper 3960X (24 cores), and my GPU is the AMD Radeon RX 6900 XT, so this is a beast of a machine, with 64 GB of RAM. I'm using this machine to do video game design with Unreal and Unity, so it's a $10k work machine, not a normal desktop.

I'm not getting any errors in the logs before it shuts down and the screen goes black. It is definitely a memory issue, not a heat issue. It looks like it's hitting a memory overrun / leak or a bad instruction, potentially within the driver routines; I'm not sure. I will investigate the Windows logs and see if I can find more for you.

It seems to happen particularly on inpainting, but I also had it happen on normal faceswaps. I just downloaded Fooocus from this repo link a couple of days ago, so I think I have the newest version...
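For the Windows-log check mentioned above, GPU driver resets (TDR events) usually land in the System event log under the "Display" source, commonly with event ID 4101; that ID and source name are my assumption about what to look for, not something confirmed in this thread. A rough sketch of pulling those entries with the `wmi` package:

```python
import wmi  # pip install wmi (requires pywin32)

# Assumption: display-driver resets are logged in the System log under the
# "Display" source, typically with EventCode 4101. Scanning the event log via
# WMI can be slow on large logs.
conn = wmi.WMI()
for event in conn.Win32_NTLogEvent(Logfile="System", SourceName="Display"):
    lines = (event.Message or "").splitlines()
    print(event.TimeGenerated, event.EventCode, lines[0] if lines else "")
```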

Tectract commented 8 months ago

Maybe I should try adding another 64GB of memory to my machine, lol

mashb1t commented 8 months ago

Can you please double-check in the browser tab title or the terminal output whether you're using 2.1.859? Do you use any args such as --always-high-vram etc.?

Tectract commented 8 months ago

Hmmmm, ok, I'm using 2.1.831, which I downloaded from the front-page link yesterday. I can try to upgrade to the newest version. I'm using the args --directml, --always-cpu and --split-attention.

I had to use --always-cpu as a workaround to get faceswapping to work without erroring out with a complaint about expecting everything to be on one device; I commented on an issue related to that... I tried --split-attention to try to help with the memory problem, but it doesn't really seem to be helping, I guess.

EDIT: ok, I see in the console when I run it now:

Fooocus version: 2.1.859

MetatronL commented 8 months ago

Isn't the --always-cpu flag going to force the work onto the CPU? If yes, then it's to be expected that it's that slow.

Tectract commented 8 months ago

No, it's definitely transferring the model to the GPU for rendering at some point in the process, and the GPU memory (16 GB) maxes out and it starts heating up as it works. But it's crashing way below max temps. I'm not seeing GPU shared memory (32 GB) being highly utilized, though, and maybe that's part of the problem. The CPU has a couple of cores chugging away, but it's not maxing out all 48 threads or anything.
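As a quick sanity check of what the DirectML backend actually sees, the torch-directml package can enumerate the card and place a tensor on it. The API names below are from that package, independent of Fooocus, and this is only a hedged debugging sketch:

```python
import torch
import torch_directml  # pip install torch-directml

# Does DirectML enumerate the GPU at all?
print("DirectML available:", torch_directml.is_available())
print("Device count:", torch_directml.device_count())
print("Device 0 name:", torch_directml.device_name(0))

# Place a small tensor on the DirectML device and confirm where it lives.
dml = torch_directml.device()
x = torch.randn(4, 4).to(dml)
print("Tensor device:", x.device)  # typically reported as "privateuseone:0"
```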

JarekDerp commented 8 months ago

I tried it on my machine, a 6700 XT 12GB + 32 GB of RAM, and it works fine. I run it using the Extreme Speed preset (which uses the LCM LoRA) and it only takes 8 steps. Image size 896×1152 | 7:9, two images selected (to create variations). It ran the first 4 steps on the GPU, then did some work on the CPU for a while, then the last 4 steps on the GPU again (the refiner, I guess).

On my machine, using "--split-attention" with directml used to give me some problems, so I'm using the default quad attention and not complaining. On my machine I'm only using

--directml --disable-xformers

and on my other machine that doesn't have a dedicated graphics card I'm using

--always-cpu

And all seems to work fine.

JarekDerp commented 8 months ago

when I run an image generation with Fooocus, the GPU memory utilization goes up near 100%. Then it STAYS there until I close Fooocus in the background.

That's normal with DirectML. As I understand it, it "reserves" that space in memory, so it doesn't let it go when switching models or when generation finishes. The only way to release it is to close down Python.

I'm not seeing GPU shared mem (32GB) being highly utilized though and maybe that's part of the problem.

That's also normal. DirectML uses either RAM or VRAM without touching the shared memory.

it shuts down and the screen goes black

Ahh, sorry. I thought you said that you're getting black images as output when running the inpainting. This definitely seems like a driver problem. Alternatively, maybe something is wrong with your PSU? Does it work fine under a stress test? Did you do any OC or undervolting? Are the drivers all up to date?

Tectract commented 8 months ago

The drivers are managed by AMD Adrenalin, and I have actually reinstalled them and BIOS-spec'ed the RAM to work properly with Unreal Engine recently. The CPU is overclocked 2%; I believe this was the result of an auto-optimization test within the BIOS. I guess I could see if there is a newer Windows AMD Adrenalin driver, but I suspect this problem has to do with a library implementation of a driver instruction set in one of the Python libs that just doesn't play well on the Threadripper / AMD GPU combo.

It seems to happen at random times, sometimes right when the image-gen algo starts and sometimes at like 80% completion, so I suspect it's a memory overrun thing, but I can't be sure. Sometimes it happens when I first start up the machine and haven't done any previous image gens. The PSU is an interesting thought; it's possible I'm just drawing too much juice and I should look into that.

This thread about forcing CPU processing of random Brownian seeds looked interesting. I am a decent Python programmer, but I just don't know enough to be of great help here.

I will try not using the --split-attention and instead try --disable-xformers and "ultra-fast" setting and see if it helps, and report back in a bit.

JarekDerp commented 8 months ago

Directml is a bit finicky and Microsoft hasn't made any updates to it in months. There's a ton of problems with it.

A couple of days ago I was trying to solve a problem with IP Adapter running on DirectML together with the developer, and DirectML just refused to work. I wrote some bad Python syntax and it started to work on DirectML, but then it wouldn't work on Nvidia cards... In the end we gave up.

I think the only good solution is to run SD on Linux, since it has ROCm support. It's about 3 to 6 times faster, last time I checked. I installed Ubuntu LTS on a pendrive and use it sporadically. The Windows Subsystem for Linux and a virtual machine didn't work, unfortunately, so you would have to boot your PC into Linux. There's hope we will get ROCm on Windows, but nobody knows when that will happen. Last time I checked, the progress looked promising.
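For anyone trying the Linux/ROCm route, a quick hedged check that the installed PyTorch is actually a ROCm build (ROCm builds expose the AMD GPU through the regular torch.cuda API and report a HIP version):

```python
import torch

# On a ROCm build of PyTorch, torch.version.hip is a version string and the AMD
# GPU shows up through the regular torch.cuda API; on CPU-only or CUDA builds,
# torch.version.hip is None.
print("HIP version:", torch.version.hip)
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```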

Tectract commented 8 months ago

My Linux workhorse machine is sort of reserved for processing financial datastreams for algo-trading, so I don't want to potentially disturb it, plus re-installing Linux drivers can be so painful if you bork them. Maybe I can try installing on an old laptop using a backup drive, lol.

Tectract commented 8 months ago

I'm upgrading my PSU and will report back if that cures my GPU "blackout" issue. It's 100% possible this beast of a machine is drawing over 1000W... Are there some good software tools for monitoring the state of the PSU?

mashb1t commented 8 months ago

@Tectract You can check wattage using https://openhardwaremonitor.org/, but I'd recommend getting either a smart plug or a power meter to put between the PC and the wall. That way you can at least notice spikes and see total power consumption.
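If OpenHardwareMonitor is running, the sensor values it publishes (including power draw, where the hardware reports it) can also be read programmatically through its WMI namespace. A rough sketch, assuming the `wmi` package is installed and that your CPU/GPU expose power sensors at all:

```python
import wmi  # pip install wmi

# OpenHardwareMonitor publishes its sensors under this WMI namespace while it runs.
ohm = wmi.WMI(namespace="root\\OpenHardwareMonitor")
for sensor in ohm.Sensor():
    if sensor.SensorType == "Power":
        print(f"{sensor.Name}: {sensor.Value:.1f} W")
```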

Tectract commented 8 months ago

Could I just run it in a Linux VM under Windows to take advantage of the ROCm implementation? Is there any way I can just limit the GPU to only 50% or something so it doesn't try to draw so much power?

mashb1t commented 8 months ago

AFAIK this is currently not possible. You can either use your CPU or your GPU, and you can only limit the amount of VRAM your GPU uses, not its processing power (consumption). Using another VM will most likely not make things more stable, but feel free to try. I'd rather suggest checking out the docker implementation in https://github.com/lllyasviel/Fooocus/pull/1418 than setting up a complete VM, if you want to go that route.
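For reference, the "limit VRAM but not processing power" point maps onto something like PyTorch's per-process memory fraction, which only exists on the CUDA/ROCm backend; as far as I know torch-directml has no comparable knob, so this is purely illustrative:

```python
import torch

# CUDA/ROCm only: cap this process at ~80% of the card's VRAM. Allocations past
# the cap raise an out-of-memory error instead of growing further.
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.8, device=0)
```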

Tectract commented 7 months ago

I took a closer look at the specs of my GPU and CPU and their power requirements, and I really don't think my machine is underpowered. The max draw for the CPU with all 24 cores at 100% is around 280W, the GPU is rated to draw 300W max, and I have a 1000W PSU, so it's almost double-rated for the maximum power I could be drawing here. And in reality I'm not using that much CPU here; maybe only a few cores are even close to maxing out.

So I'm starting to suspect the issue here is related to drivers and library implementations of GPU driver accessor calls, and potentially graphics memory overruns caused by the DirectML libraries on this CPU/GPU combination. That would cause the GPU to halt video output while the rest of the system is unaffected, which is the symptom I'm having, and it also explains why the graphics drivers would be borked upon restart, as the GPU's internal memory is getting corrupted when the memory overrun occurs.

mashb1t commented 6 months ago

@Tectract Is there anything you still need support with, and anything we can help you with? If not, I'd close this issue in the next few days.

Tectract commented 6 months ago

I would say it's still an undiagnosed and critical bug. I would leave it open until it can be investigated more and maybe reproduced or linked to other similar bugs, until it is fixed. But that's just how I run my repos, lol.

cezzarCz commented 6 months ago

I'm having a similar problem: when Fooocus indicates that the model is being moved to the GPU, my computer simply shuts down. Specs: RX 580 8GB, Xeon 2667 v4, 16 GB RAM, Windows 10.

Tectract commented 6 months ago

I think we should consider moving this bug upstream to the DirectML bug reporting system?

infinity0 commented 5 months ago

@Tectract is your computer entirely unresponsive or can you SSH into it to restart the X session? If the latter then maybe it's the same issue as #2656.

Tectract commented 5 months ago

The computer is still on, but the GPU crashed. The GPU's internal memory gets corrupted, requiring a driver reinstall after reboot. This is a Windows computer, not a Linux machine; I don't have SSH installed on it. It's likely a very similar issue, though. It looks like a memory overrun or bad memory instruction in the DirectML libs for this GPU architecture.

infinity0 commented 5 months ago

OK, well mine doesn't require a driver reinstall and the computer is otherwise totally fine after I simply restart the display server, no reboot required, the GPU can even continue generating more stuff with Fooocus. The details are different, so probably a different issue, but there's a small possibility it's the same issue.