lllyasviel / Fooocus

Focus on prompting and generating
GNU General Public License v3.0

[Bug]: amdgpu freeze resulting in GPU reset on large workloads #2656

Open infinity0 opened 3 months ago

infinity0 commented 3 months ago

What happened?

I understand that amdgpu support is experimental; however, I want to document this issue to guide others who run into it. My system specs:

Steps to reproduce the problem

When I ask Fooocus to do "too much", my display freezes, including keyboard and mouse, and it appears that I have to reboot the system. I later found that this is not necessary: I can log in via SSH and restart the display server, e.g. systemctl restart lightdm. I observe this in dmesg:

[Mar27 23:05] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=6436600, emitted seq=6436602
[  +0.000160] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 321623 thread Xorg:cs0 pid 321627
[  +0.000127] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[  +0.001546] amdgpu 0000:03:00.0: amdgpu: Guilty job already signaled, skipping HW reset
[  +0.000011] [drm] Skip scheduling IBs!
[  +0.000001] amdgpu 0000:03:00.0: amdgpu: GPU reset(10) succeeded!
[  +0.000005] [drm] Skip scheduling IBs!
[  +0.000004] [drm] Skip scheduling IBs!
[  +0.000002] [drm] Skip scheduling IBs!
[  +0.000003] [drm] Skip scheduling IBs!
[  +0.000002] [drm] Skip scheduling IBs!

Also, apparently there is something online called the "AMD GPU reset bug" - but my GPU does not seem to be affected by this in that I can trigger this bug many times, cause my screen to freeze, observe GPU reset(n) succeeded! via dmesg where n keeps going up by 2 each time, restart my display server via systemctl restart lightdm, and everything is fine afterwards, and I can start Fooocus again to do more stuff. In other words, this bug is not that bug.
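
To watch the reset counter without reading dmesg by eye, here is a minimal Python sketch that extracts n from every "GPU reset(n) succeeded!" line. It assumes the message format matches the log above exactly; the function name reset_counters is made up for illustration, and the sample text is copied from that log.

```python
import re

# Matches lines such as:
#   amdgpu 0000:03:00.0: amdgpu: GPU reset(10) succeeded!
# The pattern is taken from the dmesg output quoted above; other kernel
# versions may word this message differently (assumption).
RESET_OK = re.compile(r"GPU reset\((\d+)\) succeeded!")


def reset_counters(dmesg_text: str) -> list[int]:
    """Return the counter n from every 'GPU reset(n) succeeded!' line, in order."""
    return [int(m.group(1)) for m in RESET_OK.finditer(dmesg_text)]


sample = """\
[  +0.000127] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[  +0.001546] amdgpu 0000:03:00.0: amdgpu: Guilty job already signaled, skipping HW reset
[  +0.000001] amdgpu 0000:03:00.0: amdgpu: GPU reset(10) succeeded!
"""

print(reset_counters(sample))
```

In practice the text could come from the output of dmesg captured over SSH; comparing the last value before and after a freeze shows whether another reset happened.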

What is "too much"? For me, with 64 GB of RAM, it is normally something like running a Windows VM, watching an HD video, generating Upscale 2x with Performance = Quality on Fooocus, and running Upscayl, all at the same time. This is easy to avoid manually; I can just be careful when running Fooocus.

HOWEVER, you can also easily trigger it by giving Fooocus an input image that is quite big, even if the computer is doing nothing else. For example this one, 12 megapixels:

Causes GPU freeze, "Harvesting" oil painting by David Cox Jnr ![harvesting](https://github.com/lllyasviel/Fooocus/assets/78398/26be40d3-9f78-4140-b52d-3ec4f66eae87)

This is more annoying to avoid because sometimes you just want to drag and drop random shit from online into Fooocus and not have to worry about how big it is.
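
One possible manual workaround is to shrink big inputs before handing them to Fooocus. The sketch below only computes the target dimensions for a given pixel budget while keeping the aspect ratio; the function name fit_within_megapixels and the 3-megapixel budget are assumptions for illustration, not values known to be safe on any particular GPU.

```python
import math


def fit_within_megapixels(width: int, height: int, max_megapixels: float) -> tuple[int, int]:
    """Return (width, height) scaled down to fit the pixel budget.

    Images already within the budget are returned unchanged. Dimensions are
    rounded down so the result never exceeds the budget.
    """
    limit = max_megapixels * 1_000_000
    pixels = width * height
    if pixels <= limit:
        return width, height
    # Scaling both sides by sqrt(limit / pixels) scales the area by limit / pixels.
    scale = math.sqrt(limit / pixels)
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


# A 4000x3000 image is 12 megapixels, like the painting above.
# The 3.0 budget is a hypothetical value chosen for the example.
print(fit_within_megapixels(4000, 3000, 3.0))
```

The actual resize would still have to be done with an image tool of your choice before dropping the file into Fooocus.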

What should have happened?

Ideally, Fooocus should throw an exception in these cases, with something like "Out Of Memory" (or whatever the real reason is) rather than letting the GPU freeze up and reset. I'm not sure how feasible this is however.
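
As a rough illustration of what I mean, here is a sketch of a pre-flight check that rejects oversized inputs with an exception before any GPU work starts. This is not Fooocus code: InputTooLargeError, check_input_size and the 4-megapixel default are all hypothetical, and a fixed pixel limit is only a stand-in for whatever the real cause of the freeze turns out to be.

```python
class InputTooLargeError(ValueError):
    """Raised when an input image exceeds the configured pixel budget."""


# Hypothetical default; a real limit would depend on the GPU and the workload.
MAX_MEGAPIXELS = 4.0


def check_input_size(width: int, height: int, max_megapixels: float = MAX_MEGAPIXELS) -> float:
    """Return the image size in megapixels, or raise if it exceeds the budget."""
    megapixels = width * height / 1_000_000
    if megapixels > max_megapixels:
        raise InputTooLargeError(
            f"input is {megapixels:.1f} MP, limit is {max_megapixels:.1f} MP"
        )
    return megapixels


try:
    check_input_size(4000, 3000)  # 12 MP, like the example image
except InputTooLargeError as err:
    print(f"rejected: {err}")
```

The point is only that a clear error in the UI would be better than a frozen display, even if the limit is conservative.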

What browsers do you use to access Fooocus?

Google Chrome

Where are you running Fooocus?

Locally

What operating system are you using?

Debian GNU/Linux

Console logs

dmesg logs are above. As for Fooocus logs, in fact Fooocus itself does not notice the problem, and there are no logs. The screen freezes, but you can run Fooocus inside a tmux session and attach to it by logging in via SSH, to confirm that there are in fact no logs and no errors. Nothing is output on the Fooocus tmux console, even though dmesg says that the GPU has already been reset. You can even tell Fooocus to quit with Ctrl-C after this, and it will tell you it's trying to exit, but this won't succeed and it just hangs there until you restart your display server.

Additional information

No response

infinity0 commented 3 months ago

> Also, apparently there is something online called the "AMD GPU reset bug" - but my GPU does not seem to be affected by this in that I can trigger this bug many times, cause my screen to freeze, observe GPU reset(n) succeeded! via dmesg where n keeps going up by 2 each time, restart my display server via systemctl restart lightdm, and everything is fine afterwards, and I can start Fooocus again to do more stuff. In other words, this bug is not that bug.

Well, occasionally the GPU fails to reset, and then I do have to reboot the machine. So perhaps I'm also affected by the reset bug. This is rare, however; most of the time I can simply restart the display server without rebooting.

[Mar28 15:09] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[  +0.000145] amdgpu: failed to remove hardware queue from MES, doorbell=0x1802
[  +0.000002] amdgpu: MES might be in unrecoverable state, issue a GPU reset
[  +0.000003] amdgpu: Failed to evict queue 1
[  +0.000001] amdgpu: Failed to evict process queues
[  +0.000002] amdgpu: Failed to evict queues of pasid 0x8009
[  +0.000019] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[ .. hangs here, we don't get "GPU reset(n) succeeded!" as above ..]
[ .. attempts to restart the display server hang, instead of succeeding as above .. ]

Anyway, it's clear that these are two separate issues.
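
To tell the two cases apart from an SSH session, a small sketch like the following may help. It looks at the most recent "GPU reset begin!" message and checks whether a "GPU reset(n) succeeded!" message follows it. The function name reset_outcome and the labels it returns are invented for this example, and "hung" can also just mean the reset has not finished yet.

```python
import re


def reset_outcome(dmesg_text: str) -> str:
    """Classify the most recent GPU reset seen in a dmesg excerpt.

    Returns "none" if no reset was started, "recovered" if the last
    "GPU reset begin!" is followed by "GPU reset(n) succeeded!", and
    "hung" otherwise (no success message after the last begin, so far).
    """
    last_begin = dmesg_text.rfind("GPU reset begin!")
    if last_begin == -1:
        return "none"
    tail = dmesg_text[last_begin:]
    if re.search(r"GPU reset\(\d+\) succeeded!", tail):
        return "recovered"
    return "hung"


recovered_log = """\
[  +0.000127] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[  +0.000001] amdgpu 0000:03:00.0: amdgpu: GPU reset(10) succeeded!
"""
hung_log = """\
[  +0.000002] amdgpu: MES might be in unrecoverable state, issue a GPU reset
[  +0.000019] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
"""

print(reset_outcome(recovered_log), reset_outcome(hung_log))
```

If the result is "recovered", restarting the display server should be enough; if it stays "hung", a reboot is probably needed.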

mashb1t commented 3 months ago

Have you searched for this issue in all of the other open discussions/issues for AMD? https://github.com/lllyasviel/Fooocus/issues?q=is%3Aissue+is%3Aopen+amd

Seems to be a duplicate of https://github.com/lllyasviel/Fooocus/issues/1690, please check this out.

infinity0 commented 3 months ago

As I explained in great detail both in that ticket and this ticket, this ticket is not a duplicate of that ticket. Please re-open.

mashb1t commented 3 months ago

I'm sorry to say that I personally can't help you to debug and get to the bottom of the issue as I don't have access to an AMD GPU. Hopefully the community can support here.

infinity0 commented 3 months ago

No problem, I am not expecting an easy fix soon - the ticket is more for documentation purposes and to help others, the important thing being you don't need to reboot if you can SSH in.

JohnAndrewsX commented 1 month ago

I am experiencing the exact same problem as infinity0. While performing simple tasks, everything works perfectly. However, when I use a larger input image, I encounter the exact same issue described by the original poster.

My Setup:

- OS: Garuda Linux BirdOfPrey Soaring x86_64
- GPU: ASRock Radeon RX 7800 XT Phantom Gaming OC
- CPU: AMD Ryzen 9 7900X3D
- RAM: 48GB DDR8000
- Swap: 16GB
- GLX version: Mesa 24.0.7-arch1.3
- Browser: Brave, latest version
- Fooocus Installation: Locally in a Fedora 39 container via Distrobox.

abclution commented 1 month ago

In my experience, updating the Linux kernel amdgpu firmware (from kernel source) helps with the reset bug. I used to get it all the time with the firmware included in Debian 12 + Proxmox, without even doing any AI stuff.

https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/

It's a bit tricky because you want to be running pretty recent kernels as well.

But the other bug, where the desktop becomes unresponsive, happens to me on all the AI programs. I can always tell it's going to happen because my mouse starts stuttering hard as a warning and then dies completely. I do believe that the memory used for the desktop gets overrun, causing it to die; as in the other case, I usually remote in via SSH and reboot the machine.

In fact, there is a very new version of the firmware that I have yet to install.
