lllyasviel / stable-diffusion-webui-forge


It seems that the refiner has a memory bug #2308

Closed lvyonghuan closed 1 week ago

lvyonghuan commented 1 week ago

I noticed that the refiner was re-enabled in the latest commit. But when I enable the refiner, I often get operating-system-level errors: Python simply stops running. When I stop using the refiner, the errors stop happening.

This error happens intermittently, but I'm almost certain it is caused by enabling the refiner. With the refiner on, the error occurred two or three times within half an hour; after I turned the refiner off, it did not occur again for three hours.

https://github.com/lllyasviel/stable-diffusion-webui-forge/pull/2192


It looks like I've found the problem.

https://github.com/lllyasviel/stable-diffusion-webui-forge/issues/2308#issuecomment-2470536030

s4130 commented 1 week ago

For me, switching models causes the virtual memory to keep increasing. Could you observe this?

lvyonghuan commented 1 week ago

For me, switching models causes the virtual memory to keep increasing. Could you observe this?

I don't think the problem is caused by virtual memory. I remember one occasion with the refiner enabled when I had just clicked the generate button (I had not switched models at that point): the screen suddenly went black twice, and then a system-level error message popped up saying that Python had stopped running.

I'll try to reproduce this issue later.

lvyonghuan commented 1 week ago

For me, switching models causes the virtual memory to keep increasing. Could you observe this?

Maybe you are right. There does seem to be some issue here regarding virtual memory.

Previously, I had interrupted generation directly whenever I got a few unsatisfactory compositions. This time I interrupted it and watched the virtual memory: the committed memory of the Python process kept spiraling upward instead of going down. I wonder whether the memory occupied by the refiner model is not being released properly when generation is interrupted.

I'm still testing and don't yet know whether this is the cause. I've only recently started learning how operating systems manage memory.
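
(For anyone who wants to watch the same thing: the WebUI process's committed memory can be logged from a separate terminal with psutil. This is just a monitoring sketch added for illustration, not part of Forge; on Windows, psutil reports the pagefile-backed commit as `vms`.)

```python
# Monitoring sketch (not part of Forge): log a process's committed memory.
# On Windows, psutil's vms field matches the pagefile-backed "Commit size"
# shown in Task Manager, which is the number that keeps climbing here.
import sys
import time

import psutil


def watch_commit(pid: int, interval: float = 5.0) -> None:
    proc = psutil.Process(pid)
    while True:
        mem = proc.memory_info()
        print(f"commit={mem.vms / 2**20:.0f} MiB  rss={mem.rss / 2**20:.0f} MiB")
        time.sleep(interval)


if __name__ == "__main__":
    watch_commit(int(sys.argv[1]))  # pass the WebUI python process's PID
```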

lvyonghuan commented 1 week ago

It does seem to be a memory-related issue.

I ran the program over and over. Sometimes generation was interrupted, and sometimes it was allowed to complete normally. Either way, committed memory continued to grow.

Eventually an error was reported and the program crashed.

[screenshots omitted]

erew123 commented 1 week ago

There might be a wider memory issue. I was using X/Y/Z plot to load different checkpoints for the same prompt, with no refiner in use. This has happened once or twice over the last week or so, but I haven't been using Forge much anyway.

While using X/Y/Z to load different checkpoints, I've had my Linux/Ubuntu NVIDIA driver crash and throw me back to the login screen. It doesn't happen on the first run with multiple checkpoints (about 6 checkpoints, all SD 1.5), but maybe on the 3rd to 5th run. FYI, I am also generating at 512x768 with a 1.5x hi-res upscale using Euler a.

So I'm just throwing this into the pot as something I've seen that may or may not be related.

[screenshot omitted]

The only issue I can spot in the logs, however, is:

[drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to grab modeset ownership
Nov 12 14:06:56 ubuntuuser-desktop kernel: 

This seems to be an NVIDIA Linux driver issue! Who knows whether there is a memory leak in that, or whether it's even related. Obviously Windows will behave differently.

GPT claims:

"Failed to grab modeset ownership" Errors:

These errors from nvidia_drm (NVIDIA’s Direct Rendering Manager) indicate the driver is failing to control display modes (resolutions, refresh rates, etc.). This may be caused by an issue with the NVIDIA drivers or with how the GPU is interacting with the display server (Xorg or Wayland).

This issue can sometimes lead to display crashes or being logged out unexpectedly, as the display server fails to maintain a stable connection to the GPU.

shaun-ba commented 1 week ago

Does the refiner work with Flux? What is its purpose exactly?

lvyonghuan commented 1 week ago

Does the refiner work with Flux? What is its purpose exactly?

I haven't tried it. The refiner is a concept introduced by SDXL, so it may not work with Flux. You can give it a try.

As far as I understand, the txt2img result is run through img2img using the refiner model, which gives better detail than plain txt2img.
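
(For a concrete picture of that two-pass idea, here is the documented SDXL base + refiner workflow from the diffusers library; it is conceptually the same thing Forge's refiner option does, though Forge's own code path is different.)

```python
# Conceptual sketch of "txt2img with the base model, then img2img with the
# refiner", using diffusers' documented SDXL workflow (not Forge's code path).
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline, StableDiffusionXLPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of a cat wearing a spacesuit"

# Base model handles the first 80% of denoising and hands over latents...
latents = base(prompt=prompt, denoising_end=0.8, output_type="latent").images
# ...and the refiner finishes the last 20%, sharpening fine detail.
image = refiner(prompt=prompt, image=latents, denoising_start=0.8).images[0]
image.save("refined.png")
```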

I just started using it and haven't found any tricks yet.

s4130 commented 1 week ago

Does the refiner work with Flux? What is its purpose exactly?

If you start with an anime model and then choose a realistic model in the refiner, it can achieve effects that a realistic model alone cannot reach. Additionally, starting with a more 'obedient' model and then layering it with your preferred model may yield better results.

DenOfEquity commented 1 week ago

Please take a look at #2315. I'd like the method tested on a range of setups before merging; but for me it prevents Committed memory blowing up, and has no downsides. The relevant change is only 3 (+1 comment) lines added to one file (backend/memory_management.py), so easy to manually add.
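
(I haven't copied the actual diff from #2315 here, but for anyone wondering what this kind of fix generally looks like: committed memory that keeps growing after model unloads is usually tackled by explicitly asking Python's garbage collector and PyTorch's caching allocators to give memory back, something like the generic pattern below. See the PR itself for the real change.)

```python
# Generic cleanup pattern, NOT the actual change from PR #2315 -- see the PR
# for the real three-line fix in backend/memory_management.py.
import gc

import torch


def release_cached_memory() -> None:
    gc.collect()                  # free unreachable Python objects (old modules, state dicts)
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached, unused VRAM blocks to the driver
        torch.cuda.ipc_collect()  # release CUDA IPC handles held for shared tensors
```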

s4130 commented 1 week ago

Please take a look at #2315. I'd like the method tested on a range of setups before merging; but for me it prevents Committed memory blowing up, and has no downsides. The relevant change is only 3 (+1 comment) lines added to one file (backend/memory_management.py), so easy to manually add.

Thank you, this has been really helpful and has solved a problem that's been bothering me for months.

lvyonghuan commented 1 week ago

Please take a look at #2315. I'd like the method tested on a range of setups before merging; but for me it prevents Committed memory blowing up, and has no downsides. The relevant change is only 3 (+1 comment) lines added to one file (backend/memory_management.py), so easy to manually add.

But it suddenly occurred to me: could the refiner be repeatedly loading models? I'm using the same set of models, yet the committed memory footprint keeps increasing.

I'm new to memory management so I'm sorry if what I said is wrong.


I also noticed that if I don't use the refiner and just use the normal model, the committed memory footprint always stays at a low level. So there does seem to be a problem here. Is it the refiner mechanism itself, or the logic of the code? Is it unavoidable, or can it be fixed?

erew123 commented 1 week ago

I wonder if it's Gradio caching images into GPU VRAM that produces what looks like a memory leak.

DenOfEquity commented 1 week ago

The refiner loads the second pass model by calling the same function that is used when a model is normally loaded. The process is the same. The difference is the starting conditions - a different model already being used, inference in progress. In your case, you have enough VRAM that the problem doesn't get triggered with normal generation. But the extra demands of the refiner do expose it. Users with more VRAM might never see the issue; users with less could run into it every time they change model.
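
(A toy illustration of that description, with made-up names that do not match Forge's real backend/memory_management.py: the refiner's second pass goes through the same eviction-then-load path as a normal model change, so if evicted weights are never actually released, the commit charge ratchets up on every base/refiner switch.)

```python
# Toy illustration of the shared loading path described above; all names are
# made up and do not match Forge's real backend/memory_management.py.
from dataclasses import dataclass, field


@dataclass
class FakeModel:
    name: str
    size_gb: float


@dataclass
class GpuPool:
    capacity_gb: float
    resident: list = field(default_factory=list)

    def free_gb(self) -> float:
        return self.capacity_gb - sum(m.size_gb for m in self.resident)

    def load(self, model: FakeModel) -> None:
        # The same path is hit by a normal generation and by the refiner's
        # second pass; only the starting conditions differ (another model is
        # already resident and inference is in progress).
        while self.resident and self.free_gb() < model.size_gb:
            evicted = self.resident.pop(0)
            print(f"evicting {evicted.name} back to system RAM")
            # If the evicted weights' RAM is never actually released, the
            # process's committed memory ratchets up on every switch.
        self.resident.append(model)
        print(f"loaded {model.name}, {self.free_gb():.1f} GB free")


pool = GpuPool(capacity_gb=8.0)
pool.load(FakeModel("base", 6.5))     # plain txt2img: fits on its own
pool.load(FakeModel("refiner", 5.0))  # refiner pass: forces an eviction mid-run
```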

lvyonghuan commented 1 week ago

The refiner loads the second pass model by calling the same function that is used when a model is normally loaded. The process is the same. The difference is the starting conditions - a different model already being used, inference in progress. In your case, you have enough VRAM that the problem doesn't get triggered with normal generation. But the extra demands of the refiner do expose it. Users with more VRAM might never see the issue; users with less could run into it every time they change model.

So to solve this problem, do we need a dedicated loading function for the refiner model? The refiner seems to be quite commonly used.

lvyonghuan commented 1 week ago

Problem solved by https://github.com/lllyasviel/stable-diffusion-webui-forge/commit/19a9a78c9bb54d13a8f291f43939331af3242e8a. Thanks!