lshqqytiger / stable-diffusion-webui-amdgpu

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: VRAM memory leak #267

Closed · cactus24556 closed this issue 9 months ago

cactus24556 commented 1 year ago

Is there an existing issue for this?

What happened?

I launch the webui from webui-user.bat and it starts up normally, except that once the webui is open in my browser I notice my VRAM is already filled to about 5GB out of my 8GB. It seems to happen while the model is being loaded. So when I try to generate an image, it runs out of VRAM partway through and the generation stops. This even happens on a fresh install of the webui using the default model.

Also, after the failed generation attempt the VRAM is not released at all. It stays completely full until I shut down the program.

Another issue is that in the webui I am unable to change which model is used. When I pick another model it shows that it is being processed, but the switch fails; it also consumes more VRAM and releases none of it when it finishes.

Steps to reproduce the problem

  1. Start webui with webui-user.bat
  2. Look at Task Manager and see that VRAM is already half used.

What should have happened?

I should have enough VRAM to generate a 512x512 image without running out of VRAM. My VRAM should be clear if I'm not generating anything.

Sysinfo

sysinfo-2023-09-08-20-11.txt

What browsers do you use to access the UI?

Mozilla Firefox

Console logs

venv "C:\stable-diffusion-webui-directml\venv\Scripts\Python.exe"
fatal: No names found, cannot describe anything.
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: 1.6.0
Commit hash: 92849df26f73b416d396b95c3fb8c64070fe3ad8
Launching Web UI with arguments:
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
Loading weights [b4c1d10a2d] from C:\stable-diffusion-webui-directml\models\Stable-diffusion\perfectWorld_v5Baked.safetensors
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Creating model from config: C:\stable-diffusion-webui-directml\configs\v1-inference.yaml
Startup time: 7.5s (prepare environment: 0.2s, import torch: 2.7s, import gradio: 0.7s, setup paths: 0.7s, initialize shared: 1.0s, other imports: 0.3s, load scripts: 0.9s, create ui: 0.3s, gradio launch: 0.6s).
Applying attention optimization: InvokeAI... done.
Model loaded in 6.7s (load weights from disk: 0.5s, create model: 0.5s, apply weights to model: 4.9s, apply half(): 0.5s, calculate empty prompt: 0.2s).

Additional information

No response

lshqqytiger commented 1 year ago

torch-directml can't release/collect/empty memory because its tensor implementation inherits from OpaqueTensorImpl, which cannot have storage.

import torch
import torch_directml

device = torch_directml.device()

torch.tensor(1.0).storage() # [torch.storage.TypedStorage(dtype=torch.float32, device=cpu) of size 1]
torch.tensor(1.0).to(device).storage() # NotImplementedError: Cannot access storage of OpaqueTensorImpl

But onnxruntime-directml does partially release memory. If you want that behavior, you can use the Olive+ONNX implementation instead.
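
For illustration only, here is a minimal onnxruntime-directml sketch (this is not the repository's actual --onnx code path; the "model.onnx" path and the float32 input are placeholder assumptions):

```python
# Rough sketch: run an ONNX model on the DirectML execution provider.
# "model.onnx" is a placeholder for an exported (ideally Olive-optimized) model,
# and a float32 first input is assumed.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["DmlExecutionProvider"])

# Build a dummy input from the model's first declared input (dynamic dims -> 1).
inp = sess.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
outputs = sess.run(None, {inp.name: np.zeros(shape, dtype=np.float32)})

# Unlike torch-directml tensors, dropping the session lets its DirectML
# allocations be released.
del sess
```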

cactus24556 commented 1 year ago

Thanks for the response. I'm not sure I want to commit to Olive+ONNX, since I would have to optimize all the models I use, when a few commits ago I could generate images just fine without any modifications.

cactus24556 commented 1 year ago

I managed to generate images by using --medvram --medvram-sdxl --opt-split-attention --no-half-vae --disable-nan-check. I guess that's the trick. Even when all the VRAM is full, I can generate an image right after finishing another one.
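
For reference, a webui-user.bat using those flags would look roughly like this (a sketch following the stock template layout of the repository):

```bat
@echo off

set PYTHON=
set GIT=
set VENV_DIR=
rem Workaround flags from the comment above
set COMMANDLINE_ARGS=--medvram --medvram-sdxl --opt-split-attention --no-half-vae --disable-nan-check

call webui.bat
```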

Lamer217 commented 1 year ago

I confirm that I encounter the exact same issue, though using the provided flags doesn't help in my case. My setup is:

The webui manages to generate the first picture for me too. During that process the VRAM gradually fills to 97% in about 15 seconds and never decreases afterwards. There is always a slight stall of about 10 seconds at exactly 50% of the generation, probably due to the upscaler. After the first picture is generated, the next generation unconditionally fails at a random point, but always in the first half. So at this point it's essentially a single-use tool: after the first generation it has to be reloaded quite literally, by stopping and relaunching the script.

Here's the terminal output:

```
venv "C:\Users\aport\Codebase\stable-diffusion-webui-directml\venv\Scripts\Python.exe"
fatal: No names found, cannot describe anything.
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: 1.6.0
Commit hash: e9afd9aed55da48dfc917753e2daa114a515a85b
Launching Web UI with arguments: --medvram --medvram-sdxl --opt-split-attention --no-half-vae --disable-nan-check
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
2023-09-13 20:10:24,201 - ControlNet - INFO - ControlNet v1.1.410
ControlNet preprocessor location: C:\Users\aport\Codebase\stable-diffusion-webui-directml\extensions\sd-webui-controlnet\annotator\downloads
2023-09-13 20:10:24,279 - ControlNet - INFO - ControlNet v1.1.410
Loading weights [6ce0161689] from C:\Users\aport\Codebase\stable-diffusion-webui-directml\models\Stable-diffusion\v1-5-pruned-emaonly.safetensors
Running on local URL: http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Creating model from config: C:\Users\aport\Codebase\stable-diffusion-webui-directml\configs\v1-inference.yaml
Startup time: 7.9s (prepare environment: 0.5s, import torch: 2.7s, import gradio: 0.7s, setup paths: 0.7s, initialize shared: 1.0s, other imports: 0.3s, load scripts: 1.0s, create ui: 0.4s, gradio launch: 0.5s).
Applying attention optimization: Doggettx... done.
Model loaded in 5.3s (load weights from disk: 0.7s, create model: 0.3s, apply weights to model: 3.5s, apply half(): 0.4s, calculate empty prompt: 0.3s).
2023-09-13 20:10:40,732 - ControlNet - INFO - Loading model: control_v1p_sd15_qrcode_monster [a6e58995]
2023-09-13 20:10:40,771 - ControlNet - INFO - Loaded state_dict from [C:\Users\aport\Codebase\stable-diffusion-webui-directml\models\ControlNet\control_v1p_sd15_qrcode_monster.safetensors]
2023-09-13 20:10:40,772 - ControlNet - INFO - controlnet_default_config
2023-09-13 20:10:42,777 - ControlNet - INFO - ControlNet model control_v1p_sd15_qrcode_monster [a6e58995] loaded.
2023-09-13 20:10:42,837 - ControlNet - INFO - Loading preprocessor: invert
2023-09-13 20:10:42,837 - ControlNet - INFO - preprocessor resolution = 512
2023-09-13 20:10:43,058 - ControlNet - INFO - ControlNet Hooked - Time = 2.477725028991699
100%|██████████████████████████████████████████████████████████████████| 25/25 [00:22<00:00,  1.09it/s]
100%|██████████████████████████████████████████████████████████████████| 25/25 [01:56<00:00,  4.66s/it]
Total progress: 100%|██████████████████████████████████████████████████| 50/50 [02:22<00:00,  2.85s/it]
2023-09-13 20:17:01,318 - ControlNet - INFO - Loading model from cache: control_v1p_sd15_qrcode_monster [a6e58995]
2023-09-13 20:17:01,319 - ControlNet - INFO - Loading preprocessor: invert
2023-09-13 20:17:01,319 - ControlNet - INFO - preprocessor resolution = 512
2023-09-13 20:17:01,383 - ControlNet - INFO - ControlNet Hooked - Time = 0.06499814987182617
100%|██████████████████████████████████████████████████████████████████| 25/25 [00:12<00:00,  1.99it/s]
Tile 1/9
Total progress:  50%|█████████████████████████▌                        | 25/50 [00:12<00:09,  2.55it/s]
Tile 2/9
Tile 3/9
Tile 4/9
Tile 5/9
Tile 6/9
Tile 7/9
Tile 8/9
Tile 9/9
  0%|                                                                  | 0/25 [00:00, 0, False, '', 0.8, 3002328134, False, -1, 0, 0, 0, , , , False, False, 'positive', 'comma', 0, False, False, '', 1, '', [], 0, '', [], 0, '', [], True, False, False, False, 0, False, None, None, False, None, None, False, None, None, False, 50) {}
Traceback (most recent call last):
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\modules\call_queue.py", line 57, in f
    res = list(func(*args, **kwargs))
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\modules\call_queue.py", line 36, in f
    res = func(*args, **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\modules\txt2img.py", line 64, in txt2img
    processed = processing.process_images(p)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\modules\processing.py", line 733, in process_images
    res = process_images_inner(p)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\extensions\sd-webui-controlnet\scripts\batch_hijack.py", line 42, in processing_process_images_hijack
    return getattr(processing, '__controlnet_original_process_images_inner')(p, *args, **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\modules\processing.py", line 871, in process_images_inner
    samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\extensions\sd-webui-controlnet\scripts\hook.py", line 451, in process_sample
    return process.sample_before_CN_hack(*args, **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\modules\processing.py", line 1160, in sample
    return self.sample_hr_pass(samples, decoded_samples, seeds, subseeds, subseed_strength, prompts)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\modules\processing.py", line 1246, in sample_hr_pass
    samples = self.sampler.sample_img2img(self, samples, noise, self.hr_c, self.hr_uc, steps=self.hr_second_pass_steps or self.steps, image_conditioning=image_conditioning)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\modules\sd_samplers_kdiffusion.py", line 191, in sample_img2img
    samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\modules\sd_samplers_common.py", line 261, in launch_sampling
    return func()
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\modules\sd_samplers_kdiffusion.py", line 191, in <lambda>
    samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\repositories\k-diffusion\k_diffusion\sampling.py", line 594, in sample_dpmpp_2m
    denoised = model(x, sigmas[i] * s_in, **extra_args)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\modules\sd_samplers_cfg_denoiser.py", line 169, in forward
    x_out = self.inner_model(x_in, sigma_in, cond=make_condition_dict(cond_in, image_cond_in))
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\repositories\k-diffusion\k_diffusion\external.py", line 112, in forward
    eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\repositories\k-diffusion\k_diffusion\external.py", line 138, in get_eps
    return self.inner_model.apply_model(*args, **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\modules\sd_hijack_utils.py", line 17, in <lambda>
    setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: self(*args, **kwargs))
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\modules\sd_hijack_utils.py", line 28, in __call__
    return self.__orig_func(*args, **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 858, in apply_model
    x_recon = self.model(x_noisy, t, **cond)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\venv\lib\site-packages\torch\nn\modules\module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 1335, in forward
    out = self.diffusion_model(x, t, context=cc)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\extensions\sd-webui-controlnet\scripts\hook.py", line 858, in forward_webui
    raise e
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\extensions\sd-webui-controlnet\scripts\hook.py", line 855, in forward_webui
    return forward(*args, **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\extensions\sd-webui-controlnet\scripts\hook.py", line 592, in forward
    control = param.control_model(x=x_in, hint=hint, timesteps=timesteps, context=context, y=y)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\extensions\sd-webui-controlnet\scripts\cldm.py", line 31, in forward
    return self.control_model(*args, **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\extensions\sd-webui-controlnet\scripts\cldm.py", line 314, in forward
    h = module(h, emb, context)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\repositories\generative-models\sgm\modules\diffusionmodules\openaimodel.py", line 100, in forward
    x = layer(x, context)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\repositories\generative-models\sgm\modules\attention.py", line 627, in forward
    x = block(x, context=context[i])
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\repositories\generative-models\sgm\modules\attention.py", line 459, in forward
    return checkpoint(
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\repositories\generative-models\sgm\modules\diffusionmodules\util.py", line 167, in checkpoint
    return func(*inputs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\repositories\generative-models\sgm\modules\attention.py", line 467, in _forward
    self.attn1(
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\aport\Codebase\stable-diffusion-webui-directml\modules\sd_hijack_optimizations.py", line 262, in split_cross_attention_forward
    raise RuntimeError(f'Not enough memory, use lower resolution (max approx. {max_res}x{max_res}). '
RuntimeError: Not enough memory, use lower resolution (max approx. 896x896). Need: 0.9GB free, Have:0.5GB free
```

Does this problem happen only on AMD cards? Is there any estimate for a fix? (I understand that it's related to a different project.)

GeneralVincent1 commented 1 year ago

Yes, this is something AMD users have to find workarounds for, while Nvidia users generally have an easier time, from what I understand.

Here is a great discussion about ways to avoid running out of VRAM and having it crash after running once.

lshqqytiger commented 1 year ago

You should know: this does not happen only on AMD. It will happen on ANY GPU that uses torch-directml. Ideally we would use ROCm, a low-level toolkit like CUDA, but we use DirectML because ROCm didn't support Windows (now it partially does). That by itself should not cause a memory issue, but torch-directml, a plugin library for PyTorch, inherits OpaqueTensorImpl, which makes it impossible to track a tensor's memory. I don't know whether its developers are interested in fixing this in torch-directml, but if you want to avoid the memory issue right now, you should either use onnxruntime-directml (enabled with --onnx in this repository, though it will be slow if you don't optimize your models with Olive) or wait until PyTorch has full ROCm support.
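
As a side note, here is a quick sketch for checking whether a ROCm (HIP) build of PyTorch is actually in use; ROCm builds report themselves through the regular torch.cuda API:

```python
# Sketch: detect whether the installed PyTorch is a ROCm (HIP) build.
# On a working Linux + ROCm setup, torch.cuda.is_available() is True and
# torch.version.hip is set; with torch-directml the GPU is exposed as a
# separate DirectML device instead.
import torch

print("CUDA/ROCm available:", torch.cuda.is_available())
print("HIP runtime:", getattr(torch.version, "hip", None))  # None on non-ROCm builds
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```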

Lamer217 commented 1 year ago

Thanks for such a detailed reply, @lshqqytiger. From this I see my options are the following:

I'll try the Linux and ROCm option first, as it seems like the onnxruntime-directml solution is not that well established yet.

lshqqytiger commented 1 year ago

The Linux + ROCm option will be the best one if you are familiar with the Linux environment. DirectML is the secondary option for those who are not.

johiny commented 1 year ago

The Linux + ROCm option will be the best one if you are familiar with the Linux environment. DirectML is the secondary option for those who are not.

I have a doubt: would the ROCm setup be possible with WSL? I have been reading that WSL can access the GPU, but I don't know whether it is compatible with ROCm or whether someone is working on that: https://learn.microsoft.com/en-us/windows/wsl/tutorials/gpu-compute

lshqqytiger commented 1 year ago

ROCm on WSL cannot find the GPU. I think WSL exposes the GPU as a virtual device? (like a VM)