lllyasviel / stable-diffusion-webui-forge


FLUX.1 Dev NF4: torch.cuda.OutOfMemoryError: CUDA out of memory with LoRA #1202

[Open] nailz420 opened this issue 3 weeks ago

nailz420 commented 3 weeks ago

OOM (16 GB VRAM)

version: f2.0.1v1.10.1-previous-304-g394da019  •  python: 3.10.6  •  torch: 2.3.1+cu121  •  xformers: N/A  •  gradio: 4.40.0  •  checkpoint: a8038adff1

Begin to load 1 model
[Unload] Trying to free 9411.13 MB for cuda:0 with 0 models keep loaded ...
[Unload] Current free memory is 9773.37 MB ...
[Memory Management] Current Free GPU Memory: 9773.37 MB
[Memory Management] Required Model Memory: 6246.84 MB
[Memory Management] Required Inference Memory: 1024.00 MB
[Memory Management] Estimated Remaining GPU Memory: 2502.53 MB
Patching LoRAs:  43% | 130/304 [00:06<00:08, 19.72it/s]
ERROR lora diffusion_model.double_blocks.13.img_mod.lin.weight CUDA out of memory. Tried to allocate 216.00 MiB. GPU
Patching LoRAs:  44% | 135/304 [00:06<00:12, 13.30it/s]
ERROR lora diffusion_model.double_blocks.13.txt_mod.lin.weight CUDA out of memory. Tried to allocate 216.00 MiB. GPU
ERROR lora diffusion_model.double_blocks.13.txt_attn.qkv.weight CUDA out of memory. Tried to allocate 108.00 MiB. GPU
Patching LoRAs:  45% | 137/304 [00:06<00:11, 14.01it/s]
ERROR lora diffusion_model.double_blocks.13.txt_mlp.0.weight CUDA out of memory. Tried to allocate 144.00 MiB. GPU
Patching LoRAs:  46% | 139/304 [00:06<00:08, 19.95it/s]
Traceback (most recent call last):
  File "F:\projects\AI\webui_forge_cu121_torch231\webui\modules_forge\main_thread.py", line 30, in work
    self.result = self.func(*self.args, **self.kwargs)
  File "F:\projects\AI\webui_forge_cu121_torch231\webui\modules\txt2img.py", line 110, in txt2img_function
    processed = processing.process_images(p)
  File "F:\projects\AI\webui_forge_cu121_torch231\webui\modules\processing.py", line 809, in process_images
    res = process_images_inner(p)
  File "F:\projects\AI\webui_forge_cu121_torch231\webui\modules\processing.py", line 952, in process_images_inner
    samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
  File "F:\projects\AI\webui_forge_cu121_torch231\webui\modules\processing.py", line 1323, in sample
    samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
  File "F:\projects\AI\webui_forge_cu121_torch231\webui\modules\sd_samplers_kdiffusion.py", line 194, in sample
    sampling_prepare(self.model_wrap.inner_model.forge_objects.unet, x=x)
  File "F:\projects\AI\webui_forge_cu121_torch231\webui\backend\sampling\sampling_function.py", line 356, in sampling_prepare
    memory_management.load_models_gpu(
  File "F:\projects\AI\webui_forge_cu121_torch231\webui\backend\memory_management.py", line 575, in load_models_gpu
    loaded_model.model_load(model_gpu_memory_when_using_cpu_swap)
  File "F:\projects\AI\webui_forge_cu121_torch231\webui\backend\memory_management.py", line 384, in model_load
    raise e
  File "F:\projects\AI\webui_forge_cu121_torch231\webui\backend\memory_management.py", line 380, in model_load
    self.real_model = self.model.forge_patch_model(patch_model_to)
  File "F:\projects\AI\webui_forge_cu121_torch231\webui\backend\patcher\base.py", line 228, in forge_patch_model
    self.lora_loader.refresh(target_device=target_device, offload_device=self.offload_device)
  File "F:\projects\AI\webui_forge_cu121_torch231\webui\backend\patcher\lora.py", line 352, in refresh
    weight = weight.to(dtype=torch.float32)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 144.00 MiB. GPU
CUDA out of memory. Tried to allocate 144.00 MiB. GPU
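For context on that last frame: `weight = weight.to(dtype=torch.float32)` in `backend/patcher/lora.py` allocates a fresh fp32 copy of each base weight on the GPU before the LoRA delta is applied, and the sizes match the log exactly. `img_mod.lin` in a FLUX double block is an 18432×3072 linear, and 18432 × 3072 × 4 bytes = 216 MiB, the failed allocation above. A minimal sketch of such a merge step (a hypothetical helper, not Forge's actual code; the `compute_on_cpu` workaround is an assumption):

```python
# Sketch of standard LoRA patching, W' = W + alpha * (up @ down).
# The fp32 upcast is the allocation that fails in the traceback.
import torch

def merge_lora_weight(weight, lora_up, lora_down, alpha=1.0, compute_on_cpu=False):
    device = torch.device("cpu") if compute_on_cpu else weight.device
    # Fresh fp32 copy of the base weight; on GPU this needs
    # 18432 * 3072 * 4 bytes = 216 MiB for img_mod.lin alone.
    w = weight.to(device=device, dtype=torch.float32)
    # Low-rank delta: (out, rank) @ (rank, in) -> (out, in).
    delta = alpha * (lora_up.to(device, torch.float32) @ lora_down.to(device, torch.float32))
    # Cast back to the original dtype/device once the math is done.
    return (w + delta).to(dtype=weight.dtype, device=weight.device)
```

Merging on the CPU trades patching speed for VRAM headroom, which is why LoRA patching on tight cards tends to spill into system RAM.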

Tom-Neverwinter commented 3 weeks ago

Need additional details: which model, which OS, etc.

If it's not the GGUF or NF4 version, it's not going to fit on your hardware.
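For rough scale on that point, here is a back-of-envelope sketch; the ~12B parameter count for the FLUX.1-dev transformer and the NF4 bytes-per-weight (4 bits plus block-scale overhead) are approximations:

```python
# Approximate VRAM for the FLUX.1-dev transformer weights alone.
params = 12e9
for name, bytes_per_param in [("fp16/bf16", 2.0), ("fp8", 1.0), ("NF4", 0.56)]:
    print(f"{name:>9}: {params * bytes_per_param / 2**30:.1f} GiB")
# fp16/bf16: 22.4 GiB  -> cannot fit a 16 GB card even before inference buffers
#       fp8: 11.2 GiB
#       NF4:  6.3 GiB  -> close to the 6246.84 MB "Required Model Memory" in the log
```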

elen07zz commented 3 weeks ago

> Need additional details: which model, which OS, etc.
>
> If it's not the GGUF or NF4 version, it's not going to fit on your hardware.

For some reason, dev fp8 works better than NF4 when I use a LoRA. NF4 just uses a lot of VRAM and RAM, making the generation speed absurdly slow.
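One plausible reason for that, offered as an assumption rather than something confirmed in this thread: a 4-bit NF4 weight cannot absorb a LoRA delta directly, so each patched layer is first dequantized to full precision, briefly holding both copies. A toy calculation using the 18432×3072 `img_mod.lin` shape from the log:

```python
out_f, in_f = 18432, 3072     # img_mod.lin shape inferred from the log
n = out_f * in_f
packed_bytes = n // 2          # NF4 packs two 4-bit values per byte
fp32_bytes = n * 4             # dequantized copy needed for the merge
print(f"packed NF4: {packed_bytes / 2**20:.0f} MiB, fp32 copy: {fp32_bytes / 2**20:.0f} MiB")
# packed NF4: 27 MiB, fp32 copy: 216 MiB -- an ~8x transient spike per tensor,
# repeated across all 304 "Patching LoRAs" entries, with spill-over into
# system RAM once VRAM runs out, which would explain the slowdown.
```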

nailz420 commented 3 weeks ago

Windows 11 Pro 64-bit, 32 GB RAM, RTX 4070 Ti SUPER (16 GB VRAM). Checkpoint: lllyasvielFlux1DevBnb_flux1DevBnbNf4V2.safetensors; LoRA: boreal-flux-lora-v0.4

version: f2.0.1v1.10.1-previous-317-g4bb56139  •  python: 3.10.6  •  torch: 2.3.1+cu121  •  xformers: N/A  •  gradio: 4.40.0  •  checkpoint: 5181ee364f