kijai / ComfyUI-SUPIR

SUPIR upscaling wrapper for ComfyUI

OutOfMemory with 12GB VRAM #12

Closed: cdb-boop closed this issue 9 months ago

cdb-boop commented 9 months ago

It was said in the original repo, and you also thought it was the case, that it is possible to get this running within 12GB of VRAM, but I just can't get it to work with this wrapper.

I downloaded all the models, resolved issues with xformers and pytorch+cuda, used an integrated GPU for my display, fiddled with the settings (`use_tiled_vae`, `diffusion_dtype` and `encoder_dtype`), and fed in a 512x512 test image. Am I missing something?

**Platform**: Windows
**Python version**: 3.10.13

Total VRAM 12288 MB, total RAM 65309 MB
xformers version: 0.0.23.post1
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3060 : cudaMallocAsync
VAE dtype: torch.bfloat16
Using xformers cross attention

...
...
...

Diffusion using bf16
Encoder using bf16
Building a Downsample layer with 2 dims.
  --> settings are:
 in-chn: 320, out-chn: 320, kernel-size: 3, stride: 2, padding: 1
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
Building a Downsample layer with 2 dims.
  --> settings are:
 in-chn: 640, out-chn: 640, kernel-size: 3, stride: 2, padding: 1
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
Initialized embedder #0: FrozenCLIPEmbedder with 123060480 params. Trainable: False
Initialized embedder #1: FrozenOpenCLIPEmbedder2 with 694659841 params. Trainable: False
Initialized embedder #2: ConcatTimestepEmbedderND with 0 params. Trainable: False
Initialized embedder #3: ConcatTimestepEmbedderND with 0 params. Trainable: False
Initialized embedder #4: ConcatTimestepEmbedderND with 0 params. Trainable: False
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Building a Downsample layer with 2 dims.
  --> settings are:
 in-chn: 320, out-chn: 320, kernel-size: 3, stride: 2, padding: 1
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
Building a Downsample layer with 2 dims.
  --> settings are:
 in-chn: 640, out-chn: 640, kernel-size: 3, stride: 2, padding: 1
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
Loaded state_dict from [C:\Users\User\ComfyUI\models\checkpoints\SUPIR-v0Q-001.ckpt]
Loaded state_dict from [C:\Users\User\stable-diffusion-webui\models\Stable-diffusion\xl\realvisXLv3.0_v30Bakedvae.safetensors]
ERROR:root:!!! Exception during processing !!!
ERROR:root:Traceback (most recent call last):
  File "C:\Users\User\ComfyUI\execution.py", line 152, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
  File "C:\Users\User\ComfyUI\execution.py", line 82, in get_output_data
    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
  File "C:\Users\User\ComfyUI\execution.py", line 75, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
  File "C:\Users\User\ComfyUI\custom_nodes\ComfyUI-SUPIR\nodes.py", line 142, in process
    self.model.to(device).to(dtype)
  File "C:\Users\User\anaconda3\envs\comfyui\lib\site-packages\lightning_fabric\utilities\device_dtype_mixin.py", line 54, in to
    return super().to(*args, **kwargs)
  File "C:\Users\User\anaconda3\envs\comfyui\lib\site-packages\torch\nn\modules\module.py", line 1160, in to
    return self._apply(convert)
  File "C:\Users\User\anaconda3\envs\comfyui\lib\site-packages\torch\nn\modules\module.py", line 810, in _apply
    module._apply(fn)
  File "C:\Users\User\anaconda3\envs\comfyui\lib\site-packages\torch\nn\modules\module.py", line 810, in _apply
    module._apply(fn)
  File "C:\Users\User\anaconda3\envs\comfyui\lib\site-packages\torch\nn\modules\module.py", line 810, in _apply
    module._apply(fn)
  [Previous line repeated 6 more times]
  File "C:\Users\User\anaconda3\envs\comfyui\lib\site-packages\torch\nn\modules\module.py", line 833, in _apply
    param_applied = fn(param)
  File "C:\Users\User\anaconda3\envs\comfyui\lib\site-packages\torch\nn\modules\module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated     : 11.31 GiB
Requested               : 6.25 MiB
Device limit            : 12.00 GiB
Free (according to CUDA): 0 bytes
PyTorch limit (set by user-supplied memory fraction): 17179869184.00 GiB
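For dumps like the one above, it can help to log the same numbers from Python at the point of failure. A small helper (the function name is mine, not part of ComfyUI-SUPIR) using only documented `torch.cuda` calls:

```python
import torch

def cuda_mem_report(device=0):
    """Summarize CUDA memory usage, mirroring the figures in the OOM message."""
    if not torch.cuda.is_available():
        return "CUDA not available"
    free, total = torch.cuda.mem_get_info(device)  # (free, total) in bytes
    gib = 2**30
    return (f"allocated={torch.cuda.memory_allocated(device) / gib:.2f} GiB, "
            f"reserved={torch.cuda.memory_reserved(device) / gib:.2f} GiB, "
            f"free={free / gib:.2f} GiB of {total / gib:.2f} GiB")

print(cuda_mem_report())
```

Note that `memory_allocated` only counts tensors PyTorch itself holds; the gap between "reserved" and "allocated" is cache the allocator can reuse.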
MoreColors123 commented 9 months ago

I'm not 100% sure but I think I read somewhere that it only works with proper SDXL checkpoints, no Turbo, no Lightning. Can someone confirm that? And is RealvisXL 3 of that kind?

kijai commented 9 months ago

Have you disabled system memory fallback in the NVIDIA drivers? I'm able to do 512 -> 1024 with my 10GB 3080; it's slow (~2 min) and uses system memory at the peak, but it works. You can also try reducing the tiled VAE size.
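For anyone wondering why a smaller tile size lowers the peak: a tiled VAE decodes the latent in chunks, so activation memory scales with the tile rather than the full image. A minimal NumPy sketch of the idea (not the wrapper's actual implementation, which also blends overlapping tiles to hide seams):

```python
import numpy as np

def tiled_decode(latent, decode, tile=32):
    """Apply `decode` to tile-sized chunks of a (H, W) latent.

    Peak memory inside `decode` is bounded by the tile size instead of
    the full latent. Sketch only: no overlap, no seam blending.
    """
    h, w = latent.shape
    out = np.zeros_like(latent)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            out[y:y + tile, x:x + tile] = decode(latent[y:y + tile, x:x + tile])
    return out
```

With a per-tile cost that dominates memory use, halving the tile edge roughly quarters the peak, at the price of more (and slower) decode calls.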

kijai commented 9 months ago

> I'm not 100% sure but I think I read somewhere that it only works with proper SDXL checkpoints, no Turbo, no Lightning. Can someone confirm that? And is RealvisXL 3 of that kind?

Pretty sure the scheduler used only works with normal SDXL models.

cdb-boop commented 9 months ago

> I'm not 100% sure but I think I read somewhere that it only works with proper SDXL checkpoints, no Turbo, no Lightning. Can someone confirm that? And is RealvisXL 3 of that kind?

I just tried the base model and got the same problem.

> Have you disabled system memory fallback from nvidia drivers? I'm able to do 512 -> 1024 with my 10GB 3080, it's slow (2 mins) and uses system memory when it peaks, but it works. You can also try with reducing the tiled vae size.

Yeah, I had disabled fallback. I'll need to investigate more.

zreren commented 9 months ago

I had the same problem on my 3060 with 12GB VRAM:

    line 1150, in convert
        return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
    torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
    Currently allocated     : 11.31 GiB
    Requested               : 6.25 MiB
    Device limit            : 12.00 GiB
    Free (according to CUDA): 0 bytes
    PyTorch limit (set by user-supplied memory fraction): 17179869184.00 GiB

zreren commented 9 months ago

I disabled system memory fallback and it works

kijai commented 9 months ago

> > I'm not 100% sure but I think I read somewhere that it only works with proper SDXL checkpoints, no Turbo, no Lightning. Can someone confirm that? And is RealvisXL 3 of that kind?
>
> I just tried the base model and got the same problem.
>
> > Have you disabled system memory fallback from nvidia drivers? I'm able to do 512 -> 1024 with my 10GB 3080, it's slow (2 mins) and uses system memory when it peaks, but it works. You can also try with reducing the tiled vae size.
>
> Yeah, I had disabled fallback. I'll need to investigate more.

I found one bug that caused a memory spike after model load, could explain this too, it's fixed now. With system memory fallback it wasn't an issue so I didn't notice it at first.
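The traceback points at `self.model.to(device).to(dtype)`: moving the fp32 weights to the GPU first and only then casting briefly holds two copies of the weights in VRAM. Whether or not that is the exact bug fixed in the commit, a cast-before-move ordering avoids that spike; a minimal sketch with a stand-in model (not the wrapper's actual code):

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)  # stand-in for the full SUPIR model

# Cast on the CPU first so the fp32 and bf16 copies never coexist in VRAM;
# only the half-size bf16 tensors are then transferred to the device.
model = model.to(torch.bfloat16)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

print(model.weight.dtype)  # torch.bfloat16
```

The trade-off is that the cast runs on the (slower) CPU, but for a one-time model load that is usually negligible next to the disk read.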

cdb-boop commented 9 months ago

With commit c74b8248a73352dc5bdc99496006e96321738f38, it seems to be working now. I'm seeing a peak VRAM usage of 10427MiB and idle of 9339MiB with a 512x512 to 512x512 pass and the default settings.

Joly0 commented 8 months ago

Hey guys, somehow I am unable to get this working on my 4060 Ti with 16GB VRAM. I keep getting:

ERROR:root:!!! Exception during processing !!!
ERROR:root:Traceback (most recent call last):
  File "/config/05-comfy-ui/ComfyUI/execution.py", line 152, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
  File "/config/05-comfy-ui/ComfyUI/execution.py", line 82, in get_output_data
    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
  File "/config/05-comfy-ui/ComfyUI/execution.py", line 75, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
  File "/config/05-comfy-ui/ComfyUI/custom_nodes/ComfyUI-SUPIR/nodes.py", line 212, in process
    self.model.init_tile_vae(encoder_tile_size=encoder_tile_size_pixels, decoder_tile_size=decoder_tile_size_latent)
AttributeError: 'SUPIR_Upscale' object has no attribute 'model'

I thought I'd ask here first before opening a new issue. But it seems people got it to work with 12GB of VRAM; I have 16GB, and I tried even with super low resolutions like 265x265.

cdb-boop commented 8 months ago

@Joly0 Always fine to ask before opening a new issue. :)

Anyway, I don't see a memory issue in the output you've shown. It looks like the model isn't getting initialized. I'd suggest double-checking that you downloaded and placed the model weights correctly, then opening a new issue including all relevant debug output.
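The `AttributeError` is the symptom of exactly that: `process()` reaches `init_tile_vae` before `self.model` was ever assigned, which happens when the checkpoint load failed earlier. A generic defensive pattern for nodes like this (hypothetical class and message, not the node's actual code):

```python
class UpscaleNode:
    """Toy stand-in for a ComfyUI node that lazily loads a model."""

    def __init__(self):
        self.model = None  # assigned only after a successful checkpoint load

    def process(self, image):
        # Fail with an actionable message instead of a bare AttributeError.
        if getattr(self, "model", None) is None:
            raise RuntimeError(
                "SUPIR model is not loaded - check the checkpoint paths "
                "and the loader output above."
            )
        return self.model(image)
```

With a guard like this, a wrong checkpoint path surfaces as a clear error at the top of the log rather than a confusing attribute lookup deep inside `process`.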