[bug]: ROCm Out of Memory Errors - Excessive VRAM Allocation

ThisNekoGuy commented 6 months ago

Is there an existing issue for this problem?

[X] I have searched the existing issues

Operating system

Linux

GPU vendor

AMD (ROCm)

GPU model

RX 7800 XT

GPU VRAM

16GB

Version number

4.2.0a4

Browser

LibreWolf 125.0.2-1

Python dependencies

{
  "accelerate": "0.29.2",
  "compel": "2.0.2",
  "cuda": null,
  "diffusers": "0.27.2",
  "numpy": "1.26.4",
  "opencv": "4.9.0.80",
  "onnx": "1.15.0",
  "pillow": "10.3.0",
  "python": "3.11.9",
  "torch": "2.2.2+rocm5.6",
  "torchvision": "0.17.2+rocm5.6",
  "transformers": "4.39.3",
  "xformers": null
}

What happened

When trying to generate an image, a huge amount of VRAM was allocated and prevented the image generation from being able to request any more to actually... generate the image.

Coming from Nvidia to AMD recently, of which my Nvidia card had only 11GBs, I find this unusual because I only tried making a 832x1480 image and that's not particularly large (at least, not large enough to trigger OOM on my Nvidia card when I used other Stable Diffusion front-ends before I came to InvokeAI today).

What you expected to happen

I expected the image to be able to generate without issue; probably even with VRAM to spare.

How to reproduce the problem

Edit the invoke.sh script to modify line 41 to HSA_OVERRIDE_GFX_VERSION=11.0.0 invokeai-web $PARAMS to get past an initial segfault bug when attempting generations
Use this model (~2 GBs)
Use this VAE (~385.9 MBs)
Set aspect ratio to 9:16
Set width to 832
Set height to 1480
Enable High Resolution Fix (or don't, it doesn't matter, result is the same)
If you enabled the High Resolution Fix, set denoise to 0.55 and set the upscaler to ESRGAN
Set the scheduler to DPM ++2M Karras
Set steps to 25
Set CFG Scale to 7.5
Leave the VAE at the default precision (FP32)
Set Click Skip to 2
Leave CFG Rescale Multiplier at 0 or follow the tooltip and set it to 0.7 (Result is the same, regardless)

Additional context

Specific Linux Distro: Gentoo (LLVM17 Built) Kernel: 6.9.0-rc6-tkg-eevdf-gentoo-llvm-zen2 Blas Implementation: openblas

Terminal Output:

``` Generate images with a browser-based interface >> patchmatch.patch_match: ERROR - patchmatch failed to load or compile (/usr/lib64/libtiff.so.6: undefined symbol: jpeg12_read_raw_data, version LIBJPEG_6.2). >> patchmatch.patch_match: INFO - Refer to https://invoke-ai.github.io/InvokeAI/installation/060_INSTALL_PATCHMATCH/ for installation instructions. [2024-05-02 07:37:46,993]::[InvokeAI]::INFO --> Patchmatch not loaded (nonfatal) [2024-05-02 07:38:06,846]::[InvokeAI]::INFO --> Using torch device: AMD Radeon RX 7800 XT [2024-05-02 07:38:07,024]::[InvokeAI]::INFO --> cuDNN version: 2020000 [2024-05-02 07:38:07,038]::[uvicorn.error]::INFO --> Started server process [19373] [2024-05-02 07:38:07,038]::[uvicorn.error]::INFO --> Waiting for application startup. [2024-05-02 07:38:07,038]::[InvokeAI]::INFO --> InvokeAI version 4.2.0a4 [2024-05-02 07:38:07,039]::[InvokeAI]::INFO --> Root directory = /mnt/chonker/InvokeAI/InstallDir [2024-05-02 07:38:07,039]::[InvokeAI]::INFO --> Initializing database at /mnt/chonker/InvokeAI/InstallDir/databases/invokeai.db [2024-05-02 07:38:07,277]::[InvokeAI]::INFO --> Pruned 1 finished queue items [2024-05-02 07:38:07,752]::[InvokeAI]::INFO --> Cleaned database (freed 0.02MB) [2024-05-02 07:38:07,752]::[uvicorn.error]::INFO --> Application startup complete. [2024-05-02 07:38:07,752]::[uvicorn.error]::INFO --> Uvicorn running on http://127.0.0.1:9090 (Press CTRL+C to quit) [2024-05-02 07:38:09,825]::[uvicorn.access]::INFO --> 127.0.0.1:58476 - "GET /ws/socket.io/?EIO=4&transport=polling&t=OyvJ-gP HTTP/1.1" 200 [2024-05-02 07:38:09,830]::[uvicorn.access]::INFO --> 127.0.0.1:58476 - "POST /ws/socket.io/?EIO=4&transport=polling&t=OyvJ-gf&sid=REhucWSM_N8uuUmfAAAA HTTP/1.1" 200 [2024-05-02 07:38:09,831]::[uvicorn.error]::INFO --> ('127.0.0.1', 58488) - "WebSocket /ws/socket.io/?EIO=4&transport=websocket&sid=REhucWSM_N8uuUmfAAAA" [accepted] [2024-05-02 07:38:09,832]::[uvicorn.error]::INFO --> connection open [2024-05-02 07:38:09,832]::[uvicorn.access]::INFO --> 127.0.0.1:58494 - "GET /ws/socket.io/?EIO=4&transport=polling&t=OyvJ-gf.0&sid=REhucWSM_N8uuUmfAAAA HTTP/1.1" 200 [2024-05-02 07:38:09,836]::[uvicorn.access]::INFO --> 127.0.0.1:58476 - "GET /ws/socket.io/?EIO=4&transport=polling&t=OyvJ-gf.1&sid=REhucWSM_N8uuUmfAAAA HTTP/1.1" 200 [2024-05-02 07:38:09,864]::[uvicorn.access]::INFO --> 127.0.0.1:58476 - "GET /api/v1/queue/default/status HTTP/1.1" 200 [2024-05-02 07:38:10,080]::[uvicorn.access]::INFO --> 127.0.0.1:58476 - "GET /api/v1/images/?board_id=none&categories=control&categories=mask&categories=user&categories=other&is_intermediate=false&limit=0&offset=0 HTTP/1.1" 200 [2024-05-02 07:38:10,081]::[uvicorn.access]::INFO --> 127.0.0.1:58494 - "GET /api/v1/app/config HTTP/1.1" 200 [2024-05-02 07:38:10,082]::[uvicorn.access]::INFO --> 127.0.0.1:58476 - "GET /api/v1/images/?board_id=none&categories=general&is_intermediate=false&limit=0&offset=0 HTTP/1.1" 200 [2024-05-02 07:38:10,082]::[uvicorn.access]::INFO --> 127.0.0.1:58494 - "GET /api/v1/images/intermediates HTTP/1.1" 200 [2024-05-02 07:38:10,083]::[uvicorn.access]::INFO --> 127.0.0.1:58506 - "GET /api/v1/app/version HTTP/1.1" 200 [2024-05-02 07:38:10,083]::[uvicorn.access]::INFO --> 127.0.0.1:58510 - "GET /api/v1/boards/?all=true HTTP/1.1" 200 [2024-05-02 07:38:10,084]::[uvicorn.access]::INFO --> 127.0.0.1:58524 - "GET /api/v1/images/?board_id=none&categories=general&is_intermediate=false&limit=100&offset=0 HTTP/1.1" 200 [2024-05-02 07:38:10,093]::[uvicorn.access]::INFO --> 127.0.0.1:58536 - "GET /api/v1/app/app_deps HTTP/1.1" 200 [2024-05-02 07:38:10,094]::[uvicorn.access]::INFO --> 127.0.0.1:58476 - "GET /api/v1/queue/default/list HTTP/1.1" 200 [2024-05-02 07:38:10,095]::[uvicorn.access]::INFO --> 127.0.0.1:58494 - "GET /api/v1/queue/default/status HTTP/1.1" 200 [2024-05-02 07:38:19,190]::[uvicorn.access]::INFO --> 127.0.0.1:40554 - "POST /api/v1/queue/default/enqueue_batch HTTP/1.1" 200 [2024-05-02 07:38:19,410]::[uvicorn.access]::INFO --> 127.0.0.1:40554 - "GET /api/v1/queue/default/status HTTP/1.1" 200 [2024-05-02 07:38:19,444]::[uvicorn.access]::INFO --> 127.0.0.1:40568 - "GET /api/v1/queue/default/list HTTP/1.1" 200 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:09<00:00, 2.57it/s] /mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/lightning_utilities/core/imports.py:14: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html import pkg_resources /mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/pkg_resources/__init__.py:2832: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`. Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages declare_namespace(pkg) /mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/lightning_fabric/__init__.py:40: Deprecated call to `pkg_resources.declare_namespace('lightning_fabric')`. Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages /mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/pytorch_lightning/__init__.py:37: Deprecated call to `pkg_resources.declare_namespace('pytorch_lightning')`. Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages [2024-05-02 07:38:52,148]::[ModelLoadService]::INFO --> Converting /mnt/chonker/InvokeAI/InstallDir/models/sd-1/vae/kl-f8-anime2.ckpt to diffusers format [2024-05-02 07:39:05,988]::[uvicorn.access]::INFO --> 127.0.0.1:33102 - "GET /api/v1/images/i/b38fe7ca-e4a0-404c-ba73-6a5f59acd186.png HTTP/1.1" 200 [2024-05-02 07:39:06,196]::[InvokeAI]::INFO --> Downloading RealESRGAN_x4plus.pth... RealESRGAN_x4plus.pth: 67.1MiB [01:12, 929kiB/s] Upscaling: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00, 6.08s/it] [2024-05-02 07:40:32,518]::[uvicorn.access]::INFO --> 127.0.0.1:37426 - "GET /api/v1/images/i/f9bc272e-5d2e-4db7-91b6-618951310362.png HTTP/1.1" 200 [2024-05-02 07:40:32,728]::[uvicorn.access]::INFO --> 127.0.0.1:37426 - "GET /api/v1/images/i/08b0fd84-8637-4d7c-981b-db15358bc173.png HTTP/1.1" 200 0%| | 0/14 [00:03 Error while invoking session f0d54825-89e2-4f9f-8acb-4e24b2f43737, invocation 18dd1847-bf84-4d6b-9269-f76177444e74 (denoise_latents): HIP out of memory. Tried to allocate 11.03 GiB. GPU 0 has a total capacity of 15.98 GiB of which 2.48 GiB is free. Of the allocated memory 12.84 GiB is allocated by PyTorch, and 48.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [2024-05-02 07:40:52,353]::[InvokeAI]::ERROR --> Traceback (most recent call last): File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/invokeai/app/services/session_processor/session_processor_default.py", line 185, in _process outputs = self._invocation.invoke_internal( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/invokeai/app/invocations/baseinvocation.py", line 281, in invoke_internal output = self.invoke(context) ^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/invokeai/app/invocations/latent.py", line 991, in invoke result_latents = pipeline.latents_from_embeddings( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/invokeai/backend/stable_diffusion/diffusers_pipeline.py", line 339, in latents_from_embeddings latents = self.generate_latents_from_embeddings( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/invokeai/backend/stable_diffusion/diffusers_pipeline.py", line 419, in generate_latents_from_embeddings step_output = self.step( ^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/invokeai/backend/stable_diffusion/diffusers_pipeline.py", line 517, in step uc_noise_pred, c_noise_pred = self.invokeai_diffuser.do_unet_step( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/invokeai/backend/stable_diffusion/diffusion/shared_invokeai_diffusion.py", line 199, in do_unet_step ) = self._apply_standard_conditioning( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/invokeai/backend/stable_diffusion/diffusion/shared_invokeai_diffusion.py", line 343, in _apply_standard_conditioning both_results = self.model_forward_callback( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/invokeai/backend/stable_diffusion/diffusers_pipeline.py", line 590, in _unet_forward return self.unet( ^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/diffusers/models/unets/unet_2d_condition.py", line 1216, in forward sample, res_samples = downsample_block( ^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/diffusers/models/unets/unet_2d_blocks.py", line 1279, in forward hidden_states = attn( ^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/diffusers/models/transformers/transformer_2d.py", line 397, in forward hidden_states = block( ^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/diffusers/models/attention.py", line 329, in forward attn_output = self.attn1( ^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/diffusers/models/attention_processor.py", line 522, in forward return self.processor( ^^^^^^^^^^^^^^^ File "/mnt/chonker/InvokeAI/InstallDir/.venv/lib/python3.11/site-packages/diffusers/models/attention_processor.py", line 1279, in __call__ hidden_states = F.scaled_dot_product_attention( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 11.03 GiB. GPU 0 has a total capacity of 15.98 GiB of which 2.48 GiB is free. Of the allocated memory 12.84 GiB is allocated by PyTorch, and 48.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [2024-05-02 07:40:52,354]::[InvokeAI]::INFO --> Graph stats: f0d54825-89e2-4f9f-8acb-4e24b2f43737 Node Calls Seconds VRAM Used main_model_loader 1 0.001s 0.000G clip_skip 1 0.000s 0.000G compel 2 5.175s 0.343G collect 2 0.000s 0.236G noise 2 0.004s 0.294G denoise_latents 2 27.174s 12.907G core_metadata 1 0.000s 1.615G vae_loader 1 0.000s 1.615G l2i 1 18.170s 3.224G esrgan 1 86.134s 7.383G img_resize 1 0.535s 0.294G i2l 1 15.543s 3.693G TOTAL GRAPH EXECUTION TIME: 152.736s TOTAL GRAPH WALL TIME: 152.744s RAM used by InvokeAI process: 3.69G (+2.874G) RAM used to load models: 3.97G VRAM in use: 1.615G RAM cache statistics: Model cache hits: 10 Model cache misses: 5 Models cached: 5 Models cleared from cache: 0 Cache high water mark: 1.99/7.50G ```

Discord username

No response

psychedelicious commented 6 months ago

That is super excessive VRAM usage. Looks like it is running really, really slowly too. The ESRGAN upscale is ridiculously slow.

Unfortunately, I don't have an AMD GPU to test on. Not sure if any regular contributors do. May need community help to figure this out...

From a brief search, it looks like the env var PYTORCH_HIP_ALLOC_CONF may be useful. This old post for A1111 has a value to try: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/6460#issuecomment-1382623085

If fp32 is being used instead of fp16, you'll use ~2x VRAM - try setting precision: float16 in invokeai.yaml.

ThisNekoGuy commented 6 months ago

Setting both PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 and precision: float16 doesn't seem to fix it. :v (But then again, I noticed that adding log_format: color didn't actually add colored output either, so maybe the yaml just isn't being read? I'm loading the environment variables directly via the script though, so I know that's at least being read) This is weird.

psychedelicious commented 6 months ago

If you suspect the YAML file isn't being read, an easy way to test is add use_memory_db: true. You should see a line like this on startup:

[2024-05-03 13:30:13,686]::[InvokeAI]::INFO --> Initializing in-memory database

Sorry, I'm not sure where to go from here with the performance issue :/

ThisNekoGuy commented 6 months ago

Ah, that does appear in the terminal output; so I guess it is reading the file but just... not outputting color for some reason? I guess that would otherwise be somewhat reasonably safe to assume that precision: float16 is being passed but whatever the root problem is simply doesn't care.

psychedelicious commented 6 months ago

We expect colored logs for warnings and errors only (and debug, if you set the log level accordingly). This works for me.

I'm confident the precision is set correctly in normal circumstances, but who knows with the issues you are facing.

KeithTheKeiser commented 6 months ago

Having A similar issue, Though my system is not allocating massive amounts of VRAM for txt2img, It has a staggeringly low cap. So I can generate images fine but the moment I move to inpainting I cant do anything, and get this error:

OutOfMemoryError: HIP out of memory. Tried to allocate 13.91 GiB. GPU 0 has a total capacity of 19.98 GiB of which 3.43 GiB is free. Of the allocated memory 16.06 GiB is allocated by PyTorch, and 41.54 MiB is reserved by PyTorch but unallocated.

So, for some reason, on my RX 7900 XT with 20GB VRAM, only a measly 3.43GB are available.. It doesn't make sense

Edit: I am on the same system, Ubuntu Linux using AMD ROCm for processing

mcgravier commented 6 months ago

So there are few things I observed with this

Tests are done with 1024x768 image generation with SD1.5 models. SDXL doesn't seem to suffer from the issue.

All vram usage values were noted with Corectrl software - so it's total system Vram usage, not just what Invoke reports. I use 7900XTX 24GB

Set vram: 0 in the yaml config file. Setting it even to a small value like 0.5 causes multiple gigs of vram to be wasted
Set attention_type: sliced and attention_slice_size: max - for some reason this reduces vram usage by a LOT compared to other options. You can squeeze 1472x1472 image into 12GB with this. BUT it doesn't help at all when IP-Adapters and Controlnets are involved
Set sequential_guidance: true - this shaves off additional few gigs of vram in all cases

The worst case scenario in my case is when I use upscaling node on 1024x768 image (scale factor 2). Without aforementioned settings it sometimes fails to process on 24GB GPU! With theses settings it completes with ~15GB of peak Vram usage

invoke-ai / InvokeAI