lshqqytiger / stable-diffusion-webui-amdgpu-forge

Forge for stable-diffusion-webui-amdgpu (formerly stable-diffusion-webui-directml)
GNU Affero General Public License v3.0

Huge VAE memory leaks with --ZLUDA #56

Open · TheFerumn opened this issue 1 week ago

TheFerumn commented 1 week ago

I recently upgraded my main PC, so I swapped the CPU and motherboard into my older AMD setup with an RX 570. I decided to test ZLUDA performance and noticed huge memory leaks compared to DirectML, so I did some testing. Every generation used the same settings with an SDXL model, a LoRA, and Never OOM enabled. These are my memory notes from the tests:

IDLE Fresh Launch:
Dedicated   1.3/4.0 GB
Shared      1.3/7.9 GB

1st Generation 1024x1024 with LORA:     + 1.5 GB
Dedicated   2.8/4.0 GB
Shared      1.3/7.9 GB
1st VAE decoding:               + 1.4 GB
Dedicated   3.8/4.0 GB
Shared      1.7/7.9 GB

2nd Generation 1024x1024 with LORA:     + 0.3 GB
Dedicated   3.7/4.0 GB
Shared      2.1/7.9 GB
2nd VAE decoding:               + 1.3 GB
Dedicated   3.8/4.0 GB
Shared      3.3/7.9 GB

3rd Generation 1024x1024 with LORA:     - 0.4 GB
Dedicated   3.8/4.0 GB
Shared      2.9/7.9 GB
3rd VAE decoding:               + 1.5 GB    
Dedicated   3.8/4.0 GB
Shared      4.4/7.9 GB

4th Generation 1024x1024 with LORA:     - 0.4 GB
Dedicated   3.8/4.0 GB
Shared      4.0/7.9 GB
4th VAE decoding:               + 1.5 GB
Dedicated   3.8/4.0 GB
Shared      5.5/7.9 GB

5th Generation 1024x1024 with LORA:     - 0.4 GB
Dedicated   3.8/4.0 GB
Shared      5.1/7.9 GB
5th VAE decoding:               + 1.5 GB
Dedicated   3.8/4.0 GB
Shared      6.6/7.9 GB

6th Generation 1024x1024 with LORA:     - 0.4 GB
Dedicated   3.8/4.0 GB
Shared      6.2/7.9 GB
6th VAE decoding:               + 1.5 GB
Dedicated   3.8/4.0 GB
Shared      7.7/7.9 GB

7th Generation 1024x1024 with LORA:     - 0.3 GB
Dedicated   3.8/4.0 GB
Shared      7.4/7.9 GB
7th VAE decoding:
OOM

After that I did some tests with the VAE cache enabled/disabled and with an external VAE versus the one built into the checkpoint, but the results were almost the same. Then I changed the VAE method to TAESD and noticed around 0.7 GB less leaked per generation. Here are my notes for the TAESD method:

IDLE Fresh Launch:
Dedicated   1.3/4.0 GB
Shared      1.3/7.9 GB

1st Generation 1024x1024 with LORA:     + 1.5 GB
Dedicated   2.8/4.0 GB
Shared      1.3/7.9 GB

2nd Generation 1024x1024 with LORA:     + 0.8 GB
Dedicated   3.6/4.0 GB
Shared      1.3/7.9 GB

3rd Generation 1024x1024 with LORA:     + 0.9 GB
Dedicated   3.7/4.0 GB
Shared      2.1/7.9 GB

4th Generation 1024x1024 with LORA:     + 0.1 GB
Dedicated   3.7/4.0 GB
Shared      2.2/7.9 GB

5th Generation 1024x1024 with LORA:     + 0.4 GB
Dedicated   3.7/4.0 GB
Shared      2.6/7.9 GB

6th Generation 1024x1024 with LORA:     + 0.4 GB
Dedicated   3.7/4.0 GB
Shared      3.0/7.9 GB
+ 0.4 GB for each following generation, and so on.

As you can see, by the 6th generation there is 4.6 GB less memory in use compared to the full VAE method. I then tried to eliminate the rest of the memory loss and measured roughly 100 MB of VRAM lost per generation attributable to the LoRA (I used a 324 MB LoRA for testing). Going into Settings/Actions and using Unload all models does nothing. I haven't gone deep enough to understand the difference between the VAE methods, so my question is: is this normal behaviour, and can the difference really be this massive?
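
For reference, these numbers can also be cross-checked from inside the process rather than from Task Manager. Below is a minimal sketch, assuming a ZLUDA build of PyTorch that exposes the usual torch.cuda counters; the helper name and call sites are mine, not part of Forge:

```python
import gc
import torch

def log_vram(tag: str) -> None:
    # Free/total memory as the driver reports it (roughly what Task Manager tracks).
    free, total = torch.cuda.mem_get_info()
    # Memory held by live tensors vs. memory PyTorch's allocator keeps cached for reuse.
    allocated = torch.cuda.memory_allocated()
    reserved = torch.cuda.memory_reserved()
    print(f"[{tag}] free {free / 2**30:.2f}/{total / 2**30:.2f} GiB | "
          f"allocated {allocated / 2**30:.2f} GiB | reserved {reserved / 2**30:.2f} GiB")

# Hypothetical usage around one generation:
#   log_vram("before sampling")
#   ... sampling ...
#   log_vram("after sampling")
#   ... VAE decode ...
#   gc.collect()
#   torch.cuda.empty_cache()   # hand cached blocks back to the driver
#   log_vram("after VAE decode + empty_cache")
```

If `reserved` returns to its baseline after `empty_cache()` but the driver-reported free memory keeps shrinking, the growth is happening below PyTorch (runtime/driver side) rather than in anything Forge holds on to.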

TheFerumn commented 6 days ago

Can anyone confirm whether this happens for everybody, or is it maybe just poor ZLUDA compatibility with such an old GPU? I am running a test right now with DirectML and there is zero memory loss: no matter which VAE method I choose, it takes 4.4 GB of VRAM on the first generation and nothing more on every subsequent one.

Harbitos commented 5 days ago

This is also happening for me, so I'm using DirectML. The first generation is normal, the next one shows that there is already little memory left, and then it simply reports that memory has run out. I have an RX 580 8 GB.

TheFerumn commented 5 days ago

Yeah. DirectML allocates all the memory it needs on the first generation and doesn't take anything more, even if I queue 100 images to generate. I can launch an infinite generation and leave it be, but ZLUDA will eat everything up in about 7 generations with the full VAE method and about 18 images with TAESD. Fixing this issue would be huge, since ZLUDA is 2x faster for me. I was already trying some changes in memory_management.py, but nothing really helps. I will probably keep investigating it in my free time later.
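
One way to keep digging without touching memory_management.py is to count how many CUDA tensors are still referenced after each generation; if that number keeps growing, something above the allocator is holding on to them. A rough debugging sketch (just a diagnostic aid, not part of Forge):

```python
import gc
import torch

def count_live_cuda_tensors() -> int:
    """Count CUDA tensors still referenced anywhere in the process."""
    count = 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                count += 1
        except Exception:
            # Some tracked objects raise on attribute access; skip them.
            continue
    return count

# Call this after each generation: a steadily growing count points at Python-side
# references (cached latents, previews, history) rather than a driver-level leak.
```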

Harbitos commented 5 days ago

> Yeah. DirectML allocates all the memory it needs on the first generation and doesn't take anything more, even if I queue 100 images to generate. I can launch an infinite generation and leave it be, but ZLUDA will eat everything up in about 7 generations with the full VAE method and about 18 images with TAESD. Fixing this issue would be huge, since ZLUDA is 2x faster for me. I was already trying some changes in memory_management.py, but nothing really helps. I will probably keep investigating it in my free time later.

And for me it even generates a little slower: 1280x1280 with an SDXL model and LoRA takes 11 min on DirectML and 14 min on ZLUDA. I've already tried a lot of things, so I'll never go back to it.

lshqqytiger commented 2 days ago

DirectML often outperforms ZLUDA on older cards. In my environment (Navi), I couldn't see such memory leaks. The memory leaks possibly come from the HIP SDK (support for gfx800/900 consumer cards was deprecated and finally dropped in 6.1).

[Unload] Trying to free 1024.00 MB for cuda:0 with 0 models keep loaded ... Current free memory is 14384.42 MB ... Done.
Cleanup minimal inference memory.
Total progress: 100%|██████████████████████████████████████████████████████████████████| 28/28 [00:11<00:00,  2.54it/s]
[Unload] Trying to free 1024.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14384.42 MB ... Done.
[Unload] Trying to free 1024.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14393.81 MB ... Done.
[Unload] Trying to free 1310.72 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14392.95 MB ... Done.
100%|##################################################################################| 28/28 [00:07<00:00,  3.51it/s]
[Unload] Trying to free 4356.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14392.45 MB ... Done.
[Unload] Trying to free 1024.00 MB for cuda:0 with 0 models keep loaded ... Current free memory is 14382.70 MB ... Done.
Cleanup minimal inference memory.
Total progress: 100%|██████████████████████████████████████████████████████████████████| 28/28 [00:08<00:00,  3.21it/s]
[Unload] Trying to free 1310.72 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14382.45 MB ... Done.
100%|##################################################################################| 28/28 [00:07<00:00,  3.50it/s]
[Unload] Trying to free 4356.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14387.95 MB ... Done.
[Unload] Trying to free 1024.00 MB for cuda:0 with 0 models keep loaded ... Current free memory is 14382.20 MB ... Done.
Cleanup minimal inference memory.
Total progress: 100%|██████████████████████████████████████████████████████████████████| 28/28 [00:08<00:00,  3.20it/s]
[Unload] Trying to free 1310.72 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14381.95 MB ... Done.
100%|##################################################################################| 28/28 [00:08<00:00,  3.50it/s]
[Unload] Trying to free 4356.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14387.45 MB ... Done.
[Unload] Trying to free 1024.00 MB for cuda:0 with 0 models keep loaded ... Current free memory is 14381.70 MB ... Done.
Cleanup minimal inference memory.
Total progress: 100%|██████████████████████████████████████████████████████████████████| 28/28 [00:08<00:00,  3.20it/s]
[Unload] Trying to free 1310.72 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14381.45 MB ... Done.
100%|##################################################################################| 28/28 [00:08<00:00,  3.49it/s]
[Unload] Trying to free 4356.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14386.95 MB ... Done.
[Unload] Trying to free 1024.00 MB for cuda:0 with 0 models keep loaded ... Current free memory is 14381.20 MB ... Done.
Cleanup minimal inference memory.
Total progress: 100%|██████████████████████████████████████████████████████████████████| 28/28 [00:08<00:00,  3.22it/s]
[Unload] Trying to free 1310.72 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14380.95 MB ... Done.
100%|##################################################################################| 28/28 [00:08<00:00,  3.49it/s]
[Unload] Trying to free 4356.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14389.95 MB ... Done.
[Unload] Trying to free 1024.00 MB for cuda:0 with 0 models keep loaded ... Current free memory is 14382.20 MB ... Done.
Cleanup minimal inference memory.
Total progress: 100%|██████████████████████████████████████████████████████████████████| 28/28 [00:08<00:00,  3.21it/s]
[Unload] Trying to free 1310.72 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14381.95 MB ... Done.
100%|##################################################################################| 28/28 [00:08<00:00,  3.49it/s]
[Unload] Trying to free 4356.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14387.45 MB ... Done.
[Unload] Trying to free 1024.00 MB for cuda:0 with 0 models keep loaded ... Current free memory is 14381.70 MB ... Done.
Cleanup minimal inference memory.
Total progress: 100%|██████████████████████████████████████████████████████████████████| 28/28 [00:08<00:00,  3.20it/s]
[Unload] Trying to free 1310.72 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14381.45 MB ... Done.
100%|##################################################################################| 28/28 [00:08<00:00,  3.49it/s]
[Unload] Trying to free 4356.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14386.95 MB ... Done.
[Unload] Trying to free 1024.00 MB for cuda:0 with 0 models keep loaded ... Current free memory is 14381.20 MB ... Done.
Cleanup minimal inference memory.
Total progress: 100%|██████████████████████████████████████████████████████████████████| 28/28 [00:08<00:00,  3.20it/s]
[Unload] Trying to free 1310.72 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14380.95 MB ... Done.
100%|##################################################################################| 28/28 [00:08<00:00,  3.49it/s]
[Unload] Trying to free 4356.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14388.45 MB ... Done.
[Unload] Trying to free 1024.00 MB for cuda:0 with 0 models keep loaded ... Current free memory is 14382.70 MB ... Done.
Cleanup minimal inference memory.
Total progress: 100%|██████████████████████████████████████████████████████████████████| 28/28 [00:08<00:00,  3.21it/s]
[Unload] Trying to free 1310.72 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14382.45 MB ... Done.
100%|##################################################################################| 28/28 [00:08<00:00,  3.49it/s]
[Unload] Trying to free 4356.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 14389.45 MB ... Done.
[Unload] Trying to free 1024.00 MB for cuda:0 with 0 models keep loaded ... Current free memory is 14381.70 MB ... Done.
Cleanup minimal inference memory.
Total progress: 100%|██████████████████████████████████████████████████████████████████| 28/28 [00:10<00:00,  2.58it/s]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 28/28 [00:10<00:00,  4.40it/s]

(1024x1024, SDXL with 1 LoRA loaded)
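
For anyone comparing their own console output against the log above, the free-memory values from the [Unload] lines can be pulled out with a few lines of Python (the log file path below is hypothetical):

```python
import re

# Extract the "Current free memory is N MB" values printed by the [Unload] lines
# so a downward trend across generations is easy to spot.
pattern = re.compile(r"Current free memory is ([0-9.]+) MB")

with open("console.log", encoding="utf-8") as fh:
    free_mb = [float(m.group(1)) for m in pattern.finditer(fh.read())]

if free_mb:
    print(f"first {free_mb[0]:.2f} MB, last {free_mb[-1]:.2f} MB, "
          f"drop {free_mb[0] - free_mb[-1]:.2f} MB over {len(free_mb)} samples")
```

In the log above the values hover around 14380-14393 MB for the whole session, which is what "no leak" looks like; on the RX 570 setup the same figures should trend steadily downward if the leak is real.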

TheFerumn commented 2 days ago

> DirectML often outperforms ZLUDA on older cards. In my environment (Navi), I couldn't see such memory leaks. The memory leaks possibly come from the HIP SDK (support for gfx800/900 consumer cards was deprecated and finally dropped in 6.1).

I was afraid you were going to say that. Do you think installing an older HIP SDK might help, or is that not supported by Forge? I wonder whether I could somehow fix it in Rocmlibs or whether the issue is somewhere else, but I guess it's not worth the time.

lshqqytiger commented 2 days ago

Unfortunately, the oldest HIP SDK for Windows was released six years after the RX 500 series came out, and that was only about a year ago. I hardly expect an older release to improve this situation. You'd better stay on DirectML or get newer hardware.

TheFerumn commented 1 day ago

> Unfortunately, the oldest HIP SDK for Windows was released six years after the RX 500 series came out, and that was only about a year ago. I hardly expect an older release to improve this situation. You'd better stay on DirectML or get newer hardware.

Well, it's actually my 2nd PC, used just to generate some simple stuff in the background. I thought maybe I could make it faster, but I won't sweat over it. Close this issue if you are sure it's not Forge related. I am just curious: do you have any suspicions why there is such a huge difference between the two VAE methods? Around 1 GB of difference in VRAM loss per generation.

lshqqytiger commented 1 day ago

If the memory leaks also occur in webui-amdgpu, it is probably not Forge related. As for the VAE methods, I'm not sure about the big gap, since the cause of the memory leaks isn't obvious. However, I think it is because TAESD is designed for fast inference (at less than "full" quality), so it is lighter and smaller than the full VAE.
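
As a rough illustration of that size gap, the sketch below compares parameter counts using the diffusers package and the publicly available madebyollin/taesdxl and madebyollin/sdxl-vae-fp16-fix checkpoints (this is not how Forge loads its VAEs):

```python
import torch
from diffusers import AutoencoderKL, AutoencoderTiny

def weight_megabytes(model: torch.nn.Module) -> float:
    # Approximate weight size assuming fp16 storage (2 bytes per parameter).
    return sum(p.numel() for p in model.parameters()) * 2 / 2**20

# Public reference weights; Forge's own loading path differs.
full_vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
taesd = AutoencoderTiny.from_pretrained("madebyollin/taesdxl", torch_dtype=torch.float16)

print(f"full SDXL VAE: ~{weight_megabytes(full_vae):.0f} MB of weights")
print(f"TAESD:         ~{weight_megabytes(taesd):.0f} MB of weights")
```

Weights are only part of the story; the full VAE decoder also produces far larger intermediate activations at 1024x1024, which is where most of the per-decode VRAM spike comes from.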