vladmandic / automatic

SD.Next: Advanced Implementation of Stable Diffusion and other Diffusion-based generative image models
https://github.com/vladmandic/automatic
GNU Affero General Public License v3.0

[Issue]: Medvram causes RAM Leak #1384

Closed: Reledia closed this issue 1 year ago

Reledia commented 1 year ago

Issue Description

Issue: the program starts using around 5GB of RAM, with no models or VAEs cached. After the first generation it goes up to 11GB, and then it climbs by roughly 1GB after each image (not always linearly, as in the tests below). Models used are MeinaPastel_v5, ClearVAE V2.3 and Latent upscale; samplers are UniPC and DPM++ 2M SDE Karras (the only ones I tried).

Safe mode:

Before first generation: 'ram: free:26.97 used:4.3 total:31.27'
After first generation: 'ram: free:20.48 used:10.79 total:31.27'
After second generation: 'ram: free:18.81 used:12.46 total:31.27'
after third generation: 'ram: free:16.87 used:14.4 total:31.27'
After fourth generation: 'ram: free:17.31 used:13.96 total:31.27'
...
After seventh generation: 'ram: free:16.6 used:14.67 total:31.27'

Normal mode (ADetailer used):

Before first generation: 'ram: free:26.98 used:4.29 total:31.27'
After first generation: 'ram: free:19.93 used:11.34 total:31.27'
After second generation: 'ram: free:17.79 used:13.48 total:31.27'
...
After fifth generation: 'ram: free:17.03 used:14.24 total:31.27'

In this testing session it seems to stop at ~15GB, but it can usually go up to 20GB or more.
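For reference, the readings above are taken from the webui log; the same numbers can be read with a few lines of psutil (a generic snippet, not part of SD.Next, with 'free' approximated by psutil's 'available'):

```python
# Print system RAM in the same free/used/total format as the webui log line,
# using psutil (values in GB). Call it between generations to track growth.
import psutil

def log_ram(label: str) -> None:
    vm = psutil.virtual_memory()
    gb = 1024 ** 3
    print(f"{label} - ram: free:{vm.available / gb:.2f} "
          f"used:{vm.used / gb:.2f} total:{vm.total / gb:.2f}")

log_ram("Before 1st")
```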

Version Platform Description

version: https://github.com/vladmandic/automatic/commit/baecfb7a13bbb3c8b63fa8d60ad829c1d9ece513

16:10:43-604476 INFO     Python 3.10.11 on Linux                                                                                              
16:10:43-608758 INFO     Version: baecfb7a Sat Jun 10 08:06:29 2023 -0400                                                                     
16:10:43-739737 INFO     Setting environment tuning                                                                                           
16:10:43-740550 INFO     nVidia CUDA toolkit detected                                                                                         
16:10:44-444555 INFO     Torch 2.0.1+cu118                                                                                                    
16:10:44-453755 INFO     Torch backend: nVidia CUDA 11.8 cuDNN 8700                                                                           
16:10:44-464177 INFO     Torch detected GPU: NVIDIA GeForce RTX 3060 Ti VRAM 7971 Arch (8, 6) Cores 38                                        
16:10:44-486468 INFO     Disabled extensions-builtin: ['Segment Anything', 'locon', 'Civitai Lycoris', 'RCFG Scheduler', 'OneButtonPrompt',   
                         'Unprompted', 'openpose-editor', 'SAG', 'Cutoff', 'openOutpaint-webUI-extension', 'sd_dreambooth_extension',         
                         'stable-diffusion-webui-composable-lora', 'sd_web_ui_preset_utils']                                                  
16:10:44-487666 INFO     Enabled extensions-builtin: ['LDSR', 'Lora', 'ScuNET', 'SwinIR', 'a1111-sd-webui-lycoris', 'clip-interrogator-ext',  
                         'multidiffusion-upscaler-for-automatic1111', 'sd-dynamic-thresholding', 'sd-extension-aesthetic-scorer',             
                         'sd-extension-steps-animation', 'sd-extension-system-info', 'sd-webui-controlnet', 'sd-webui-model-converter',       
                         'seed_travel', 'stable-diffusion-webui-images-browser', 'stable-diffusion-webui-rembg', 'sd-webui-agent-scheduler']  
16:10:44-489038 INFO     Disabled extensions: ['Segment Anything', 'locon', 'Civitai Lycoris', 'RCFG Scheduler', 'OneButtonPrompt',           
                         'Unprompted', 'openpose-editor', 'SAG', 'Cutoff', 'openOutpaint-webUI-extension', 'sd_dreambooth_extension',         
                         'stable-diffusion-webui-composable-lora', 'sd_web_ui_preset_utils']                                                  
16:10:44-489789 INFO     Enabled extensions: ['Civitai Helper', 'stable-diffusion-webui', 'sd-webui-aspect-ratio-helper', 'tagger',           
                         'ultimate-upscale-for-automatic1111', 'Presets', 'Adetailer', 'sd-webui-regional-prompter', 'DynamicPrompts',        
                         'Autocomplete', 'Photopea', 'sd-webui-openpose-editor']                                                              
16:11:04-771761 INFO     Extensions enabled: ['LDSR', 'Lora', 'ScuNET', 'SwinIR', 'a1111-sd-webui-lycoris', 'clip-interrogator-ext',          
                         'multidiffusion-upscaler-for-automatic1111', 'sd-dynamic-thresholding', 'sd-extension-aesthetic-scorer',             
                         'sd-extension-steps-animation', 'sd-extension-system-info', 'sd-webui-controlnet', 'sd-webui-model-converter',       
                         'seed_travel', 'stable-diffusion-webui-images-browser', 'stable-diffusion-webui-rembg', 'sd-webui-agent-scheduler',  
                         'Civitai Helper', 'stable-diffusion-webui', 'sd-webui-aspect-ratio-helper', 'tagger',                                
                         'ultimate-upscale-for-automatic1111', 'Presets', 'Adetailer', 'sd-webui-regional-prompter', 'DynamicPrompts',        
                         'Autocomplete', 'Photopea', 'sd-webui-openpose-editor']  


Reledia commented 1 year ago

To add: the jump to >20GB seems to happen after switching models or LoRAs (even with neither set to be cached).

vladmandic commented 1 year ago

take a look at #1357 and then tell me how you'd like to proceed?

Reledia commented 1 year ago

Assuming that I know nothing about the webui architecture, I cannot find my situation in the quoted post. Setting aside the >20GB of RAM at the model switch (already mentioned there because of pymalloc), the ~15GB of RAM is reached even after only ~5 images produced with the same parameters. We are not talking about small leaks, but a jump of almost twice the amount used when the program starts (which, if I understand correctly, should already have the model and VAE in memory). Regarding a possible solution, I don't know what to suggest or even where to start, given my ignorance of the program.

vladmandic commented 1 year ago

yes, those are huge jumps, i just wanted to make sure you've read through the basics first. i cannot reproduce with a basic workflow - generating basic txt2img at 512x512 without any scripts/extensions/hires/upscale/etc. so let's try that and introduce other operations one at a time until we know which operation causes this.

Reledia commented 1 year ago

Safe mode (no loras/scripts/hires/tiled vae or any other script enabled):

Before 1st - ram: free:26.95 used:4.32 total:31.27
After 1st - ram: free:23.09 used:8.18 total:31.27
After 2nd - ram: free:22.74 used:8.53 total:31.27
After 3rd - ram: free:22.74 used:8.53 total:31.27
After 4th - ram: free:21.56 used:9.71 total:31.27
After 5th - ram: free:19.95 used:11.32 total:31.27
After 6th - ram: free:20.07 used:11.2 total:31.27
After 7th - ram: free:20.07 used:11.2 total:31.27
After 8th - ram: free:18.58 used:12.69 total:31.27
After 9th - ram: free:18.57 used:12.7 total:31.27
After 10th - ram: free:16.74 used:14.53 total:31.27

Info: Steps: 25 | Sampler: UniPC | CFG scale: 8 | Seed: 441276843 | Size: 512x512 | Model hash: 77b7dc4ef0 | Model: meinamix_meinaV10 | VAE: ClearVAE_V2.3 | Clip skip: 2 | Version: baecfb7 | Token merging ratio: 0.4 | Token merging ratio hr: 0.6 | Token merging random: True | Token merging merge cross attention: True | Token merging merge mlp: True | Parser: Full parser | UniPC variant: bh2

vladmandic commented 1 year ago
Reledia commented 1 year ago

Huh, weird, I thought --safe mode would have disabled that, my bad. I am manually requesting generations one by one

vladmandic commented 1 year ago

safe mode disables user-installed extensions, not much else right now. so does the problem happen without tome, and can you reproduce using batch size > 1?

Reledia commented 1 year ago

Safe mode + no tome and batch size 3:

Before 1st - ram: free:26.93 used:4.34 total:31.27
After 1st - ram: free:22.92 used:8.35 total:31.27
After 2nd - ram: free:21.3 used:9.96 total:31.27
After 3rd - ram: free:19.72 used:11.55 total:31.27
After 4th - ram: free:18.17 used:13.1 total:31.27
After 5th - ram: free:19.59 used:11.68 total:31.27
After 6th - ram: free:19.36 used:11.91 total:31.27
After 7th - ram: free:18.3 used:12.97 total:31.27
After 8th - ram: free:18.3 used:12.97 total:31.27
After 9th - ram: free:18.3 used:12.97 total:31.27
After 10th - ram: free:16.81 used:14.46 total:31.27

Info: Steps: 15 | Sampler: UniPC | CFG scale: 4 | Seed: -1 | Size: 512x512 | Model hash: 77b7dc4ef0 | Model: meinamix_meinaV10 | VAE: ClearVAE_V2.3 | Clip skip: 2 | Version: baecfb7 | Token merging ratio: 0 | Token merging ratio hr: 0 | Token merging random: True | Token merging merge attention: False | Parser: Full parser | UniPC variant: bh2

vladmandic commented 1 year ago

i just tried running 40 generate requests sequentially and cannot get memory to move at all, utilization is flat. are you running with any kind of cmdline flags (such as medvram or lowvram)? purely for testing purposes, can you delete config.json so the server recreates the default config? you can back it up and copy it back later if you wish.
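for anyone who wants to repeat that kind of sequential test, here's roughly what such a loop looks like - a generic sketch that assumes the a1111-compatible /sdapi/v1/txt2img endpoint on the default port, so adjust the url and payload to your setup:

```python
# Fire N txt2img requests one after another and log system RAM after each.
# Assumes an A1111-compatible /sdapi/v1/txt2img API at 127.0.0.1:7860;
# the prompt/payload values are placeholders.
import psutil
import requests

URL = "http://127.0.0.1:7860/sdapi/v1/txt2img"
payload = {"prompt": "test", "steps": 15, "width": 512, "height": 512,
           "cfg_scale": 4, "sampler_name": "UniPC"}

gb = 1024 ** 3
for i in range(40):
    requests.post(URL, json=payload, timeout=600).raise_for_status()
    vm = psutil.virtual_memory()
    print(f"after {i + 1}: used {vm.used / gb:.2f} of {vm.total / gb:.2f} GB")
```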

Reledia commented 1 year ago

1) Yes, my bad, I use --medvram.
2) Medvram + no config.json + safe mode + no scripts/lora + batch size 1:

Before 1st - ram: free:26.91 used:4.36 total:31.27
After 1st - ram: free:22.96 used:8.3 total:31.27
After 2nd - ram: free:22.67 used:8.6 total:31.27
After 3rd - ram: free:21.19 used:10.08 total:31.27
After 4th - ram: free:21.34 used:9.93 total:31.27
After 5th - ram: free:21.34 used:9.93 total:31.27
After 6th - ram: free:21.02 used:10.25 total:31.27
After 7th - ram: free:21.34 used:9.93 total:31.27
After 8th - ram: free:21.32 used:9.94 total:31.27
After 9th - ram: free:21.08 used:10.19 total:31.27
After 10th - ram: free:21.34 used:9.93 total:31.27

It stays at around 11GB max even with a higher base resolution. With batch size 10 it keeps going up by ~0.5GB each time.

Info: Steps: 15 | Sampler: UniPC | CFG scale: 4 | Seed: -1 | Size: 512x512 | Model hash: 590ae0d0b9 | Model: absolutereality_v10-inpainting | Conditional mask weight: 1.0 | Clip skip: 2 | Version: baecfb7 | Parser: Full parser

vladmandic commented 1 year ago

it has to be the model moving between gpu and cpu back and forth, can you try without medvram?
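to illustrate what i mean by the model moving back and forth, this is the general shape of that offload pattern in plain torch - not the actual sdnext code, just the operation that shuttles weights between ram and vram on every call:

```python
# Toy illustration of a medvram-style offload: a module is pushed to the GPU
# just for the forward pass and pulled back to system RAM afterwards.
import torch
import torch.nn as nn

module = nn.Linear(4096, 4096)      # stand-in for a unet/vae block
x = torch.randn(1, 4096)

if torch.cuda.is_available():
    module.to("cuda")               # weights copied into vram
    y = module(x.to("cuda")).cpu()
    module.to("cpu")                # weights copied back into ram
    torch.cuda.empty_cache()        # release the cached vram blocks
else:
    y = module(x)
```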

Reledia commented 1 year ago

No medvram/lowvram + No config.json + safe mode + no scripts/lora + batch size 1:

Before 1st - ram: free:27.82 used:3.45 total:31.27
After 1st - ram: free:25.44 used:5.83 total:31.27
After 2nd - ram: free:25.6 used:5.67 total:31.27
After 3rd - ram: free:25.59 used:5.68 total:31.27
After 4th - ram: free:25.59 used:5.68 total:31.27
After 5th - ram: free:25.59 used:5.68 total:31.27
After 6th - ram: free:25.58 used:5.69 total:31.27
After 7th - ram: free:25.58 used:5.69 total:31.27

It stays this way without medvram even with batch size > 1. Thanks for your help :)

vladmandic commented 1 year ago

glad at least now you have a working environment
i'm going to keep this issue open until i can figure out what's going on with medvram

mariodian commented 1 year ago

M2 Pro, 32GB RAM here. With medvram enabled, python eats between 6.3 and 6.5GB of RAM with aZovya's model and the Euler a sampler. This is regardless of the number of image generations, seeds and prompts.

When I change to a different model within the webui, python eats a few hundred megs more, even for a smaller model, but it doesn't grow over time. If anything it keeps going up and down, so I assume there's some garbage collection in place.

Also, I don't really see any difference between medvram enabled and disabled, apart from "Cached Files" (in the Activity Monitor app) being almost double in size with medvram disabled.

However, I run Torch 2.1.0.dev20230608 so perhaps any existing memory leaks have been fixed in that version.

itsagoodbrain commented 1 year ago

I also ran into this. Removing medvram and running with --safe also fixed the issue. I can try to help reproduce if there's interest; it would be nice to generate at higher resolutions with this fork.

vladmandic commented 1 year ago

the whole point of medvram/lowvram is to move parts of the model from ram to vram and back to save on vram. unless i'm misunderstanding something, m1/m2 do not have dedicated vram, so it's pretty much pointless.

itsagoodbrain commented 1 year ago

I'm on AMD hardware and experiencing this.

vladmandic commented 1 year ago

i cannot reproduce this issue with --medvram, backend=original and batch=4x4 - my ram usage remains stable on both linux and windows.

can someone update whether the issue is still present?

vladmandic commented 1 year ago

closing as there are no updates; the issue can be reopened if an update is provided.

daniandtheweb commented 11 months ago

On my AMD system on Linux I keep running into this leak. After a certain number of generations the webui occupies all the RAM on my PC (32GB) and crashes the whole system. I'm using --medvram and the sub-quadratic attention optimization. All the models run in fp32 mode using --no-half and --no-half-vae.

daniandtheweb commented 11 months ago

I've been checking the code and found that the models are indeed supposed to be cached in RAM indefinitely (or at least that's what I seem to understand). However, running a memory profiler I can confirm that there's some sort of memory leak, or at least RAM usage that I'm unable to explain. I'm attaching a plot of my RAM usage (Figure_1): I did more than 10 generations, all with the same prompt, and as you can see the RAM usage slowly keeps increasing. When generating without the --medvram flag, the RAM usage stays flat (Figure_2).

If there's any more specific debugging I can do please tell me.
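For completeness, a plot like the ones above can be produced by sampling the server process RSS over time; this is a generic sketch (the pid is a placeholder), not the exact script behind my figures:

```python
# Sample the webui process RSS once per second and plot the curve.
# SERVER_PID is a placeholder for the pid of the running webui process.
import time

import matplotlib.pyplot as plt
import psutil

SERVER_PID = 12345
proc = psutil.Process(SERVER_PID)

samples = []
for _ in range(600):                # ~10 minutes of one-second samples
    samples.append(proc.memory_info().rss / 1024 ** 3)
    time.sleep(1)

plt.plot(samples)
plt.xlabel("seconds")
plt.ylabel("RSS (GB)")
plt.savefig("ram_usage.png")
```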

vladmandic commented 11 months ago

unfortunately, those graphs are not that useful as there is no way to tell what is in that memory. e.g., is it an application leak (the app not dereferencing when it should), a torch leak (torch may internally keep a handle on something, preventing future gc), or simply python's eager memory allocator not freeing up dereferenced objects?

an example of what can (and does) happen with medvram is that you end up with a split scenario where, for example, the model unet may be in vram but the vae in ram. sdnext does dereference the model, but gc may or may not collect parts that are split between vram and ram, and for me to trace which part is where and move everything to one side just to make life easier on gc is a nightmare. add in multiple loras and possible embeddings and whatnot and it gets really complicated to track things. another example: if a lora is loaded during model unload, is that supposed to be gc'd? you can't trigger anything lora-related at that time since you're not able to interface with it.
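as a rough sketch - not actual sdnext code - the teardown gc needs after such a split looks something like this:

```python
# rough shape of a "clean" unload when parts of a model are split across
# devices: consolidate on one side, drop every python reference, then collect.
import gc
import torch

model = torch.nn.Linear(8, 8)       # stand-in for a pipeline component
model.to("cuda" if torch.cuda.is_available() else "cpu")

model.to("cpu")      # 1. no tensors left straddling ram/vram
del model            # 2. drop every reference (loras, embeddings, closures too)
gc.collect()         # 3. only now can gc actually reclaim it
if torch.cuda.is_available():
    torch.cuda.empty_cache()        # and the cuda cache can release its blocks
```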

and the last one is by far the most common - in the entire python world, the memory allocator is eager and the whole gc routine is best-effort.

for example, by replacing the memory allocator with something like tcmalloc, you get far better behavior out of gc - with zero changes to the code - and that just shows how much it depends on the internal memory allocator.
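you can see the allocator effect with plain python, nothing sdnext-specific - allocate and free a pile of small objects and watch rss, then run the same script with tcmalloc preloaded (e.g. via LD_PRELOAD, exact library path varies per distro) and compare:

```python
# plain-python demo of allocator retention: rss often stays above the starting
# point after the objects are freed, depending on the allocator in use.
import gc
import os
import psutil

proc = psutil.Process(os.getpid())
mb = 1024 ** 2

print(f"start:      {proc.memory_info().rss / mb:.0f} MB")
data = [str(i) for i in range(5_000_000)]   # millions of small objects
print(f"allocated:  {proc.memory_info().rss / mb:.0f} MB")
del data
gc.collect()
print(f"after free: {proc.memory_info().rss / mb:.0f} MB")
```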

and yes, fewer split references means fewer chances of a confused gc. so model offloading of any kind, including medvram (or especially lowvram), would be off the table then. but i'd guess you don't want medvram to be disabled, you want both.

sdnext has a built-in profiler which you can activate and go over each allocated object, but that is a massive task.

all-in-all, contributions are welcome, but unless there is actual proof of an application-caused leak, this will not move forward much.
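if someone does want to dig in without the built-in profiler, python's tracemalloc is a generic starting point - snapshot before and after a batch of generations and diff the allocations. it only sees python-side allocations, not what torch holds natively:

```python
# generic tracemalloc sketch: compare snapshots taken before and after a run
# of generations and print the call sites whose allocations grew the most.
import tracemalloc

tracemalloc.start(25)               # keep up to 25 frames per allocation

before = tracemalloc.take_snapshot()
# ... run a handful of generations here ...
after = tracemalloc.take_snapshot()

for stat in after.compare_to(before, "lineno")[:20]:
    print(stat)
```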

daniandtheweb commented 11 months ago

I understand, thanks for the explanation. I'll keep using it without --medvram for now. Since I'm still learning Python, I'll try to study the repo in more depth; if at any time I find a relevant clue I'll update here.