Closed Reledia closed 1 year ago
To add: the jump to >20GB seems to happen after switching models or loras (even with neither set to be cached).
take a look at #1357 and then tell me how you'd like to proceed.
Assuming I know nothing about the webui architecture, I cannot find my situation in the quoted post. Even setting aside the >20GB jump at model switch (already mentioned because of pymalloc), ~15GB of ram is reached after only ~5 images produced with the same parameters. We are not talking about small leaks, but a jump to almost twice the usage at program start (which, if I understand correctly, should already have the model and vae in memory). As for a possible solution, I don't know what to suggest or even where to start given my ignorance of the program's internals.
yes, those are huge jumps, i just wanted to make sure you've read through the basics first. i cannot reproduce with a basic workflow - generating basic txt2img at 512x512 without any scripts/extensions/hires/upscale/etc. so let's start there and introduce other operations one at a time until we know which operation causes this.
Safe mode (no loras/scripts/hires/tiled vae or any other script enabled):
Before 1st - ram: free:26.95 used:4.32 total:31.27
After 1st - ram: free:23.09 used:8.18 total:31.27
After 2nd - ram: free:22.74 used:8.53 total:31.27
After 3rd - ram: free:22.74 used:8.53 total:31.27
After 4th - ram: free:21.56 used:9.71 total:31.27
After 5th - ram: free:19.95 used:11.32 total:31.27
After 6th - ram: free:20.07 used:11.2 total:31.27
After 7th - ram: free:20.07 used:11.2 total:31.27
After 8th - ram: free:18.58 used:12.69 total:31.27
After 9th - ram: free:18.57 used:12.7 total:31.27
After 10th - ram: free:16.74 used:14.53 total:31.27
Info: Steps: 25 | Sampler: UniPC | CFG scale: 8 | Seed: 441276843 | Size: 512x512 | Model hash: 77b7dc4ef0 | Model: meinamix_meinaV10 | VAE: ClearVAE_V2.3 | Clip skip: 2 | Version: baecfb7 | Token merging ratio: 0.4 | Token merging ratio hr: 0.6 | Token merging random: True | Token merging merge cross attention: True | Token merging merge mlp: True | Parser: Full parser | UniPC variant: bh2
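For reference, the free/used/total numbers in these logs can be reproduced with a small stdlib helper (a sketch; `ram_gb` is a hypothetical name, the `SC_*` sysconf keys are Linux-specific, and psutil would be the portable alternative):

```python
import os

def ram_gb():
    # Linux-only: derive free/used/total system RAM in GB from sysconf
    # page counts; psutil.virtual_memory() is the portable equivalent.
    page = os.sysconf("SC_PAGE_SIZE")
    total = page * os.sysconf("SC_PHYS_PAGES") / 1024**3
    free = page * os.sysconf("SC_AVPHYS_PAGES") / 1024**3
    return round(free, 2), round(total - free, 2), round(total, 2)

free, used, total = ram_gb()
print(f"ram: free:{free} used:{used} total:{total}")
```

Calling this before the first generation and after each one gives log lines in the same format as above.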
Huh, weird, I thought --safe mode would have disabled that, my bad. I am manually requesting generations one by one.
safe mode disables user installed extensions, not much else right now. so does the problem happen without tome and can you reproduce using batch size > 1?
Safe mode + no tome and batch size 3:
Before 1st - ram: free:26.93 used:4.34 total:31.27
After 1st - ram: free:22.92 used:8.35 total:31.27
After 2nd - ram: free:21.3 used:9.96 total:31.27
After 3rd - ram: free:19.72 used:11.55 total:31.27
After 4th - ram: free:18.17 used:13.1 total:31.27
After 5th - ram: free:19.59 used:11.68 total:31.27
After 6th - ram: free:19.36 used:11.91 total:31.27
After 7th - ram: free:18.3 used:12.97 total:31.27
After 8th - ram: free:18.3 used:12.97 total:31.27
After 9th - ram: free:18.3 used:12.97 total:31.27
After 10th - ram: free:16.81 used:14.46 total:31.27
Info: Steps: 15 | Sampler: UniPC | CFG scale: 4 | Seed: -1 | Size: 512x512 | Model hash: 77b7dc4ef0 | Model: meinamix_meinaV10 | VAE: ClearVAE_V2.3 | Clip skip: 2 | Version: baecfb7 | Token merging ratio: 0 | Token merging ratio hr: 0 | Token merging random: True | Token merging merge attention: False | Parser: Full parser | UniPC variant: bh2
i just tried running 40 generate requests sequentially, cannot get memory to move at all, utilization is flat. are you running with any kind of cmdline flags (such as medvram or lowvram)? purely for testing purposes, can you delete config.json so the server recreates the default config? you can back it up and copy it back later if you wish.
1) Yes, my bad. I use --medvram
2)
Medvram + No config.json + safe mode + no scripts/lora + batch size 1:
Before 1st - ram: free:26.91 used:4.36 total:31.27
After 1st - ram: free:22.96 used:8.3 total:31.27
After 2nd - ram: free:22.67 used:8.6 total:31.27
After 3rd - ram: free:21.19 used:10.08 total:31.27
After 4th - ram: free:21.34 used:9.93 total:31.27
After 5th - ram: free:21.34 used:9.93 total:31.27
After 6th - ram: free:21.02 used:10.25 total:31.27
After 7th - ram: free:21.34 used:9.93 total:31.27
After 8th - ram: free:21.32 used:9.94 total:31.27
After 9th - ram: free:21.08 used:10.19 total:31.27
After 10th - ram: free:21.34 used:9.93 total:31.27
It stays at around 11GB max even with a higher base resolution. With batch size 10 it keeps going up by ~0.5GB each time.
Info: Steps: 15 | Sampler: UniPC | CFG scale: 4 | Seed: -1 | Size: 512x512 | Model hash: 590ae0d0b9 | Model: absolutereality_v10-inpainting | Conditional mask weight: 1.0 | Clip skip: 2 | Version: baecfb7 | Parser: Full parser
it has to be the model moving from gpu to cpu and back. can you try without medvram?
No medvram/lowvram + No config.json + safe mode + no scripts/lora + batch size 1:
Before 1st - ram: free:27.82 used:3.45 total:31.27
After 1st - ram: free:25.44 used:5.83 total:31.27
After 2nd - ram: free:25.6 used:5.67 total:31.27
After 3rd - ram: free:25.59 used:5.68 total:31.27
After 4th - ram: free:25.59 used:5.68 total:31.27
After 5th - ram: free:25.59 used:5.68 total:31.27
After 6th - ram: free:25.58 used:5.69 total:31.27
After 7th - ram: free:25.58 used:5.69 total:31.27
It stays this way without medvram even with batch size > 1. Thanks for your help :)
glad at least now you have a working environment
i'm going to keep this issue open until i can figure out what's going on with medvram
m2 pro, 32gb ram here; with medvram enabled python eats between 6.3 and 6.5gb ram with aZovya's model and the Euler a sampler, regardless of the number of image generations, seeds and prompts.
When I change to a different model within the webui, python eats a few hundred megs more even for a smaller model, but it doesn't grow over time. If anything it keeps going up and down, so I assume there's some garbage collection in place.
Also, I don't really see any difference between medvram enabled and disabled apart from "Cached Files" (in the Activity Monitor app) being almost double in size with medvram disabled.
However, I run Torch 2.1.0.dev20230608, so perhaps any existing memory leaks have been fixed in that version.
I also ran into this. Removing medvram and running with safe also fixed the issue. I can try to help reproduce if there's interest; it would be nice to generate at higher resolutions with this fork.
the whole point of medvram/lowvram is to move parts of the model from vram to ram and back to save on vram. unless i'm misunderstanding something, m1/m2 do not have dedicated vram, so it's pretty much pointless there.
I'm on AMD hardware and experiencing this.
i cannot reproduce this issue with --medvram, backend=original and batch=4x4 - my ram usage remains stable on both linux and windows.
can someone update if the issue is still present?
closing as no updates, issue can be reopened if update is provided.
On my amd system on linux I keep hitting this leak. After a certain number of generations the webui occupies all the ram on my pc (32 GB) and crashes the whole system. I'm using --medvram and sub-quadratic optimization. All the models run in fp32 mode using --no-half and --no-half-vae.
I've been checking the code and found that the models are indeed supposed to be cached in ram indefinitely (or at least that's what I seem to understand). However, running a memory profiler I can confirm that there's some sort of memory leak, or at least ram usage that I'm unable to explain. I'm attaching a plot of my ram usage:
I did more than 10 generations, all with the same prompt, and as you can see the ram usage slowly keeps increasing.
When doing generations without the --medvram flag the ram usage stays flat:
If there's any more specific debugging I can do please tell me.
unfortunately, those graphs are not that useful as there is no way to tell what is in that memory: e.g., is it an application leak (the app not dereferencing when it should), a torch leak (torch may internally keep a handle on something, preventing future gc), or simply python's eager memory allocator not freeing up dereferenced objects.
an example of what can (and does) happen with medvram is that you end up with a split scenario where (for example) the model unet may be in vram but the vae in ram. sdnext does dereference the model, but gc may or may not collect parts that are split between vram and ram, and for me to trace where each part lives just to move them all to one side and make life easier on gc is a nightmare. add in multiple loras and possibly embeddings and it gets really complicated to track things. another example: if a lora is loaded during model unload, is that supposed to be gc'd? you can't trigger anything lora-related at that time since you're not able to interface with it.
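the "app not dereferencing when it should" case can be demonstrated in isolation with stdlib weakref - a toy sketch, no torch involved (`Tensorish` and `cache` are made-up names standing in for a model part and a stray cache):

```python
import gc
import weakref

class Tensorish:
    # toy stand-in for a large model tensor (hypothetical class)
    def __init__(self):
        self.data = bytearray(10_000_000)

cache = []                 # simulates a stray module-level cache
obj = Tensorish()
cache.append(obj)
ref = weakref.ref(obj)     # weak reference lets us observe collection

del obj
gc.collect()
print(ref() is not None)   # True: cache still holds it -> looks like a leak

cache.clear()
gc.collect()
print(ref() is None)       # True: last reference gone, object freed
```

the point being: as long as any cache, closure or split ram/vram handle keeps a reference, gc is doing its job correctly and the "leak" is really a lingering reference.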
and last is by far the most common - in the entire python world, the memory allocator is eager and the entire gc routine is best-effort.
for example, by replacing the memory allocator with something like tcmalloc, you get far better behavior out of gc - with zero changes to the code - and that just shows how much depends on the internal memory allocator.
and yes, fewer chances of split references means fewer chances of confused gc. so model offloading of any kind, or medvram (and especially lowvram), would be off the table then. but i'd guess you don't want medvram disabled - you want both.
sdnext has a built-in profiler which you can activate and go over each allocated object, but that is a massive task.
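a lighter-weight first pass than a full profiler run is stdlib tracemalloc, which can at least prove whether the growth is python-side allocations (it will not see torch's native buffers, so a flat tracemalloc diff alongside growing rss points at torch or the allocator instead). a sketch, with a bytearray list standing in for one generation's worth of work:

```python
import tracemalloc

tracemalloc.start(10)              # record up to 10 frames per allocation
before = tracemalloc.take_snapshot()

# stand-in for "run one generation"; replace with an actual generate call
leak = [bytearray(1_000_000) for _ in range(5)]

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:3]:
    print(stat)                    # top source lines by allocated-size growth
```

if the top entries keep growing across generations, that is the kind of actionable evidence of an application-side leak mentioned below.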
all-in-all, contributions are welcome, but unless there is actual proof of an application-caused leak this will not move forward much.
I understand, thanks for the explanation. I'll keep using it without --medvram for now. Since I'm still learning python I'll try to study the repo better; if at any time I find a relevant clue I'll update here.
Issue Description
Issue: the program starts using around 5GB of ram, with zero models or vae cached. After the first generation it goes up to 11GB, and then it rises roughly linearly (not always, as in the testing below) by ~1GB after each image. Models used are MeinaPastel_v5, ClearVAE V2.3 and Latent upscale; samplers are UniPC and 2M+++ SDE Karras (the only ones I tried).
Safe mode:
Normal mode (ADetailer used):
In this testing session it seems to stop at ~15GB, but it usually goes up to 20GB and more.
Version Platform Description
version: https://github.com/vladmandic/automatic/commit/baecfb7a13bbb3c8b63fa8d60ad829c1d9ece513