trihardseven opened this issue 5 months ago
I am having similar issues with loading checkpoints. Despite the settings, the checkpoints do not seem to be retained in VRAM and are reloaded on every single batch run. With four checkpoints on a 3060 12GB card, this takes three minutes each time for the initial step, and then three more for the refiner.
The program should retain the checkpoints instead of loading them every single time. Even with one checkpoint on the device at a time and the setting to keep two checkpoints cached, total render time for four 1024x1024, 40-step images is now almost four minutes, double what it was a week ago.
Is there a reason `--always-gpu` is not a satisfactory resolution? If you want the most speed absolutely possible, and have a high-VRAM card to support it, that is the best option to use. Adding something like `--always-gpu-no-checkpoint` is confusing and is just going to slow you down in other ways.
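For illustration, here is a minimal sketch, not Forge's actual code and with all names hypothetical, of the policy `--always-gpu` is described as implementing in this thread: each model is moved to the GPU once and never evicted, which removes per-batch transfers but also means nothing is ever freed.

```python
import torch

_resident = {}  # model name -> module kept permanently on CUDA

def get_model(name, load_fn):
    """Load a model once and keep it on the GPU for the session.

    Fast (no per-batch transfers), but VRAM use only ever grows:
    nothing is evicted, so switching checkpoints repeatedly will OOM.
    """
    if name not in _resident:
        module = load_fn(name)           # e.g. read weights from disk
        if torch.cuda.is_available():
            module = module.to("cuda")   # moved once, never offloaded
        _resident[name] = module
    return _resident[name]
```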
In my case, uninstalling everything and reinstalling seems to have fixed things for now.
Nope... problem is back again. Seems to be one of the extensions, so I will need to kill them one by one.
@catboxanon the problem with `--always-gpu` is that it doesn't unload checkpoints, and I tend to swap them a lot and do XYZ plots with multiple checkpoints for testing. Because it never unloads these full checkpoints, I run out of VRAM even on my PC. If you think that idea is confusing, I think just adding a "keep LoRA networks in VRAM" option is more than enough.
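To make the failure mode concrete, here is a back-of-the-envelope budget (assuming roughly 6.5 GB per fp16 SDXL checkpoint, an estimate rather than a measured figure) showing how an XYZ plot over a few checkpoints overruns even a 24 GB card when nothing is ever unloaded.

```python
# Rough VRAM budget for an XYZ plot over checkpoints under a
# never-unload policy. The 6.5 GB/checkpoint figure is an assumption
# for an fp16 SDXL model, not a measurement.
GB_PER_CHECKPOINT = 6.5
VRAM_GB = 24            # e.g. a 3090
checkpoints_tested = 4

needed = checkpoints_tested * GB_PER_CHECKPOINT
print(f"resident checkpoints alone: {needed:.1f} GB of {VRAM_GB} GB")
# resident checkpoints alone: 26.0 GB of 24 GB -> OOM before activations
```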
Maybe it's related to the upstream DEV version of AUTO1111?
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/14912
I second this; there should be a way to keep the model in VRAM until you switch to another model.
Edit: Don't use any flags, just use those settings, and you'll get what you want.
Edit 2: It won't work for ControlNet models, especially InstantID, because you need to use three models at the same time (the main one + ip-adapter + control_instant_id). You get this annoying "Unload clone" message and I don't know how to get rid of it.
This has no effect on LoRAs though
I have those settings, and the latest version (pulled from AUTO1111 1.8.0 RC) is still slow.
Same issue here. I had these issues in Fooocus and now in Forge, with a 3090 and 24 GB. Checkpoint, LoRA, and ControlNet caching all work perfectly fine in Automatic1111. I tried everything with the params (`--always-gpu`, and the cache options in the config).
I have the same problem on a 3060 (12 GB of memory). It all started after I updated all my installed extensions; LoRAs started slowing down image generation even on 1.7.0. P.S. I seem to have found the extension that doubles generation time for me: https://github.com/KohakuBlueleaf/a1111-sd-webui-lycoris
I'm having the same issue on a 4090. Adding `--always-gpu` dropped the initial "moving model" time from ~3 seconds to ~1 second, but I shouldn't need to reload a model between every image, no? I suspect this is ADetailer loading/unloading.
Checklist
What happened?
Initially the main A1111 was faster for me on a 3090; I found it was an issue with moving models, and using the command-line flag `--always-gpu` made my gens 5 seconds faster. I read that the moving-models issue was fixed, but on the latest Forge version it's still moving LoRAs and is 2+ seconds slower than with the flag on. The "Number of Lora networks to keep cached in memory" setting is active, so I'm guessing it's storing them in RAM and moving them to VRAM. I would just use the flag, but the problem is that when I change full checkpoints it doesn't unload the previous one, forcing me to restart the WebUI after changing checkpoints 2-3 times.
If this isn't just a bug, my suggested solution would be adding a checkbox to this setting that says "Keep LoRA networks in VRAM", or maybe a flag like `--always-gpu-no-checkpoint` that keeps every model except checkpoints in VRAM, for users with high-VRAM cards.
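A rough sketch of what that proposed flag could mean in practice, with hypothetical names and no claim about Forge's internals: small networks stay pinned in VRAM, while the outgoing checkpoint is offloaded whenever the user switches.

```python
import torch

# Hypothetical placement policy for the proposed
# --always-gpu-no-checkpoint behavior. Illustrative only; none of
# these names come from Forge.
PINNED_KINDS = {"lora", "controlnet", "vae", "text_encoder"}

def should_pin(model_kind: str) -> bool:
    """Keep everything except full checkpoints resident in VRAM."""
    return model_kind in PINNED_KINDS

def switch_checkpoint(old_ckpt, new_ckpt):
    # Unlike plain --always-gpu, the outgoing checkpoint is offloaded,
    # so repeated switches don't accumulate in VRAM.
    old_ckpt.to("cpu")
    torch.cuda.empty_cache()
    return new_ckpt.to("cuda")
```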
Steps to reproduce the problem
What should have happened?
It should have kept a number of LoRA networks in VRAM matching the "Number of Lora networks to keep cached in memory" setting.
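For reference, the setting's name suggests standard LRU caching, something like the following sketch (illustrative only; not the webui's actual implementation).

```python
from collections import OrderedDict

class LoraCache:
    """Minimal LRU cache of loaded LoRA weights, sized to match the
    "Number of Lora networks to keep cached in memory" setting.
    Illustrative sketch, not the webui's real code."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._cache = OrderedDict()  # name -> loaded LoRA weights

    def get(self, name, load_fn):
        if name in self._cache:
            self._cache.move_to_end(name)  # mark as most recently used
            return self._cache[name]
        weights = load_fn(name)            # only hit disk on a miss
        self._cache[name] = weights
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict least recently used
        return weights
```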
What browsers do you use to access the UI?
Brave
Sysinfo
sysinfo-.json
Console logs
Additional information
This bug might be happening because I have two GPUs; I'm not sure if everyone is having this issue. The GPU being used is the 3090, though, as you can see in the console.