lllyasviel / stable-diffusion-webui-forge

GNU Affero General Public License v3.0

[Bug]: Forge WebUI taking almost one hour to generate a single image in Macbook Air M1 #744

Open fergardi opened 1 month ago

fergardi commented 1 month ago

What happened?

Txt2img is taking too long on a MacBook Air M1. After a fresh install from the git repository, with a model, a VAE, and some LoRAs added, the generation process takes almost 1 hour per 768x1152 image, with or without upscaling/refining. Is this normal?

Steps to reproduce the problem

On a fresh install of Forge WebUI on a MacBook Air M1, using this configuration:

score_9, score_8_up, score_7_up, score_6_up, OverallDetail, 1girl, solo, (tiefling), very long hair, white hair, bangs, ponytail, braided hair, long pointed ear, black makeup, thin body, (white skin), thigh gap, tail tiefling, black gloves, erotic pose, (sexy red clothing, sexy black stockings), (provocative look), concept art, illustration, realistic, Expressiveh, knva, perfect body, highly detailed, delicate and smooth skin, body in motion
Negative prompt: score_6, score_5, score_4, negativeXL_D, 3d
Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 666, Size: 768x1152, Model hash: 67ab2fd8ec, Model: ponyDiffusionV6XL_v6StartWithThisOne, VAE hash: 2125bad8d3, VAE: xlVAEC_f1.safetensors, Denoising strength: 0.7, Hires upscale: 1.5, Hires steps: 5, Hires upscaler: Latent, Lora hashes: "add-detail-xl: 9c783c8ce46c, Concept Art Twilight Style SDXL_LoRA_Pony Diffusion V6 XL: e5fe96cd307b, Expressive_H: 5671f20a9a6b, Kenva: cfa45d23d34c, sinfully_stylish_SDKL: 076fa4d920a9, xl_more_art-full_v1: fe3b4816be83", TI hashes: "negativeXL_D: fff5d51ab655, negativeXL_D: fff5d51ab655, negativeXL_D: fff5d51ab655, negativeXL_D: fff5d51ab655", Version: f0.0.17v1.8.0rc-latest-276-g29be1da7

Time taken: 47 min. 26.2 sec.

What should have happened?

As far as I understand, even a long generation shouldn't take more than 5-10 minutes. One hour seems excessive to me, but maybe I am missing something important here.

What browsers do you use to access the UI?

Mozilla Firefox

Sysinfo

sysinfo.txt

Console logs

 USERNAME@HOME      stable-diffusion-webui-forge  main  ./webui.sh

################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye), Fedora 34+ and openSUSE Leap 15.4 or newer.
################################################################

################################################################
Running on USERNAME user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
Create and activate python venv
################################################################

################################################################
Launching launch.py...
################################################################
Python 3.10.14 (main, Mar 19 2024, 21:46:16) [Clang 15.0.0 (clang-1500.3.9.4)]
Version: f0.0.17v1.8.0rc-latest-276-g29be1da7
Commit hash: 29be1da7cf2b5dccfc70fbdd33eb35c56a31ffb7
Legacy Preprocessor init warning: Unable to install insightface automatically. Please try run `pip install insightface` manually.
Launching Web UI with arguments: --skip-torch-cuda-test --upcast-sampling --no-half-vae --use-cpu interrogate
Total VRAM 16384 MB, total RAM 16384 MB
Set vram state to: SHARED
Device: mps
VAE dtype: torch.float32
CUDA Stream Activated:  False
Warning: caught exception 'Torch not compiled with CUDA enabled', memory monitor disabled
Using sub quadratic optimization for cross attention, if you have memory or speed issues try using: --attention-split
==============================================================================
You are running torch 2.1.0.
The program is tested to work with torch 2.1.2.
To reinstall the desired version, run with commandline flag --reinstall-torch.
Beware that this will cause a lot of large files to be downloaded, as well as
there are reports of issues with training tab on the latest version.

Use --skip-version-check commandline argument to disable this check.
==============================================================================
ControlNet preprocessor location: /Users/USERNAME/Projects/stable-diffusion-webui-forge/models/ControlNetPreprocessor
Loading weights [67ab2fd8ec] from /Users/USERNAME/Projects/stable-diffusion-webui-forge/models/Stable-diffusion/ponyDiffusionV6XL_v6StartWithThisOne.safetensors
2024-05-16 08:24:35,896 - ControlNet - INFO - ControlNet UI callback registered.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
model_type EPS
UNet ADM Dimension 2816
Startup time: 10.9s (prepare environment: 0.6s, import torch: 3.4s, import gradio: 1.2s, setup paths: 1.3s, other imports: 1.6s, load scripts: 1.3s, create ui: 0.6s, gradio launch: 0.7s).
Using split attention in VAE
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
Using split attention in VAE
extra {'cond_stage_model.clip_g.transformer.text_model.embeddings.position_ids', 'cond_stage_model.clip_l.text_projection', 'cond_stage_model.clip_l.logit_scale'}
Loading VAE weights specified in settings: /Users/USERNAME/Projects/stable-diffusion-webui-forge/models/VAE/xlVAEC_f1.safetensors
To load target model SDXLClipModel
Begin to load 1 model
Moving model(s) has taken 0.01 seconds
Model loaded in 18.9s (load weights from disk: 0.9s, forge load real models: 15.0s, load VAE: 0.5s, calculate empty prompt: 2.4s).
[LORA] Loaded /Users/USERNAME/Projects/stable-diffusion-webui-forge/models/Lora/add-detail-xl.safetensors for SDXL-UNet with 722 keys at weight 1.0 (skipped 0 keys)
[LORA] Loaded /Users/USERNAME/Projects/stable-diffusion-webui-forge/models/Lora/add-detail-xl.safetensors for SDXL-CLIP with 264 keys at weight 1.0 (skipped 0 keys)
[LORA] Loaded /Users/USERNAME/Projects/stable-diffusion-webui-forge/models/Lora/Concept Art Twilight Style SDXL_LoRA_Pony Diffusion V6 XL.safetensors for SDXL-UNet with 722 keys at weight 1.0 (skipped 0 keys)
[LORA] Loaded /Users/USERNAME/Projects/stable-diffusion-webui-forge/models/Lora/Concept Art Twilight Style SDXL_LoRA_Pony Diffusion V6 XL.safetensors for SDXL-CLIP with 264 keys at weight 1.0 (skipped 0 keys)
[LORA] Loaded /Users/USERNAME/Projects/stable-diffusion-webui-forge/models/Lora/Expressive_H-000001.safetensors for SDXL-UNet with 722 keys at weight 0.8 (skipped 0 keys)
[LORA] Loaded /Users/USERNAME/Projects/stable-diffusion-webui-forge/models/Lora/Expressive_H-000001.safetensors for SDXL-CLIP with 264 keys at weight 0.8 (skipped 0 keys)
[LORA] Loaded /Users/USERNAME/Projects/stable-diffusion-webui-forge/models/Lora/Kenva.safetensors for SDXL-UNet with 722 keys at weight 0.7 (skipped 0 keys)
[LORA] Loaded /Users/USERNAME/Projects/stable-diffusion-webui-forge/models/Lora/Kenva.safetensors for SDXL-CLIP with 264 keys at weight 0.7 (skipped 0 keys)
[LORA] Loaded /Users/USERNAME/Projects/stable-diffusion-webui-forge/models/Lora/sinfully_stylish_SDXL.safetensors for SDXL-UNet with 722 keys at weight 1.0 (skipped 0 keys)
[LORA] Loaded /Users/USERNAME/Projects/stable-diffusion-webui-forge/models/Lora/sinfully_stylish_SDXL.safetensors for SDXL-CLIP with 264 keys at weight 1.0 (skipped 0 keys)
[LORA] Loaded /Users/USERNAME/Projects/stable-diffusion-webui-forge/models/Lora/xl_more_art-full_v1.safetensors for SDXL-UNet with 788 keys at weight 0.7 (skipped 0 keys)
To load target model SDXLClipModel
Begin to load 1 model
Reuse 1 loaded models
Moving model(s) has taken 5.69 seconds
To load target model SDXL
Begin to load 1 model
Moving model(s) has taken 63.00 seconds
 40%|█████████████████▌                          | 8/20 [12:13<18:19, 91.65s/it]
To load target model AutoencoderKL              | 8/25 [13:31<29:04, 102.60s/it]
Begin to load 1 model
Moving model(s) has taken 2.36 seconds
Total progress:  32%|████████▋                  | 8/25 [13:43<29:10, 102.97s/it]
Total progress:  32%|████████▋                  | 8/25 [13:43<29:04, 102.60s/it]

Additional information

No response

brainmelts commented 1 month ago

The Apple M1, notably downclocked in the MacBook Air, just isn't an SoC designed for GPU-intensive workloads. Diffusion models aren't just GPU-intensive, they are GPU heavyweights. The M1 is comparable to an Nvidia 970 Mobile in most use cases, and people struggle with generation times even on a 3080 these days.

Imagine your boss tells you that you have 15 minutes to move a 2000-pound car, which isn't yours, to the next parking spot beside it. The car is parked and the handbrake is engaged. You might be able to do it somehow, but it's not gonna happen in 15 minutes.

You should look at renting a GPU. There are a lot of services that even pre-configure SD with A1111 for you at reasonable prices.

Vendaciousness commented 1 month ago

I agree. I love Macs and just bought a new MacBook Pro for making music (the Windows audio subsystem is a nightmare), but SD is too demanding for even the high-end Pro models. It will run CPU-only, and on AMD/Intel GPUs, but since it was designed for Nvidia hardware using CUDA/Torch binaries, running on other hardware is still a compromise. Hopefully that will change.

But now that you know your way around a repo, you can use Runpod to run Stable Forge, and it will not only be faster than any of us local users (use H100s/A100s with 80GB of VRAM!), it will likely be cheaper too: unlike a PC/GPU purchase, you don't have to use it every day to get your money's worth.

skfoo commented 1 month ago

I was thinking of trying this out; I'm glad I decided to check first. At 512x512, an M1 Mac Studio does about 1.5 it/s using the standard webui; an M2 MacBook Air is about 30% slower. For SDXL at 1024x1024, the Mac Studio does 3 s/it. Stable Diffusion is significantly faster using native CoreML instead of PyTorch, but it's not feasible to use that from Python.
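For anyone who wants a comparable s/it number outside the webui, here is a minimal sketch using the diffusers library on the MPS backend (the model ID, prompt, and step count are illustrative assumptions; the webui's own pipeline differs, so treat the result as a ballpark):

    # Rough s/it benchmark on Apple Silicon (assumes: pip install torch diffusers transformers)
    import time
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # illustrative SD 1.5 checkpoint
        torch_dtype=torch.float32,         # fp32 is the safest dtype on MPS
    )
    pipe = pipe.to("mps")                  # Apple Metal backend
    pipe.enable_attention_slicing()        # lowers peak memory on unified-memory Macs

    steps = 20
    start = time.time()
    pipe("a test prompt", num_inference_steps=steps, height=512, width=512)
    print(f"~{(time.time() - start) / steps:.2f} s/it (including VAE decode)")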

zero41120 commented 1 month ago

I want to use Forge because it offers numerous ready-to-go integrations and has plenty of tutorials on YouTube that cover Forge's interface.

However, using a brand new regular WebUI install on my 18GB unified memory M3 Pro MacBook, I get around 3~5 s/it with an SDXL Lightning model, with no other extensions active. On Euler, a 10-step generation takes less than 50s to complete, and the Python process uses about 10-12GB of RAM, as shown in Activity Monitor.

In contrast, with a fresh install of Forge, the same SDXL model runs about 10 times slower, at around 53 s/it, and the Python process consumes 16GB of RAM.

In short, if the regular WebUI operates at an acceptable speed, why can't Forge?

If Forge has no intention of optimizing for Mac and cannot deliver the claimed speed boost there, that should be explicitly stated in the main README.md.

If Forge is not optimized for Mac (and may never be), the README should clearly inform Mac users. This will help users make informed decisions and avoid unnecessary frustration.

Vendaciousness commented 1 month ago

If Forge is not optimized for Mac (and may never be), the README should clearly inform Mac users. This will help users make informed decisions and avoid unnecessary frustration.

AFAIK, the dev makes no mention of 'macOS' anywhere in the repo's docs, let alone any claims regarding Mac hardware. SF probably only supports Mac incidentally, because Automatic1111 does. My guess is he hasn't changed any Mac-related code besides compatibility fixes.

In other words, A1111 will be about as good, if not better/faster. I'd switch to that, same as I did on my PC. No one has worked on Forge in 2 months.

That's the TLDR. If you need more reasons, here:

After all, Apple said this, not these devs: "Apple’s M1 and M2 chips are technological marvels, sporting 8-core CPUs optimized for AI training. These chips are engineered for top-notch performance, significantly accelerating AI and machine learning tasks."

But even if you believed it, wouldn't you want to see just a few benchmarks? Then 5 seconds with Google gives hundreds of results filled with graphs like this: [benchmark chart image]

Yet you're here and frustrated with our dev, because you were led to believe it would be fast for Macs, too? He never made that claim anywhere I'm aware of. Correct me if I'm wrong. He mentions Fooocus is compatible, but that's it.

You and any other Mac users reading this have every right to be angry and frustrated. With Apple. They lied to you. You have no right to blame any of it on @lllyasviel here, though. The blame lies with Apple, or with your decision not to use Windows.

zero41120 commented 1 month ago

Yet you're here and frustrated with our dev, because you were led to believe it would be fast for Macs, too? He never made that claim anywhere I'm aware of. Correct me if I'm wrong. He mentions Fooocus is compatible, but that's it.

None of my comments blame anyone or indicate that I'm exclusively a Mac user. I don't understand why you're reacting so aggressively. In fact, I have 3 Windows laptops, 2 Windows desktops, and only one Mac. I use my Mac because it's much more portable than my 3kg (~7lb) Alienware.

I simply pointed out that, under the same user settings, Forge runs extremely slowly on Mac compared to the base A1111.

My main question remains: if A1111 runs fine, why can't Forge? This suggests that the issue isn't related to hardware or chips, but to software configuration in Forge. It is likely that something could be configured so that Forge runs at the same speed as base A1111, while still having the benefit of the integrated apps/extensions and improved UI.

If Forge is not optimized for Mac (and may never be), the README should clearly inform Mac users. This will help users make informed decisions and avoid unnecessary frustration.

I mentioned that if Forge will never be optimized for Mac, a simple statement to that effect would suffice. Just a sentence or two indicating that Forge runs on Mac but isn't optimized for it would clarify things. This would allow all Mac-related issues/questions to be closed if the maintainers have no intention of addressing them.

You see, this issue was opened last week, and Forge lacks clear communication on this particular subject. Your response feels like Mac concerns are being dismissed with "why don't you use Windows?" rather than addressed with "our optimization targets Windows systems only, and Mac users might not see the speed benefits."

If you are the core developer/maintainer of this repo, please either confirm that no support is coming, act on the subject and write some code, or provide insight on where to make the necessary changes to potentially bring Forge up to speed on Mac. If not, let the actual developers speak for themselves on the subject.

Vendaciousness commented 1 month ago

I'm sorry if you felt attacked. I shouldn't have replied, since I don't have a solution and can't speak for the dev.

Here's all I know and I hope it's of more use to you than my last response:

It's my understanding that this repo is a GPU-focused version of A1111 implementing a lot of new Nvidia features, which come from the developments in CUDA since the release of the more stable version that A1111 uses. By now, A1111 could have incorporated much of it. That was the goal of the repo, I think: to serve as a testbed for A1111.

It's my understanding that most or all of the performance gain comes from implementing new Nvidia code, which is why there probably won't be any Mac-specific enhancements here; it appears that all the enhancements in SF are specific to discrete AMD/Nvidia GPUs.

As for whether or not we'll ever see Mac-optimized code here, who knows? It certainly should happen if he plans to develop for A1111 more. This is the guy who created ControlNet, after all, so perhaps he'll enjoy working with the limitations inherent in integrated on-chip graphics.

I should also mention that there have been no changes to the repo since I first cloned it several months back. The developer is MIA, or I never would have attempted to answer for him. You waited a week, yes, but there are 260+ open issues here in addition to yours. Some are very serious.

I've never seen such an amazing and promising project left to go stale like this before. It's a shame, but as a fellow SD user, I recommend you switch back to A1111. Hopefully, the active team over there can make use of the code and we'll see all the same improvements in time.

My main question remains: If A1111 runs fine, why can't Forge?

I'm clueless as to why Forge has resulted in a major speed hit for you, but have you compared all the settings to see which are set differently? Maybe there are some settings enabled by default in SF (ones that favor Nvidia GPUs) that are causing SF's speed to tank? Just a thought. Honestly, I'd stick with A1111 for now.
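If you want to do that comparison systematically, here is a minimal sketch that diffs the settings files of the two installs (it assumes both A1111 and Forge keep their saved UI settings in a config.json at the repo root; the paths are placeholders for your own checkouts):

    # Diff the saved UI settings of an A1111 install and a Forge install.
    import json
    from pathlib import Path

    a1111 = json.loads(Path("stable-diffusion-webui/config.json").read_text())
    forge = json.loads(Path("stable-diffusion-webui-forge/config.json").read_text())

    # Print every key that differs or exists on only one side.
    for key in sorted(set(a1111) | set(forge)):
        if a1111.get(key) != forge.get(key):
            print(f"{key}: a1111={a1111.get(key)!r} forge={forge.get(key)!r}")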

ExissNA commented 1 month ago

Looking at the console... "Legacy Preprocessor init warning: Unable to install insightface automatically. Please try run pip install insightface manually."

"Using sub quadratic optimization for cross attention, if you have memory or speed issues try using: --attention-split"

"You are running torch 2.1.0. The program is tested to work with torch 2.1.2. To reinstall the desired version, run with commandline flag --reinstall-torch."

Those 3 things stand out to me as potential fixes. Did you try those?
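Concretely, trying all three from the webui folder would look something like this (the flags come straight from the console output above; I haven't verified this on macOS, and the flags can also go into COMMANDLINE_ARGS in webui-user.sh instead):

    # inside the venv that webui.sh created
    pip install insightface

    # relaunch with the suggested cross-attention fallback
    ./webui.sh --attention-split

    # or re-pin torch to the tested 2.1.2
    ./webui.sh --reinstall-torch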

I actually have the opposite issue from yours: Forge runs amazingly and Auto1111 runs like... well, it doesn't. A gen that takes 43 seconds in Forge takes over 11 minutes to complete in A1111. I blame gradio bloat, but what do I know?

Also...

...added a model, a VAE, an some LoRas...

There appears to be a bug where, under certain circumstances with SDXL, adding a LoRA can absolutely trash generation speed. In my case, it's usually the first gen after loading a checkpoint. Given that you're using SDXL, this may apply? One quick and dirty fix I have found is to start a generation, then immediately cancel it. Subsequent generations USUALLY run fine, but that may also be limited to me and my machine setup. Could it be that all those extra LoRAs are really slowing things down? Have you tried starting without LoRAs to see if it runs faster?

I also see it says that it took 63 seconds to move a model, which seems incredibly slow. Seems there is some configuration issue going on there. I know people have been having ControlNet issues with Forge lately, and that may be part of it. I am also in that group, though it isn't really affecting my performance, just what I can actually get done and how I have to do it.

Maybe start with the low-hanging fruit (updating, trying the console suggestions, etc.) and see what that does? I did run on torch 2.1.0 for a long time without issues, but on 2 completely different machines, setups, and architectures. Good luck.

sourit2001 commented 1 month ago

I have the same issue on a MacBook: really slow. I tried updating torch and other packages, but it doesn't seem to help. But I've seen others on YouTube running it well on a MacBook. Not sure how to achieve that.