Open fergardi opened 1 month ago
The Apple M1, notably downclocked in a MacBook Air, just isn't a SoC designed for GPU-intensive workloads, and diffusion models aren't merely GPU-intensive, they're GPU heavyweights. The M1 is comparable to an Nvidia 970 Mobile in most use cases, and people even struggle with generation times on 3080s these days.
Imagine your boss tells you you have 15 minutes to move a 2,000-pound car that isn't yours to the next parking spot. The car is parked with the handbrake engaged. You might manage it somehow, but it's not going to happen in 15 minutes.
You should look at renting a GPU. There are plenty of services that even pre-configure SD with A1111 for you at reasonable prices.
I agree. I love Macs and just bought a new MacBook Pro for making music (Windows audio subsystem is a nightmare), but SD is too demanding for even the high-end Pro models. It will run CPU-only, and on AMD/Intel GPUs, but since it was designed for Nvidia hardware, using CUDA/Torch binaries, using other hardware is still a compromise. Hopefully that will change.
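For reference, a stock A1111 install on Apple Silicon typically launches with a handful of Mac-specific flags. The lines below reflect what I believe its macOS setup script applies; treat them as an assumption and verify against your own `webui-macos-env.sh` / `webui-user.sh`:

```shell
# webui-user.sh — flags A1111's macOS setup typically applies on Apple Silicon
# (assumption: taken from a stock install; check your own webui-macos-env.sh)
export COMMANDLINE_ARGS="--skip-torch-cuda-test --upcast-sampling --no-half-vae --use-cpu interrogate"

# Optional: let PyTorch fall back to CPU for ops the MPS backend doesn't implement yet
export PYTORCH_ENABLE_MPS_FALLBACK=1
```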
But now that you know your way around a repo, you can use Runpod to run Stable Forge, and it will not only be faster than for any of us local users (use H100s/A100s with 80 GB of VRAM!), it will likely be cheaper too: you don't have to use it every day to get your money's worth, as you would with a PC/GPU purchase.
I was thinking of trying this out; I'm glad I decided to check first. At 512x512, an M1 Mac Studio does about 1.5 it/s using the standard webui; an M2 MacBook Air is about 30% slower. For SDXL at 1024x1024, the Mac Studio does 3 s/it. Stable Diffusion is significantly faster using native CoreML instead of PyTorch, but it's not feasible to use that from Python.
I want to use Forge because it offers numerous ready-to-go integrations and has plenty of tutorials on YouTube that cover Forge's interface.
However, using a brand-new regular WebUI install on my 18 GB unified-memory M3 Pro MacBook, I achieved around 3~5 s/it with an SDXL Lightning model and no other extensions active. On Euler, a 10-step generation takes less than 50 s to complete, and the Python process uses about 10-12 GB of RAM, as shown in Activity Monitor.
In contrast, with a fresh install of Forge, the same SDXL model runs about 10 times slower, at 53 s/it, and the Python process consumes 16 GB of RAM.
In short, if the regular WebUI operates at an acceptable speed, why can't Forge?
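A side note on units, since it/s and s/it are easy to mix up when comparing installs: a tiny helper (hypothetical names, plain Python) that normalizes both to seconds per iteration and estimates wall-clock time makes the comparison explicit:

```python
def to_sec_per_it(value: float, unit: str) -> float:
    """Normalize a speed reading to seconds per iteration.

    unit is either 'it/s' (iterations per second) or 's/it'.
    """
    if unit == "it/s":
        return 1.0 / value
    if unit == "s/it":
        return value
    raise ValueError(f"unknown unit: {unit}")

def estimated_time(steps: int, value: float, unit: str) -> float:
    """Rough wall-clock seconds for `steps` sampling steps."""
    return steps * to_sec_per_it(value, unit)

# 10 Euler steps at 5 s/it -> 50 seconds, matching the timing above
print(estimated_time(10, 5.0, "s/it"))   # 50.0
# the same 10 steps at 5 it/s would be only 2 seconds
print(estimated_time(10, 5.0, "it/s"))   # 2.0
```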
If Forge has no intention of optimizing for Mac and cannot deliver the claimed speed boost there, that should be stated explicitly in the main README.md. If Forge is not optimized for Mac (and may never be), the README should clearly inform Mac users. This would help them make informed decisions and avoid unnecessary frustration.
AFAIK, the dev makes no mention of macOS anywhere in the repo's docs, let alone any claims about Mac hardware. SF probably only supports Mac incidentally, because Automatic1111 does. My guess is he hasn't touched the Mac-related code beyond compatibility fixes.
In other words, A1111 will be about as good, if not better/faster. I'd switch to that, same as I did on my PC. No one has worked on Forge in two months.
That's the TLDR. If you need more reasons, here:
After all, Apple said this, not these devs: "Apple’s M1 and M2 chips are technological marvels, sporting 8-core CPUs optimized for AI training. These chips are engineered for top-notch performance, significantly accelerating AI and machine learning tasks."
But even if you believed it, wouldn't you want to see just a few benchmarks first? Five seconds on Google gives hundreds of results filled with graphs like this:
Yet you're here and frustrated with our dev, because you were led to believe it would be fast for Macs, too? He never made that claim anywhere I'm aware of. Correct me if I'm wrong. He mentions Fooocus is compatible, but that's it.
You and any other Mac users reading this have every right to be angry and frustrated. With Apple. They lied to you. You have no right to blame any of it on @Illyasviel here, though. The blame lies with Apple, or your decision not to use Windows.
> Yet you're here and frustrated with our dev, because you were led to believe it would be fast for Macs, too? He never made that claim anywhere I'm aware of. Correct me if I'm wrong. He mentions Fooocus is compatible, but that's it.
None of my comments blame anyone or indicate that I'm exclusively a Mac user. I don't understand why you're reacting so aggressively. In fact, I have 3 Windows laptops, 2 Windows desktops, and only one Mac. I use my Mac because it's much more portable than my 3 kg (~7 lb) Alienware.
I simply pointed out that, under the same user settings, Forge runs extremely slowly on Mac compared to the base A1111.
My main question remains: if A1111 runs fine, why can't Forge? This suggests the issue isn't the hardware or the chip, but something in Forge's software configuration. It is likely that Forge could be configured to run at the same speed as base A1111 while still offering the integrated apps/extensions and improved UI.
> If Forge is not optimized for Mac (and/or may never consider), the README should clearly inform Mac users. This will help users make informed decisions and avoid unnecessary frustration.
I mentioned that if Forge will never be optimized for Mac, a simple statement to that effect would suffice. Just a sentence or two indicating that Forge runs but isn't optimized for Mac would clarify things. This would allow all Mac-related issues/questions to be closed, if maintainers have no intention of addressing them.
You see, this issue was opened last week, and Forge lacks clear communication on this particular subject. Your response feels like concerns for Mac are being dismissed with "why don't you use Windows?" rather than being addressed as "our optimization targets Windows systems only, and Mac users might not be able to enjoy the speed benefits."
If you are the core developer/maintainer of this repo, please either confirm that no support is coming, act on the subject and write some code, or provide insights on where to make the necessary changes to potentially update Forge for Mac support. If not, let the actual developers speak for themselves on the subject.
I'm sorry if you felt attacked. I shouldn't have replied, since I don't have a solution and can't speak for the dev.
Here's all I know and I hope it's of more use to you than my last response:
It's my understanding that this repo is a GPU-focused version of A1111 with a lot of new Nvidia features implemented, coming from developments in CUDA since the release of the more stable version A1111 uses. By now, A1111 could have incorporated much of it. That was the goal of the repo, I think: to act as a testbed for A1111.
It's my understanding that most, if not all, of the performance gains come from new Nvidia-specific code, which is why there probably won't be any Mac-specific enhancements here: the enhancements in SF appear to target discrete AMD/Nvidia GPUs.
As for whether or not we'll ever see Mac-optimized code here, who knows? It certainly should, if he plans to develop for A1111 more. This is the guy who created ControlNet, after all, so perhaps he'll enjoy working with the limitations inherent in integrated on-chip graphics.
I should also mention that there have been no changes to the repo since I first cloned it several months back. The developer is MIA, or I never would have attempted to answer for him. You waited a week, yes. There are 260+ open issues here in addition to yours; some are very serious.
I've never seen such an amazing and promising project left to go stale like this before. It's a shame, but as a fellow SD user, I recommend you switch back to A1111. Hopefully, the active team over there can make use of the code and we'll see all the same improvements in time.
> My main question remains: If A1111 runs fine, why can't Forge?
I'm clueless as to why Forge results in a major speed hit for you, but have you compared all the settings to see which ones differ? Maybe some settings enabled by default in SF (ones that favor Nvidia GPUs) are causing SF's speed to tank. Just a thought. Honestly, I'd stick with A1111 for now.
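One quick way to act on the "compare all the settings" suggestion: both A1111 and Forge keep a `config.json` in their install root, so a small diff script shows which defaults differ. This is a sketch; the two paths below are assumptions about a typical side-by-side install:

```python
import json

def diff_settings(a: dict, b: dict) -> dict:
    """Return {key: (a_value, b_value)} for keys that differ or exist in only one config."""
    out = {}
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key, "<missing>"), b.get(key, "<missing>")
        if va != vb:
            out[key] = (va, vb)
    return out

if __name__ == "__main__":
    # Paths are assumptions -- adjust them to your own installs.
    with open("stable-diffusion-webui/config.json") as f:
        a1111 = json.load(f)
    with open("stable-diffusion-webui-forge/config.json") as f:
        forge = json.load(f)
    for key, (va, vb) in diff_settings(a1111, forge).items():
        print(f"{key}: A1111={va!r}  Forge={vb!r}")
```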
Looking at the console...
"Legacy Preprocessor init warning: Unable to install insightface automatically. Please try run pip install insightface manually."
"Using sub quadratic optimization for cross attention, if you have memory or speed issues try using: --attention-split"
"You are running torch 2.1.0. The program is tested to work with torch 2.1.2. To reinstall the desired version, run with commandline flag --reinstall-torch."
Those 3 things stand out to me as potential fixes. Did you try those?
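For context on that `--attention-split` hint: sub-quadratic / split attention trades a little speed for memory by processing the query in chunks instead of materializing the full attention score matrix at once. A toy NumPy sketch of the idea (not the actual webui implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_full(q, k, v):
    """Naive attention: materializes the full (Nq, Nk) score matrix at once."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def attention_split(q, k, v, chunk=64):
    """Same result, but only `chunk` query rows of scores are alive at a time."""
    out = np.empty((q.shape[0], v.shape[1]))
    for i in range(0, q.shape[0], chunk):
        scores = q[i:i + chunk] @ k.T / np.sqrt(q.shape[-1])
        out[i:i + chunk] = softmax(scores) @ v
    return out

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 256, 32))  # unpacks into three (256, 32) arrays
print(np.allclose(attention_full(q, k, v), attention_split(q, k, v, chunk=50)))  # True
```

The peak memory for the score matrix drops from O(Nq * Nk) to O(chunk * Nk), which is exactly the kind of trade-off that helps on unified-memory machines.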
I actually have the opposite issue as you, Forge runs amazingly and Auto1111 runs like... well, it doesn't. A gen that takes 43 seconds in Forge will take over 11 minutes to complete in A1111. I blame gradio bloat, but what do I know?
Also...
> ...added a model, a VAE, and some LoRAs...
There appears to be a bug where, under certain circumstances with SDXL, adding a LoRA can absolutely trash generation speed. In my case, it's usually the first generation after loading a checkpoint. Given that you're using SDXL, this may apply. One quick-and-dirty fix I've found is to start a generation, then immediately cancel it; subsequent generations USUALLY run fine, but that may be limited to my machine and setup. It could also be that all those extra LoRAs are slowing things down. Have you tried generating without LoRAs first, to see if it runs faster?
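If the "first generation after loading a checkpoint" theory holds, the throwaway warm-up can be automated over the webui's built-in HTTP API (launch with `--api`). The `/sdapi/v1/txt2img` endpoint is the standard A1111-style one, but treat this snippet as a sketch rather than a verified fix:

```python
import json
import urllib.request

def warmup_payload(width: int = 64, height: int = 64) -> dict:
    """A deliberately tiny, 1-step generation, just to prime the loaded checkpoint."""
    return {
        "prompt": "warmup",
        "steps": 1,
        "width": width,
        "height": height,
        "cfg_scale": 1,
    }

def run_warmup(base_url: str = "http://127.0.0.1:7860") -> None:
    """POST a throwaway generation and discard the resulting image."""
    req = urllib.request.Request(
        base_url + "/sdapi/v1/txt2img",
        data=json.dumps(warmup_payload()).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()

if __name__ == "__main__":
    run_warmup()
```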
I also see it says it took 63 seconds to move a model, which seems incredibly slow; there may be a configuration issue there. I know people have been having ControlNet issues with Forge lately, and that may be part of it. I'm in that group too, though it isn't really affecting my performance, just what I can actually get done and how I have to do it.
Maybe start with the low-hanging fruit (updating torch, trying the console suggestions, etc.) and see what that does. I did run torch 2.1.0 for a long time without issues, but on two completely different machines, setups, and architectures. Good luck.
I have the same issue on a MacBook: really slow. I tried updating torch and other packages, but it doesn't seem to help. I've seen others running it well on a MacBook on YouTube, though; not sure how to achieve that.
What happened?
Txt2img is taking too long on a MacBook Air M1. After a fresh install of the git repository, I added a model, a VAE, and some LoRAs; the generation process takes almost 1 hour per 768x1152 image, with or without upscaling/refining. Is this normal?
Steps to reproduce the problem
On a fresh install of Forge WebUI on a MacBook Air M1, using this configuration:
score_9, score_8_up, score_7_up, score_6_up,OverallDetail, 1girl, solo, (tiefling), very long hair, white hair, bangs, ponytail, braided hair, long pointed ear, black makeup, thin body, (white skin), thigh gap, tail tiefling, black gloves, erotic pose, (sexy red clothing, sexy black stockings), (provocative look) ,concept art,illustration,realistic,Expressiveh,knva,perfect body,highly detailed,delicate and smooth skin, body in motion,
Negative prompt: score_6, score_5, score_4, negativeXL_D, 3d
Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 666, Size: 768x1152, Model hash: 67ab2fd8ec, Model: ponyDiffusionV6XL_v6StartWithThisOne, VAE hash: 2125bad8d3, VAE: xlVAEC_f1.safetensors, Denoising strength: 0.7, Hires upscale: 1.5, Hires steps: 5, Hires upscaler: Latent, Lora hashes: "add-detail-xl: 9c783c8ce46c, Concept Art Twilight Style SDXL_LoRA_Pony Diffusion V6 XL: e5fe96cd307b, Expressive_H: 5671f20a9a6b, Kenva: cfa45d23d34c, sinfully_stylish_SDKL: 076fa4d920a9, xl_more_art-full_v1: fe3b4816be83", TI hashes: "negativeXL_D: fff5d51ab655, negativeXL_D: fff5d51ab655, negativeXL_D: fff5d51ab655, negativeXL_D: fff5d51ab655", Version: f0.0.17v1.8.0rc-latest-276-g29be1da7
Time taken: 47 min. 26.2 sec.
What should have happened?
As far as I understand, even a long generation shouldn't take more than 5-10 minutes. One hour seems excessive to me, but maybe I'm missing something important here.
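For a rough sanity check of that expectation, take the ~3 s/it SDXL figure quoted earlier in the thread for an M1 Mac Studio (the Air will be slower, and hires-fix steps run at the upscaled resolution, so the multiplier below is a guess, not a measurement):

```python
base_steps, hires_steps = 20, 5   # from the generation settings above
sec_per_it = 3.0                  # M1 Mac Studio SDXL figure quoted earlier in the thread
hires_cost = 1.5 ** 2             # assumption: hires steps at 1.5x upscale ~ 2.25x the work

estimate = base_steps * sec_per_it + hires_steps * sec_per_it * hires_cost
print(f"~{estimate / 60:.1f} min")  # ~1.6 min -- nowhere near 47 min
```

Even with generous padding for model loading and VAE decode, a 47-minute run points to something pathological (e.g. swapping under memory pressure), not normal sampling speed.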
What browsers do you use to access the UI?
Mozilla Firefox
Sysinfo
sysinfo.txt
Console logs
Additional information
No response