invoke-ai / InvokeAI

Invoke is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry-leading WebUI, and serves as the foundation for multiple commercial products.
https://invoke-ai.github.io/InvokeAI/
Apache License 2.0

[enhancement]: Integrate Apple's CoreML optimizations for SD #1676

Open timdesrochers opened 1 year ago

timdesrochers commented 1 year ago

Is there an existing issue for this?

Contact Details

No response

What should this feature add?

Apple has announced some tooling for SD optimization on M* Macs. If we can get it integrated quickly (or first...), it could be a huge boon to Invoke's reach.

https://machinelearning.apple.com/research/stable-diffusion-coreml-apple-silicon

https://github.com/apple/ml-stable-diffusion

Still reading, but I wanted to raise the issue with this community.
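
From a quick skim, the converted .mlpackage files look like ordinary Core ML models, so the Python-side integration surface is presumably coremltools: load the converted UNet with a chosen compute unit and call predict() inside the denoising loop. A rough, unverified sketch (the path, input names, and shapes are guesses, not necessarily what Apple's conversion script emits):

    import numpy as np
    import coremltools as ct

    # Load the converted UNet and pin it to a compute unit
    # (ALL, CPU_AND_GPU, or CPU_AND_NE).
    unet = ct.models.MLModel(
        "coreml-sd/unet.mlpackage",        # illustrative path
        compute_units=ct.ComputeUnit.ALL,
    )

    # One denoising step: latents, timestep, and text embeddings in,
    # a noise prediction out.
    noise_pred = unet.predict({
        "sample": np.zeros((2, 4, 64, 64), dtype=np.float32),
        "timestep": np.array([1, 1], dtype=np.float32),
        "encoder_hidden_states": np.zeros((2, 768, 1, 77), dtype=np.float32),
    })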

Alternatives

No response

Additional Content

No response

pbakaus commented 1 year ago

just looked a little into this - looks really promising, but some issues as of now:

  • models have to be converted in order to work. Not a big deal, but worth calling out.
  • converted models do not seem to be able to accept any width/height input other than 512x512 (or whatever the model was originally trained on; see https://github.com/apple/ml-stable-diffusion/blob/main/python_coreml_stable_diffusion/pipeline.py#L225). That looks like a blocker until they fix it.
  • The Python generation pipeline currently has to load the model from scratch every time (2-3 mins!) and is unable to cache it. Their FAQ describes it in more detail. There's a Swift pipeline that can avoid it, but not sure that helps here.

I'd love to see this in action here. Hope my digging helps a bit, and hope their repo advances quickly to fix these shortcomings.

ioma8 commented 1 year ago

Please add this!

victorca25 commented 1 year ago

converted models do not seem to be able to accept any width/height input other than 512x512 (or whatever the model was originally trained on; see https://github.com/apple/ml-stable-diffusion/blob/main/python_coreml_stable_diffusion/pipeline.py#L225). That looks like a blocker until they fix it.

I have no experience with CoreML, but at least going by the documentation, this doesn't seem like a major issue; it can even be modified after the model has been converted. The process seems very similar to tracing a model with PyTorch JIT.

The last point (reloading the model on every run) will probably be more tedious, since it kinda nullifies the speed gains. I'll try to understand it a bit better to see how hard it would be to integrate, just out of curiosity.
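
If I understand the coremltools docs right, one way to lift the 512x512 restriction is to declare flexible (enumerated) input shapes when converting the traced model. A minimal sketch with a stand-in module rather than the actual UNet (the shapes, names, and sizes are illustrative, not what Apple's script does today):

    import torch
    import coremltools as ct

    class TinyNet(torch.nn.Module):
        """Stand-in for the traced UNet, just to show the conversion API."""
        def __init__(self):
            super().__init__()
            self.conv = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)

        def forward(self, x):
            return self.conv(x)

    traced = torch.jit.trace(TinyNet().eval(), torch.randn(1, 4, 64, 64))

    # Enumerate the latent shapes the Core ML model should accept
    # (for SD, latent H/W = image H/W divided by 8, so 64x64 -> 512x512).
    shapes = ct.EnumeratedShapes(shapes=[(1, 4, 64, 64),
                                         (1, 4, 64, 96),
                                         (1, 4, 96, 96)])

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="sample", shape=shapes)],
        convert_to="mlprogram",
    )
    mlmodel.save("TinyNetFlexible.mlpackage")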

whosawhatsis commented 1 year ago

I've been doing a lot of testing with Apple's repo. Here are the modes that worked best on my M1 Max with a 24-core GPU and 32 GB of RAM:

Using --attention-implementation ORIGINAL with CPU_AND_GPU, I got up to 2.96 it/s using ~32W and ~8 GB of RAM. This combination should be best for maximum performance on Pro/Max/Ultra.

Using --attention-implementation SPLIT_EINSUM with CPU_AND_NE, I got 1.44 it/s using <5W and <1 GB of RAM. This combination makes a great high-efficiency mode, and should even get similar performance on an iPhone/iPad with an A14 or better.

Using --attention-implementation SPLIT_EINSUM with ALL (CPU, GPU, and Neural Engine), I got 2.46 it/s using ~13W and ~3 GB of RAM, without maxing out my GPU. This combination makes a good balanced mode, and will likely offer the highest performance on M1/M2 chips, and possibly the M1 Pro as well, since it has 2/3 the GPU cores of my M1 Max.

I'll also note that the versions using the Neural Engine took about 4 minutes to load the model initially, though the Swift implementation only had to do this on the first run and loaded it more quickly after that. asitop seemed to indicate that the Neural Engine wasn't running anywhere near its maximum when generating 512x512 images, but the code in Apple's repo doesn't let you change the image size. That, combined with the almost nonexistent memory footprint, makes me think this might work really well for generating larger images with hires_fix, or using large tiles with embiggen.

Ideally, Invoke would support these three modes (power, efficiency, and balanced) and let the user choose among them, depending on which processor they have and how much power they want to use (having my laptop act as a lap warmer in the current weather is kinda nice, but if I were on battery power, I'd definitely want to use the Neural Engine instead of my GPU).
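
Something as simple as a table mapping a user-facing mode to the attention implementation and Core ML compute unit might be enough on the settings side; a rough sketch (hypothetical, not actual Invoke code, and note that ORIGINAL and SPLIT_EINSUM each need their own converted weights):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CoreMLMode:
        attention_implementation: str  # which conversion the mode expects
        compute_unit: str              # maps onto Core ML's compute-unit setting

    # Hypothetical mapping of the three modes described above.
    COREML_MODES = {
        "power":      CoreMLMode("ORIGINAL",     "CPU_AND_GPU"),  # Pro/Max/Ultra
        "balanced":   CoreMLMode("SPLIT_EINSUM", "ALL"),          # base M1/M2
        "efficiency": CoreMLMode("SPLIT_EINSUM", "CPU_AND_NE"),   # lowest power
    }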

spaceexperiment commented 1 year ago

@whosawhatsis

Using --attention-implementation SPLIT_EINSUM with ALL (CPU, GPU, and Neural Engine), I got 2.46 it/s using ~13W and ~3 GB of RAM, without maxing out my GPU. This combination makes a good balanced mode, and will likely offer the highest performance on M1/M2 chips, and possibly the M1 Pro as well, since it has 2/3 the GPU cores of my M1 Max.

I tried this but I am getting 8 GB+ RAM usage. I used this command; am I doing something wrong?

    python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i models/coreml-stable-diffusion-v1-4_split_einsum_packages -o output/ --compute-unit ALL --seed 93

I have coreml-stable-diffusion-v1-4_split_einsum_packages in my models folder.

rovo79 commented 1 year ago

just looked a little into this - looks really promising, but some issues as of now:

  • models have to be converted in order to work. Not a big deal, but worth calling out.
  • converted models do not seem to be able to accept any width/height input other than 512x512 (or whatever the model was originally trained on; see https://github.com/apple/ml-stable-diffusion/blob/main/python_coreml_stable_diffusion/pipeline.py#L225). That looks like a blocker until they fix it.
  • The Python generation pipeline currently has to load the model from scratch every time (2-3 mins!) and is unable to cache it. Their FAQ describes it in more detail. There's a Swift pipeline that can avoid it, but not sure that helps here.

I'd love to see this in action here. Hope my digging helps a bit, and hope their repo advances quickly to fix these shortcomings.

Thanks for the link. That led me to the Hugging Face pipeline. They show a good example there, defining the call signature to allow for flexible height and width: https://github.com/huggingface/diffusers/blob/4125756e88e82370c197fecf28e9f0b4d7eee6c3/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L412

        height: Optional[int] = None,
        width: Optional[int] = None,
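
For context, the pipeline then resolves those None defaults from the UNet's configured sample size, roughly like this (a paraphrase of the linked file; the numbers are the SD 1.x defaults):

    def resolve_size(height=None, width=None, sample_size=64, vae_scale_factor=8):
        """Fall back to the model's native resolution when no size is given."""
        height = height or sample_size * vae_scale_factor  # 64 * 8 = 512
        width = width or sample_size * vae_scale_factor
        return height, width
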
wzxu commented 1 year ago

just looked a little into this - looks really promising, but some issues as of now:

  • models have to be converted in order to work. Not a big deal, but worth calling out.
  • converted models do not seem to be able to accept any width/height input other than 512x512 (or whatever the model was originally trained on; see https://github.com/apple/ml-stable-diffusion/blob/main/python_coreml_stable_diffusion/pipeline.py#L225). That looks like a blocker until they fix it.
  • The Python generation pipeline currently has to load the model from scratch every time (2-3 mins!) and is unable to cache it. Their FAQ describes it in more detail. There's a Swift pipeline that can avoid it, but not sure that helps here.

I'd love to see this in action here. Hope my digging helps a bit, and hope their repo advances quickly to fix these shortcomings.

Both the 2nd and 3rd points only apply to the SPLIT_EINSUM version of converted models (which works with CPU+ANE). For the ORIGINAL version (which works with CPU+GPU), it's possible to change the width/height, and model loading is fast.

Apart from the base M1/M2, where the ANE outperforms the GPU, the ORIGINAL version works better for Pro/Max/Ultra. Some more benchmarks can be found on Apple's project page and on PromptToImage's project page.

Btw, Apple recently updated its project to support even ControlNet, and MochiDiffusion has already added support for it. Yay for competition? Anyway, the future of Stable Diffusion on Apple Silicon looks really promising. Can't wait for SDXL!