Closed Johnson-yue closed 2 months ago
Would you show me how to use this flash-attention in a Stable Diffusion project?
It is being used in the Stable Diffusion app "Draw Things" available for iOS and macOS devices.
If you look at the Swift source files in the "Tests" folder, it shows a reference implementation for opening libMetalFlashAttention.metallib and encoding the commands into a MTLComputeCommandEncoder.
But is there no documentation on how to use it from Python or PyTorch code?
This requires direct access to the Metal API. For an example, look at how Tinygrad uses Metal through PyObjC bindings.
Ideally, you would use Swift instead of Python, paired with a lower-overhead API than PyTorch. Part of this repository's speed comes from removing the MPS dependency and enabling full control over how commands are encoded into a MTLComputeCommandEncoder. In addition, Python cannot run on iOS, where the majority of Apple GPUs are, but Swift can.
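For reference, a minimal sketch of that PyObjC route. The kernel name "attention", the metallib path, the buffer size, and the dispatch dimensions below are illustrative placeholders, not this library's actual interface:

```python
# pip install pyobjc-framework-Metal
import Metal
import Foundation

device = Metal.MTLCreateSystemDefaultDevice()

# Open the precompiled metallib. The path is an assumption; point it at
# your local copy of libMetalFlashAttention.metallib.
url = Foundation.NSURL.fileURLWithPath_("libMetalFlashAttention.metallib")
library, error = device.newLibraryWithURL_error_(url, None)
assert error is None, error

# "attention" is a placeholder; inspect the metallib for the real names.
function = library.newFunctionWithName_("attention")
pipeline, error = device.newComputePipelineStateWithFunction_error_(function, None)
assert error is None, error

queue = device.newCommandQueue()
command_buffer = queue.commandBuffer()
encoder = command_buffer.computeCommandEncoder()
encoder.setComputePipelineState_(pipeline)

# Bind one shared-memory buffer; a real kernel needs several, at specific indices.
buffer = device.newBufferWithLength_options_(4096, Metal.MTLResourceStorageModeShared)
encoder.setBuffer_offset_atIndex_(buffer, 0, 0)

# Dispatch sizes are placeholders; they depend on the kernel's block configuration.
encoder.dispatchThreadgroups_threadsPerThreadgroup_(
    Metal.MTLSize(1, 1, 1), Metal.MTLSize(32, 1, 1))
encoder.endEncoding()
command_buffer.commit()
command_buffer.waitUntilCompleted()
```

PyObjC maps each Objective-C selector colon to an underscore, and out-parameters like NSError** come back as extra tuple results, which is why the calls look the way they do.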
Using it the way "Draw Things" does is what I want to do. But I only know how to run Python code with the PyTorch framework. So, according to what you said, I should open libMetalFlashAttention.metallib and encode the commands into a MTLComputeCommandEncoder in Swift code. Is that right?
Yeah, that's a lot to learn, and it may or may not be right for your situation. There are some major performance drawbacks to choosing Python and PyTorch in general, especially the PyTorch + MPS combo. I might suggest looking at how Tinygrad calls into Metal from Python, as long as you don't ever plan to run this on iOS.
Apple is very restrictive about iOS, preventing not just Python but also JIT compilation, OpenCL, and many things taken for granted on desktop. To use iOS devices, developers have to use Swift, its very antiquated predecessor Objective-C, or some recent C++ bindings. Metal C++ might be something to consider as well.
Converting from Python + PyTorch to Swift + Metal is a huge project. Maybe I can just learn a little bit, or ignore performance as long as it works. Thanks for your suggestion, I will try to learn Tinygrad first.
Can I keep this issue open until someone can do it?
@philipturner I experienced the great results of your project on Draw Things. Does it apply to Apple's official project? https://github.com/apple/ml-stable-diffusion
It makes Draw Things faster than Apple’s ML Stable Diffusion, especially on the more powerful chips. Draw Things is now the fastest SD implementation in existence for Apple platforms.
But as far as I know, Apple's ML Stable Diffusion runs on the Neural Engine, which is more powerful hardware than the GPU for this computation. Does that mean the speedup from the FlashAttention algorithm exceeds the hardware advantage?
The Apple Neural Engine can't always be utilized to its fullest. For example, from MPSGraph, it can only be used at 25% of its advertised FLOPS for generic GEMM. I believe it is a cache bandwidth bottleneck. Only convolutions, which are more information-dense, can achieve 15 TFLOPS FP16 / 30 TFLOPS FP8.
FlashAttention uses the locality and flexibility of local SRAM, plus the nearby availability of transcendental math units, to accelerate attention modules. That closes the gap between the GPU and Neural Engine. On iPhone and M1, the ANE was still slightly faster; on other machines, the ANE was slower. We could also further close the gap using Winograd, which decreases the compute cost of convolutions but was never implemented very effectively on the GPU.
Low-end chips: [benchmark chart]
High-end chips: [benchmark chart]
GPU models also take much less time to compile. CoreML's compiler takes minutes to recompile the model every time you adjust the image width or change one of many parameters. This makes it unsuitable for many common tasks, and iPhone users would end up falling back to the GPU for very involved work. It is critical to get the GPU backend as fast as possible.
If someone wants to look into bridging Python to this Metal code, the new mlx.core.fast.metal_kernel may help.
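Something along these lines, adapted from the MLX custom-kernel documentation (a toy elementwise exp, not this repo's attention kernels; the API is recent and may change between versions):

```python
import mlx.core as mx

# MLX generates the Metal function signature from input_names/output_names;
# only the kernel body is written by hand.
source = """
    uint elem = thread_position_in_grid.x;
    T tmp = inp[elem];
    out[elem] = metal::exp(tmp);
"""

kernel = mx.fast.metal_kernel(
    name="myexp",
    input_names=["inp"],
    output_names=["out"],
    source=source,
)

def exp_elementwise(a: mx.array):
    outputs = kernel(
        inputs=[a],
        template=[("T", mx.float32)],
        grid=(a.size, 1, 1),
        threadgroup=(256, 1, 1),
        output_shapes=[a.shape],
        output_dtypes=[a.dtype],
    )
    return outputs[0]

a = mx.random.normal(shape=(4, 16))
print(mx.allclose(exp_elementwise(a), mx.exp(a)))
```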
I would rather use the PyObjC bindings used by Tinygrad. They provide more low-level control, which is very important for a repo like this: for example, setting the maximum number of threads per threadgroup. That requires access to the MTLComputePipelineDescriptor, as in the sketch below.
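With PyObjC, creating the pipeline through a descriptor looks roughly like this (a sketch continuing from the earlier one, reusing its device and function objects):

```python
import Metal

# Build the pipeline from a descriptor instead of the bare function, so
# properties like maxTotalThreadsPerThreadgroup can be set explicitly.
descriptor = Metal.MTLComputePipelineDescriptor.alloc().init()
descriptor.setComputeFunction_(function)
descriptor.setMaxTotalThreadsPerThreadgroup_(1024)

# The reflection and error out-parameters come back as extra tuple results.
pipeline, reflection, error = device.newComputePipelineStateWithDescriptor_options_reflection_error_(
    descriptor, Metal.MTLPipelineOptionNone, None, None)
assert error is None, error
```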
However, I worry about a CPU-side bottleneck in IRL benchmarks. The Swift compiler mode “-Xswiftc -Ounchecked” compiles down to (vectorized) assembly language for parts that compare GPU results to a reference implementation. Python does not compile down to optimized assembly on the CPU.
Hi, thank you for implementing flash-attention in MPS; now flash-attention can run on a Mac. But there is no documentation on how to use it from Python or PyTorch code. I want to use it to speed up Stable Diffusion model inference time on Mac. As far as I know, running a Stable Diffusion model on an M2 Mac means converting the PyTorch weights to CoreML, and then it can only run; you cannot edit any code. Would you show me how to use this flash-attention in a Stable Diffusion project?