Closed Johnson-yue closed 2 months ago
Would you show me how to use this flash-attention in a Stable Diffusion project?
It is being used in the Stable Diffusion app "Draw Things" available for iOS and macOS devices.
If you look at the Swift source files in the "Tests" folder, it shows a reference implementation for opening libMetalFlashAttention.metallib and encoding the commands into a MTLComputeCommandEncoder.
But is there no documentation on how to use it from Python or PyTorch code?
This requires direct access to the Metal API. For an example, look at how Tinygrad uses Metal through PyObjC bindings.
Ideally, you would use Swift instead of Python, paired with a lower-overhead API than PyTorch. Part of this repository's speed comes from removing the MPS dependency and enabling full control over how commands are encoded into a MTLComputeCommandEncoder. In addition, Python cannot run on iOS, where the majority of Apple GPUs are, but Swift can.
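For reference, a minimal sketch of that PyObjC route. The kernel name "attention", the metallib path, the buffer size, and the dispatch dimensions below are illustrative placeholders, not this library's actual interface:

```python
# pip install pyobjc-framework-Metal
import Metal
import Foundation

device = Metal.MTLCreateSystemDefaultDevice()

# Open the precompiled metallib. The path is an assumption; point it at
# your local copy of libMetalFlashAttention.metallib.
url = Foundation.NSURL.fileURLWithPath_("libMetalFlashAttention.metallib")
library, error = device.newLibraryWithURL_error_(url, None)
assert error is None, error

# "attention" is a placeholder; inspect the metallib for the real names.
function = library.newFunctionWithName_("attention")
pipeline, error = device.newComputePipelineStateWithFunction_error_(function, None)
assert error is None, error

queue = device.newCommandQueue()
command_buffer = queue.commandBuffer()
encoder = command_buffer.computeCommandEncoder()
encoder.setComputePipelineState_(pipeline)

# Bind one shared-memory buffer; a real kernel needs several, at specific indices.
buffer = device.newBufferWithLength_options_(4096, Metal.MTLResourceStorageModeShared)
encoder.setBuffer_offset_atIndex_(buffer, 0, 0)

# Dispatch sizes are placeholders; they depend on the kernel's block configuration.
encoder.dispatchThreadgroups_threadsPerThreadgroup_(
    Metal.MTLSize(1, 1, 1), Metal.MTLSize(32, 1, 1))
encoder.endEncoding()
command_buffer.commit()
command_buffer.waitUntilCompleted()
```

PyObjC maps each Objective-C selector colon to an underscore, and out-parameters like NSError** come back as extra tuple results, which is why the calls look the way they do.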
Using it the way "Draw Things" does is what I want to do. But I only know how to run Python code with the PyTorch framework. So, according to what you said, I should open libMetalFlashAttention.metallib and encode the commands into a MTLComputeCommandEncoder in Swift code. Is that right?
Yeah, that's a lot to learn, and it may or may not be right for your situation. There are some major performance drawbacks to choosing Python and PyTorch in general, especially the PyTorch + MPS combo. I might suggest looking at how Tinygrad calls into Metal from Python, as long as you don't ever plan to run this on iOS.
Apple is very restrictive about iOS, preventing not just Python but also JIT compilation, OpenCL, and many things taken for granted on desktop. To use iOS devices, developers have to use Swift, its very antiquated predecessor Objective-C, or some recent C++ bindings. Metal C++ might be something to consider as well.
Converting from Python + PyTorch to Swift + Metal is a huge project. Maybe I can just learn a little bit, or ignore performance as long as it works. Thanks for your suggestion, I will try to learn Tinygrad first.
Can I keep this issue open until someone can do it?
@philipturner I experienced the great results of your project on Draw Things. Does it apply to Apple's official project? https://github.com/apple/ml-stable-diffusion
It makes Draw Things faster than Apple’s ML Stable Diffusion, especially on the more powerful chips. Draw Things is now the fastest SD implementation in existence for Apple platforms.
But as far as I know, Apple's ML Stable Diffusion runs on the Neural Engine, which is more powerful hardware than the GPU for this computation. Does that mean the speedup from the FlashAttention algorithm exceeds the hardware advantage?
The Apple Neural Engine can't always be utilized to its fullest. For example, from MPSGraph, it can only be used at 25% of its advertised FLOPS for generic GEMM. I believe it is a cache bandwidth bottleneck. Only convolutions, which are more information-dense, can achieve 15 TFLOPS FP16 / 30 TFLOPS FP8.
FlashAttention uses the locality and flexibility of local SRAM, plus the nearby availability of transcendental math units, to accelerate attention modules. That closes the gap between the GPU and Neural Engine. On iPhone and M1, the ANE was still slightly faster; on other machines, the ANE was slower. We could also further close the gap using Winograd, which decreases the compute cost of convolutions but was never implemented very effectively on the GPU.
Low-end chips: [benchmark chart]
High-end chips: [benchmark chart]
GPU models also take much less time to compile. CoreML's compiler takes minutes to recompile the model every time you adjust the image width or change one of many parameters. This makes it unsuitable for many common tasks, and iPhone users would end up falling back to the GPU for very involved work. It is critical to get the GPU backend as fast as possible.
If someone wants to look into bridging Python to this Metal code, the new mlx.core.fast.metal_kernel may help.
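Something along these lines, adapted from the MLX custom-kernel documentation (a toy elementwise exp, not this repo's attention kernels; the API is recent and may change between versions):

```python
import mlx.core as mx

# MLX generates the Metal function signature from input_names/output_names;
# only the kernel body is written by hand.
source = """
    uint elem = thread_position_in_grid.x;
    T tmp = inp[elem];
    out[elem] = metal::exp(tmp);
"""

kernel = mx.fast.metal_kernel(
    name="myexp",
    input_names=["inp"],
    output_names=["out"],
    source=source,
)

def exp_elementwise(a: mx.array):
    outputs = kernel(
        inputs=[a],
        template=[("T", mx.float32)],
        grid=(a.size, 1, 1),
        threadgroup=(256, 1, 1),
        output_shapes=[a.shape],
        output_dtypes=[a.dtype],
    )
    return outputs[0]

a = mx.random.normal(shape=(4, 16))
print(mx.allclose(exp_elementwise(a), mx.exp(a)))
```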
I would rather use the PyObjC bindings used by Tinygrad. They provide more low-level control, which is very important for a repo like this: for example, setting the maximum number of threads per threadgroup. That requires access to the MTLComputePipelineDescriptor, as in the sketch below.
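With PyObjC, creating the pipeline through a descriptor looks roughly like this (a sketch continuing from the earlier one, reusing its device and function objects):

```python
import Metal

# Build the pipeline from a descriptor instead of the bare function, so
# properties like maxTotalThreadsPerThreadgroup can be set explicitly.
descriptor = Metal.MTLComputePipelineDescriptor.alloc().init()
descriptor.setComputeFunction_(function)
descriptor.setMaxTotalThreadsPerThreadgroup_(1024)

# The reflection and error out-parameters come back as extra tuple results.
pipeline, reflection, error = device.newComputePipelineStateWithDescriptor_options_reflection_error_(
    descriptor, Metal.MTLPipelineOptionNone, None, None)
assert error is None, error
```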
However, I worry about a CPU-side bottleneck in IRL benchmarks. The Swift compiler mode “-Xswiftc -Ounchecked” compiles down to (vectorized) assembly language for parts that compare GPU results to a reference implementation. Python does not compile down to optimized assembly on the CPU.
Hi, thank you for implementing flash-attention in MPS; now flash-attention can run on a Mac. But there is no documentation on how to use it from Python or PyTorch code. I want to use it to speed up Stable Diffusion model inference time on Mac. As far as I know, running a Stable Diffusion model on an M2 Mac means converting the PyTorch weights to CoreML, and then it can only run; you cannot edit any code. Would you show me how to use this flash-attention in a Stable Diffusion project?