huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Memory-efficient attention (without xformers) #1892

Open Birch-san opened 1 year ago

Birch-san commented 1 year ago

I implemented sub-quadratic attention (as described in https://arxiv.org/abs/2112.05682v2):
https://twitter.com/Birchlabs/status/1607503573906063362
https://github.com/Birch-san/diffusers/pull/1
https://github.com/Birch-san/diffusers-play/commit/a573e3d9ea4fdacfdee7ddd5eecdac29b236fc00

is this worth upstreaming? it enables creation of images larger than can be achieved with attention slicing.
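
roughly, the core idea looks like this (a simplified sketch that chunks only the query dimension; the linked PR also chunks keys/values with an online softmax, which is what keeps memory sub-quadratic):

```python
import torch

def query_chunked_attention(q, k, v, query_chunk_size=1024):
    # q, k, v: [batch * heads, seq, head_dim]
    # never materialises the full [q_len, k_len] score matrix at once;
    # peak memory scales with query_chunk_size * k_len rather than q_len * k_len
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for i in range(0, q.shape[-2], query_chunk_size):
        q_chunk = q[..., i:i + query_chunk_size, :]
        scores = q_chunk @ k.transpose(-2, -1) * scale
        out[..., i:i + query_chunk_size, :] = scores.softmax(dim=-1) @ v
    return out
```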

patrickvonplaten commented 1 year ago

Hey @Birch-san,

Thanks a lot for the issue! @patil-suraj what do you think?

patil-suraj commented 1 year ago

Very cool @Birch-san, is this more efficient than xformers? Also, the xformers installation situation is getting better now (cf. https://pypi.org/project/xformers/#history), so I'm not sure if we need another efficient attention implementation for PT. This could be a good addition to Flax.

Birch-san commented 1 year ago

this implements the same paper as xformers memory efficient attention.
it's unlikely to be more efficient than xformers, since they have the advantage of custom CUDA kernels.

xformers is CUDA-only, I presume? no support for MPS or ROCm or Metal or CPU backends?

there are Mac users trying to run stable-diffusion on Mac Minis with 8GB of unified memory. IIRC they couldn't even fit 512x512 images. sliced attention helped with that, but this goes further: you can chunk up attention far finer, arbitrarily so.

if my calculations are correct: a 2048x2048 image's self-attention can require 80GB VRAM ordinarily. setting sliced attention to its most aggressive setting can get this down to 40GB slices (MPS backend refuses to allocate this). but the chunked attention implementation I've provided here can get it down to anything you like, e.g. 80MB chunks.
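
rough arithmetic behind those numbers (assuming an SD-1.x-style UNet: latents at 1/8 of image resolution, 8 heads in the highest-resolution self-attention, fp16 scores):

```python
tokens = (2048 // 8) ** 2        # 65,536 query/key positions in the highest-res block
heads = 8
bytes_per_element = 2            # fp16
score_bytes = heads * tokens ** 2 * bytes_per_element
print(score_bytes / 2 ** 30)     # ≈ 64 GiB for the attention scores alone

# a query chunk of 1024 rows instead needs roughly:
chunk_bytes = heads * 1024 * tokens * bytes_per_element
print(chunk_bytes / 2 ** 30)     # ≈ 1 GiB per chunk in flight
```

the softmax output of the same shape, fp32 upcasts, and CFG doubling the batch all push the unchunked figure toward the ~80GB above.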

Lime-Cakes commented 1 year ago

This memory-efficient attention might not be faster than the xformers implementation, but since it doesn't rely on custom CUDA kernels, it offers better support for non-CUDA devices, which would be good for Mac and other non-CUDA accelerators on PyTorch.

patrickvonplaten commented 1 year ago

@pcuenca could we maybe run some tests for MPS for this?

xformers will soon be natively supported in PyTorch, so I'm wondering how important this is for PyTorch-only use. I definitely see an important use case for MPS though.

Also with the new attention processor system it should be relatively easy to add either way.
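
For illustration, a sketch of how a chunked implementation could hook in via that mechanism (ChunkedAttnProcessor is a hypothetical name, the body ignores attention masks and cross-attention norms, and query_chunked_attention is the helper sketched earlier in this thread):

```python
class ChunkedAttnProcessor:
    # hypothetical opt-in processor; not something diffusers ships
    def __init__(self, query_chunk_size: int = 1024):
        self.query_chunk_size = query_chunk_size

    def __call__(self, attn, hidden_states, encoder_hidden_states=None, attention_mask=None):
        context = hidden_states if encoder_hidden_states is None else encoder_hidden_states
        q = attn.head_to_batch_dim(attn.to_q(hidden_states))
        k = attn.head_to_batch_dim(attn.to_k(context))
        v = attn.head_to_batch_dim(attn.to_v(context))
        out = query_chunked_attention(q, k, v, self.query_chunk_size)
        out = attn.batch_to_head_dim(out)
        out = attn.to_out[0](out)   # output projection
        return attn.to_out[1](out)  # dropout

# opt in on a loaded pipeline, e.g. only when running on mps:
# pipe.unet.set_attn_processor(ChunkedAttnProcessor(query_chunk_size=1024))
```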

patil-suraj commented 1 year ago

Good point regarding MPS @Birch-san ! In that case, it would be cool to have this. cc @pcuenca

keturn commented 1 year ago

There's another implementation here: https://github.com/comfyanonymous/ComfyUI/blob/master/comfy/ldm/modules/sub_quadratic_attention.py#L153

comfy reports:

I tweaked the sub-quadratic optimization when I implemented it in my own UI and it gave me a nice speed boost. but I had to tweak it first cause by default it didn't give better performance than the split optimization on my 6800XT

pcuenca commented 1 year ago

Hi @Birch-san, sorry for being so slow to react here! Is there any chance you could submit a PR that applies these optimizations to mps? That way it would be easier for us to test and discuss. Perhaps you can use the new attention processor mechanism so people can opt in (or make it default maybe) if they are running on mps. If you can't then I'll test your branch :)

Another question I have is: will this become obsolete with the upcoming memory-efficient attention integrated in PyTorch 2.x? Or does that not work for mps at all?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

keturn commented 1 year ago

Still interested, Stalebot!


Birch-san commented 1 year ago

yup, planning to submit PRs for memory-efficient attention and cross-attention masks soon.

patrickvonplaten commented 1 year ago

PyTorch 2.0 will provide memory efficient attention out of the box - is this PR still relevant then?

Birch-san commented 1 year ago

will PyTorch 2.0 provide memory-efficient attention for Mac?

Birch-san commented 1 year ago

@patrickvonplaten @pcuenca

https://pytorch.org/docs/2.0/generated/torch.nn.functional.scaled_dot_product_attention.html

the docs for memory-efficient attention link to xformers, which to my knowledge does not support MPS (it's focused on CUDA and triton).

the PyTorch switch for activating memory-efficient attention is torch.backends.cuda.enable_mem_efficient_sdp() (or the torch.backends.cuda.sdp_kernel() context manager), which again is CUDA-specific.
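
for reference, opting into it on CUDA in PyTorch 2.0 looks roughly like this (a minimal illustration; flash and math backends are disabled only to force the memory-efficient kernel):

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# CUDA-only backend toggle; there is no MPS equivalent
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=False, enable_mem_efficient=True):
    out = F.scaled_dot_product_attention(q, k, v)
```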

so I think there's still a case for implementing this: for Mac, iOS, ROCm, CPU. it also means that if you trace the torchscript operations: you'd get a memory-efficient torchscript model, which you could convert to CoreML†.

† _if you wanted the CoreML model to be optimal for Neural Engine: you'd need to reshape the algorithm a bit; Neural Engine prefers B,C,1,S tensors over B,S,C tensors._

pcuenca commented 1 year ago

@Birch-san Thanks for your interesting thoughts!

The documentation you linked to above also mentions a C++ implementation that I understand should be used when not targeting the cuda backend. I haven't had the chance to test it yet, but I'm planning to and will report back here.

Re: Core ML, if I'm not mistaken Apple's conversion code does the shape transformation when converting. We could explore whether it makes sense to support it in the diffusers codebase.

Birch-san commented 1 year ago

@pcuenca

Re: Core ML, if I'm not mistaken Apple's conversion code does the shape transformation when converting.

regardless of whether you target GPU or ANE: Apple traces their own bespoke UNet2DConditionModel:
https://github.com/apple/ml-stable-diffusion/blob/2c4e9de73c9e723de264356f9563706ea9104212/python_coreml_stable_diffusion/torch2coreml.py#L690-L723

Apple's bespoke UNet2DConditionModel changes every Linear layer into a Conv2D and modifies every LayerNorm, to keep tensors in [B,C,1,S] format. this is regardless of which ATTENTION_IMPLEMENTATION_IN_EFFECT is selected, and regardless of whether you target GPU or ANE.

their model offers an ATTENTION_IMPLEMENTATION_IN_EFFECT parameter, which just toggles whether sliced attention is used (to save memory — at the expense of speed — by serializing attention matmuls on batch dimension). they recommend this mode for memory-constrained devices.

my prediction: if you just want to target GPU, the default diffusers Unet would be faster (because [B,S,C] tensors are preferred by GPU -- they have a batch dimension with which you can do baddbmm() and bmm() matmuls).

also: I notice they don't fuse the * scale into the matmul. maybe CoreML is smart enough to do that for them, but if it's not: they're leaving an 18% speed boost on the table.

IIRC their coremltools implements baddbmm() support by unfusing the operation back into a matmul and a multiply, so I'm not sure whether the answer is as simple as "replace einsum() * scale with baddbmm()".

if CoreML fundamentally lacks support for baddbmm(), or lacks support for automatically fusing multiplies into matmuls: there's a cheeky backdoor you can use to get a fused multiply: by burning the * scale into the projection weights of the model.
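
as a sketch of that backdoor (bake_scale_into_q_proj is a made-up helper name; it assumes a diffusers-style attention module whose to_q is a Linear layer):

```python
import torch

def bake_scale_into_q_proj(attn_module: torch.nn.Module, head_dim: int) -> None:
    # fold the 1/sqrt(head_dim) attention scale into the query projection,
    # so the attention matmul itself no longer needs a separate multiply
    scale = head_dim ** -0.5
    with torch.no_grad():
        attn_module.to_q.weight.mul_(scale)
        if attn_module.to_q.bias is not None:
            attn_module.to_q.bias.mul_(scale)
```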

We could explore whether it makes sense to support it in the diffusers codebase.

supporting two different possible tensor shapes throughout the entire Unet algorithm is hard to do in a clean way.

my recommendation is to un-break diffusers' support for PyTorch's _register_load_state_dict_pre_hook() idiom. this was the low-touch technique Apple originally used to modify BERT in their ANE-optimized transformers whitepaper.
https://github.com/huggingface/diffusers/issues/1880#issuecomment-1369291702

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Birch-san commented 1 year ago

Reopening; we still have a need for pure-PyTorch memory-efficient attention on systems such as Mac. I'm a bit tied up trying to get cross-attention bias (https://github.com/huggingface/diffusers/pull/2634) over the line, but still hoping to get round to upstreaming my memory-efficient attention implementation in the coming weeks.

pcuenca commented 1 year ago

@Birch-san sounds great!

williamberman commented 1 year ago

Just wondering, how hard would it be for us to instead implement this as a native MPS kernel, either in PyTorch or in a separate Apple-maintained library equivalent to the CUDA toolkit (and if so, is it open source?). That would be the ideal way to support this, instead of merging into diffusers and later going through a deprecation cycle once there's official support.

Birch-san commented 1 year ago

hmm I'm not aware of any more Mac-specific library implementation of this available.

as for whether to implement it as a native MPS kernel… well, what it does is relatively simple (it's expressible in pure PyTorch, so one could look up what underlying MPS operations that compiles down to and cram it into a kernel somehow).

MPS/Metal programming isn't my wheelhouse though, and I definitely don't see myself getting the time to learn how to write, then write, one of those.

as for the broader Mac story (CoreML export): if you wanted an access pattern optimized for Neural Engine: you'd probably want to tweak it to use [batch, channels, 1, tokens] tensors, with tokens being contiguous and aligned to 64 bytes.
yet, whilst that's good for Neural Engine: it's probably not optimal for GPU. so you'd kinda want both options.

anyway, regardless of what the answer is for Mac: there's other backends that would benefit from a pure-pytorch implementation of memory-efficient attention. like ROCm, and CPU. and maybe backend-agnostic export formats like TorchScript or ONNX? for example for targeting WebGPU.

even CUDA users could still find this useful, despite already having two IO-aware memory-efficient kernels for attention, because neither of those bespoke kernels supports forward-mode autograd. I understand from the EleutherAI Discord that @crowsonkb saw use-cases for an attention implementation that supports forward-mode autograd, and right now the only way to get that is probably pure-PyTorch. so doing a memory-efficient version of that in pure-pytorch would still be useful.
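
to make that concrete: forward-mode autograd over a plain-math attention is already possible with torch.func.jvp in PyTorch 2.0 (a toy sketch, nothing memory-efficient about it yet):

```python
import torch
from torch.func import jvp

def attention(q, k, v):
    # plain-math attention, differentiable under forward-mode AD
    scale = q.shape[-1] ** -0.5
    weights = (q @ k.transpose(-2, -1) * scale).softmax(dim=-1)
    return weights @ v

q = torch.randn(1, 8, 64, 40)
k, v = torch.randn_like(q), torch.randn_like(q)
tangents = tuple(torch.randn_like(t) for t in (q, k, v))

# JVP through the whole attention computation; the fused flash / mem-efficient
# kernels don't offer this
out, out_tangent = jvp(attention, (q, k, v), tangents)
```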

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

TeutonJon78 commented 11 months ago

This and/or other non-CUDA optimizations would also be helpful for DirectML users since we need every bit of extra VRAM savings.

Beinsezii commented 4 months ago

For non-Nvidia folk I rebased https://github.com/Birch-san/diffusers/tree/subquad_attn to the diffusers master branch over at my own fork https://github.com/Beinsezii/diffusers/tree/attn_subquad

On my 7900 XTX, with the query chunk at 2^12 and the kv chunk at 2^15, I can process 1024 images slightly faster than any other attention method currently working on AMD. Additionally, with the memory savings I successfully ran a > 8 Mpx image using tiled VAE + subquad, where previously I don't think I could even reach half of that.

The ported code is so old it doesn't have masking or a fix for the upcast softmax so it's not a magic bullet. Notably it doesn't work with the 1.5 VAE. It's probably possible to yoink a more updated subquad impl from one of the other diffusion UIs as a hacky fix for those issues, but that's a task for another day.

Beinsezii commented 4 months ago

I updated my fork to a newer subquad impl with fixed upcast and masking, using a modified XFormers attn class since to my untrained eye they seem to function the same. Seems to work on all models now.

The updated attention function was sourced from https://github.com/comfyanonymous/ComfyUI/blob/master/comfy/ldm/modules/sub_quadratic_attention.py. The repo is licensed under GPL 3 while the file header says it's MIT, so I'm not sure which applies; ergo I don't think I should open a PR.

bghira commented 3 months ago

that file is MIT-licensed code; a license can be applied per-file within a GPL-3 project.

tzayuan commented 3 months ago

Hi, @Birch-san, @Beinsezii

I would like to ask: my model was trained using the xformers-based attention operator. Is it feasible to switch the model to a torch-based attention implementation while still using the previously trained weights, i.e. avoiding importing xformers during inference? Thanks.

bghira commented 3 months ago

@Birch-san there is metal-flash-attention but i don't know how to use it. how would we integrate that here?

Birch-san commented 3 months ago

@tzayuan yes, a model trained with xformers-based attention can be modified to use torch sdp attention. no need to import xformers, no need to retrain. one thing to be aware of is that:
xformers expects [batch, seq, heads, head_channels] permutation, whereas
torch sdp expects […batch, seq, head_channels] permutation.
in other words: xformers will do the permute for you, whereas torch expects you to permute the head dimensions to follow the batch dimensions. this will make it a little more fiddly to switch to the torch operation.
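
to make the permute concrete (a toy sketch; the xformers call is left commented out so it doesn't require xformers to be installed):

```python
import torch
import torch.nn.functional as F

batch, seq, heads, head_dim = 2, 77, 8, 64

# layout xformers' memory_efficient_attention consumes: [batch, seq, heads, head_dim]
q = torch.randn(batch, seq, heads, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)
# out_xf = xformers.ops.memory_efficient_attention(q, k, v)

# torch sdp wants heads moved next to batch: [batch, heads, seq, head_dim]
q_t, k_t, v_t = (t.transpose(1, 2) for t in (q, k, v))
out = F.scaled_dot_product_attention(q_t, k_t, v_t)

# transpose back to [batch, seq, heads, head_dim] to match the xformers layout
out = out.transpose(1, 2).contiguous()
```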

@bghira not simple at all. I'm not actually sure whether the PyTorch MPS backend can invoke custom Metal kernels. but assuming that's possible, you'd need to contribute (in the PyTorch source code, probably in C++ and Python) an MPS backend for (the forward pass of) scaled_dot_product_attention. you'd bring philip's sources into the PyTorch source tree (tell the build system to compile and link Mac targets against them), and introduce some kind of adapter from the PyTorch domain (i.e. tensors / MPS memory) to the Metal domain (however that works). then write tests, and get it reviewed & merged by the PyTorch team.

tzayuan commented 3 months ago

Hi @Birch-san, thanks for your suggestion, I have finished it.