thu-ml / SageAttention

Quantized Attention that achieves speedups of 2.1x and 2.7x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
BSD 3-Clause "New" or "Revised" License

SageAttention on ComfyUI #11

Open blepping opened 3 days ago

blepping commented 3 days ago

i made a very simple ComfyUI node to replace the attention implementation with SageAttention: https://gist.github.com/blepping/fbb92a23bc9697976cc0555a0af3d9af
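for reference, the core of the swap is small. a minimal sketch (not the actual gist), assuming the `sageattn(q, k, v, is_causal=..., smooth_k=...)` signature from the README and the same `(batch, heads, seq_len, head_dim)` layout that PyTorch SDPA uses:

```python
from sageattention import sageattn

def sage_sdpa(q, k, v, is_causal=False):
    # drop-in stand-in for torch.nn.functional.scaled_dot_product_attention
    # (no attention mask or dropout handling in this sketch)
    return sageattn(q, k, v, is_causal=is_causal, smooth_k=True)

# the node then points the model's attention entry point at this function, e.g.
# attention_module.attention_fn = sage_sdpa  # hypothetical hook, not ComfyUI's real name
```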

seems like a decent performance improvement on SDXL. SageAttention seems to fail when k/v aren't the same shape as q (on attn2, which i believe is cross-attention).

For SD15, none of the head sizes are currently supported, so the node doesn't do anything there. not sure if you are interested in supporting SD15 (or SDXL cross-attention). if any more information would be helpful, please let me know.
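if support doesn't land in the kernels, the node could also just guard and fall back. a rough sketch under the same assumptions as above (the supported head-dim set and SD15 head sizes here are illustrative, not taken from the kernels):

```python
import torch.nn.functional as F
from sageattention import sageattn

SUPPORTED_HEAD_DIMS = {64, 96, 128}  # illustrative; check the SageAttention docs for the real list

def sage_or_fallback(q, k, v, is_causal=False):
    head_dim = q.shape[-1]
    cross_attention = q.shape[-2] != k.shape[-2]  # k/v sequence length differs from q
    if head_dim not in SUPPORTED_HEAD_DIMS or cross_attention:
        # e.g. SD15 head sizes, or attn2 where k/v come from the text encoder
        return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)
    return sageattn(q, k, v, is_causal=is_causal, smooth_k=True)
```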

you can close this issue, just thought i would post this in case anyone wanted to try it with ComfyUI.

note: it's not a normal model patch, so to enable or disable it, make sure the node actually runs. simply bypassing or removing it won't work correctly.
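to illustrate why: the override is applied globally when the node's function executes, and the model is passed through untouched, so bypassing the node just means nothing runs and the last applied state sticks around. hypothetical skeleton (heavily simplified, not the actual gist):

```python
class SageAttentionPatch:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {
            "model": ("MODEL",),
            "enabled": ("BOOLEAN", {"default": True}),
        }}

    RETURN_TYPES = ("MODEL",)
    FUNCTION = "patch"
    CATEGORY = "hacks"

    def patch(self, model, enabled):
        # set_sage_attention is a hypothetical helper that swaps the global
        # attention function; since the model object itself isn't modified,
        # bypassing this node later won't restore the original attention.
        set_sage_attention(enabled)
        return (model,)

NODE_CLASS_MAPPINGS = {"SageAttentionPatch": SageAttentionPatch}
```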

wardensc2 commented 1 day ago

Hi @blepping

I installed it like you said and got the node working, but so far the speed is the same. I tested both SDXL and Flux at 1024x1024, and I'm not sure whether the node is working or not because the speed is unchanged. I get this notice when the image finishes generating: [image]

Can you give me some example JSON files to check whether this node is working or not?

Thank you

blepping commented 1 day ago

@wardensc2 thanks for giving it a try. i don't think there's really a way to do it wrong in the workflow: [image]

attention improvements seem to make the most difference on large images. i didn't test with Flux (not sure if it uses the same kind of attention or has compatible sizes). for my tests with SDXL at 4096x4096 on a 4060Ti, i got 8.94s/it with PyTorch attention and 6.71s/it with SageAttention, which works out to about a 25% reduction in time per step (roughly a 1.33x speedup). the difference might not be big enough to notice at small resolutions like 1024x1024. (i think i might have been testing with smooth_k disabled - it didn't seem necessary with SDXL and should be a bit faster.)
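if anyone wants to sanity-check whether the kernel itself is faster on their GPU before wiring it into a workflow, a quick microbenchmark like this works. the shapes are my rough guess at SDXL's 64x64-latent self-attention at 1024x1024, and the sageattn signature is assumed from the README:

```python
import torch
import torch.nn.functional as F
from sageattention import sageattn

def bench(fn, q, k, v, iters=50):
    for _ in range(5):  # warmup
        fn(q, k, v)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(q, k, v)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

# assumed shapes: 4096 tokens, 10 heads of dim 64, fp16; adjust for your model
q = torch.randn(2, 10, 4096, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

print("pytorch sdpa:", bench(F.scaled_dot_product_attention, q, k, v), "ms/call")
print("sageattn:", bench(lambda a, b, c: sageattn(a, b, c, smooth_k=True), q, k, v), "ms/call")
```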