rkfg opened this issue 3 hours ago · Open
That's odd, I haven't noticed such big differences between the attention modes yet.
The grain in the middle does come from the VAE tiling. What you can do is use the Save Latents node to save the sampling results to disk, then load the latents (you have to move the saved latents to Comfy's input folder first) and try decoding with various settings. I haven't found anything optimal yet, but the settings make a big difference to the seams.
You can also keep re-decoding without saving, as long as you don't change anything before the decode node after sampling: just change its settings and re-queue.
I installed SageAttention as they suggest, with `python -m pip install sageattention`, in a Docker container based on `nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04`. Some of the relevant libraries:
accelerate 1.0.1
aiohappyeyeballs 2.4.3
aiohttp 3.10.10
aiosignal 1.3.1
antlr4-python3-runtime 4.9.3
async-timeout 4.0.3
attrs 24.2.0
audioread 3.0.1
bitsandbytes 0.44.1
certifi 2024.8.30
cffi 1.17.1
charset-normalizer 3.4.0
color-matcher 0.5.0
contourpy 1.3.0
cupy-cuda12x 12.3.0
cupy-wheel 12.3.0
cycler 0.12.1
ddt 1.7.2
decorator 5.1.1
diffusers 0.31.0
diskcache 5.6.3
docutils 0.21.2
einops 0.8.0
fastrlock 0.8.2
filelock 3.16.1
flash-attn 2.6.3
...
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.6.77
nvidia-nvtx-cu12 12.1.105
...
sageattention 1.0.3
scikit-image 0.24.0
scikit-learn 1.5.2
scipy 1.14.1
sentencepiece 0.2.0
setuptools 59.6.0
six 1.16.0
soundfile 0.12.1
soxr 0.5.0.post1
spandrel 0.4.0
sympy 1.13.1
threadpoolctl 3.5.0
tifffile 2024.9.20
timm 1.0.11
tokenizers 0.20.1
torch 2.5.0+cu121
torchaudio 2.5.0+cu121
torchmetrics 1.5.1
torchsde 0.2.6
torchvision 0.20.0+cu121
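For reproducibility, the setup described above might look roughly like the following Dockerfile. This is a hypothetical reconstruction, not the actual Dockerfile from my setup; the apt packages and install order are assumptions based on the library list above.

```dockerfile
# Hypothetical reconstruction of the environment described above; the real
# Dockerfile is not shown in this thread.
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip git ninja-build

# Torch builds matching the cu121 base image
RUN python3 -m pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 \
    --index-url https://download.pytorch.org/whl/cu121

RUN python3 -m pip install sageattention==1.0.3

# flash-attn compiles CUDA kernels with ninja; slow and very RAM-hungry.
# MAX_JOBS caps parallel compile jobs to bound peak memory use.
RUN MAX_JOBS=4 python3 -m pip install flash-attn==2.6.3 --no-build-isolation
```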
Regarding the VAE: I tried 480x480x9 tiles, which is the biggest I could do without OOM; it takes all of my 24 GB of VRAM. I experimented a bit with these settings. There's no need to save anything, since only the VAE node is re-run, and unlike in PyramidFlow I see no issues with that here. The results are slightly different visually when I change the settings; the batch size changes nothing, or too little for me to notice. The tile size quickly blows up VRAM, so even 512x512 is too much.
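As a rough sanity check on why 512x512 tips over the limit while 480x480 just fits: decoder activation memory should grow roughly linearly with the tile voxel count. This is a back-of-envelope assumption on my part (the real decoder has fixed overheads too), but the numbers line up:

```python
# Back-of-envelope only: assumes decoder activation memory scales linearly
# with tile voxel count (width * height * frames); real overheads differ.
def relative_tile_memory(w, h, f, base=(480, 480, 9)):
    bw, bh, bf = base
    return (w * h * f) / (bw * bh * bf)

# 512x512x9 has ~14% more voxels than 480x480x9, so if the 480 tile
# already uses ~24 GB, the 512 tile would need ~27 GB and OOM on a
# 24 GB card.
print(round(relative_tile_memory(512, 512, 9), 3))  # 1.138
```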
If you use smaller tiles you can use a larger batch, and the overlap setting is mostly what helps with the seams.
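To illustrate why the overlap setting matters for seams: overlapping tiles can be cross-faded so each output sample is a weighted average of every tile that covers it, instead of a hard cut at the tile border. A minimal 1-D sketch of that idea (my own illustration, not the node's actual implementation, which blends spatial/temporal latent tiles):

```python
import numpy as np

def linear_ramp_weight(tile_len, overlap):
    """Weight profile for one tile: ramps up over the overlap at each end."""
    w = np.ones(tile_len)
    if overlap > 0:
        # strictly inside (0, 1) so edge samples never get zero total weight
        ramp = np.linspace(0.0, 1.0, overlap + 2)[1:-1]
        w[:overlap] = ramp
        w[-overlap:] = ramp[::-1]
    return w

def blend_tiles_1d(tiles, step, overlap, total_len):
    """Accumulate overlapping tiles with ramp weights, then normalize."""
    out = np.zeros(total_len)
    acc = np.zeros(total_len)
    for i, tile in enumerate(tiles):
        start = i * step
        w = linear_ramp_weight(len(tile), overlap)
        out[start:start + len(tile)] += tile * w
        acc[start:start + len(tile)] += w
    return out / acc
```

With zero overlap the weights degenerate to hard cuts, which is where visible seams come from; a larger overlap widens the cross-fade at the cost of decoding more redundant samples.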
First of all, thank you for your continuous effort in bringing the most recent models to CUI! It really helps me use them easily in my workflow.
I tried the different attention options. flash_attn works well, although setting it up is indeed a pain: the released binary didn't work in my Docker setup due to a missing C++ symbol, so I had to build it with pip, which is really slow and takes a lot of RAM during compilation with ninja. But in the end it worked. SDPA works as well; it's a bit slower but needs no extra dependencies. SageAttention is noticeably faster (around 13%), but the result looks broken.
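For context on why the backends should agree: SDPA, FlashAttention, and SageAttention all compute the same softmax attention. Flash fuses the computation, while Sage additionally quantizes Q and K to INT8, which is a plausible source of the broken output if the quantization interacts badly with this model. A toy numpy sketch of the idea (my own emulation for intuition, not Sage's actual kernel, which quantizes per-block with smoothing):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_ref(q, k, v):
    """Reference scaled dot-product attention; shapes (B, H, S, D)."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k.transpose(0, 1, 3, 2)) * scale
    return softmax(scores) @ v

def attention_int8_qk(q, k, v):
    """Toy emulation of INT8-quantized Q/K with a per-tensor scale, the
    kind of approximation Sage-style kernels make; the small per-score
    errors it introduces can compound over many layers and steps."""
    def quant(x):
        s = np.abs(x).max() / 127.0
        return np.round(x / s), s
    qi, qs = quant(q)
    ki, ks = quant(k)
    scale = qs * ks / np.sqrt(q.shape[-1])
    scores = (qi @ ki.transpose(0, 1, 3, 2)) * scale
    return softmax(scores) @ v
```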
Prompt:
A bustling summer day at a vibrant theme park, where the sun shines brightly in a cloudless sky. Families and friends roam the park, holding colorful balloons and enjoying ice cream cones. The roller coaster tracks wind high above, their sleek metal glinting in the sunlight.
Seed is 0; the rest of the parameters are default.

Sage:
https://github.com/user-attachments/assets/366c40d6-f2a4-4225-871d-e28550f93f2c
Flash:
https://github.com/user-attachments/assets/c0ea7b41-4548-48ba-a20b-6bffd6a285b6
There's also some grain that's not present in the official Genmo demo. Is that a tiled VAE issue, or something on my end?