rkfg opened this issue 3 hours ago · Open
That's odd, I haven't noticed such big differences between the attention modes yet.
The grain in the middle does come from the VAE tiling. What you can do is use the Save Latents node to save the sampling results to disk, then load the latents (you have to move the saved latents to Comfy's input folder first) and try decoding with various settings. I haven't found anything optimal yet, but the settings make a big difference to the seams.
You can also keep re-decoding without saving, as long as you don't change anything before the decode node after sampling: just change its settings and re-queue.
I installed SageAttention as they suggest, with `python -m pip install sageattention`, in a Docker container based on `nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04`. Some of the relevant libraries:
accelerate 1.0.1
aiohappyeyeballs 2.4.3
aiohttp 3.10.10
aiosignal 1.3.1
antlr4-python3-runtime 4.9.3
async-timeout 4.0.3
attrs 24.2.0
audioread 3.0.1
bitsandbytes 0.44.1
certifi 2024.8.30
cffi 1.17.1
charset-normalizer 3.4.0
color-matcher 0.5.0
contourpy 1.3.0
cupy-cuda12x 12.3.0
cupy-wheel 12.3.0
cycler 0.12.1
ddt 1.7.2
decorator 5.1.1
diffusers 0.31.0
diskcache 5.6.3
docutils 0.21.2
einops 0.8.0
fastrlock 0.8.2
filelock 3.16.1
flash-attn 2.6.3
...
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.6.77
nvidia-nvtx-cu12 12.1.105
...
sageattention 1.0.3
scikit-image 0.24.0
scikit-learn 1.5.2
scipy 1.14.1
sentencepiece 0.2.0
setuptools 59.6.0
six 1.16.0
soundfile 0.12.1
soxr 0.5.0.post1
spandrel 0.4.0
sympy 1.13.1
threadpoolctl 3.5.0
tifffile 2024.9.20
timm 1.0.11
tokenizers 0.20.1
torch 2.5.0+cu121
torchaudio 2.5.0+cu121
torchmetrics 1.5.1
torchsde 0.2.6
torchvision 0.20.0+cu121
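For reproducibility, the setup described above might look roughly like the following Dockerfile. This is a hypothetical reconstruction, not the actual Dockerfile from my setup; the apt packages and install order are assumptions based on the library list above.

```dockerfile
# Hypothetical reconstruction of the environment described above; the real
# Dockerfile is not shown in this thread.
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip git ninja-build

# Torch builds matching the cu121 base image
RUN python3 -m pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 \
    --index-url https://download.pytorch.org/whl/cu121

RUN python3 -m pip install sageattention==1.0.3

# flash-attn compiles CUDA kernels with ninja; slow and very RAM-hungry.
# MAX_JOBS caps parallel compile jobs to bound peak memory use.
RUN MAX_JOBS=4 python3 -m pip install flash-attn==2.6.3 --no-build-isolation
```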
Regarding the VAE: I tried 480x480x9 tiles, which is the biggest I could do without OOM; it takes all of my 24 GB of VRAM. I experimented a bit with these settings. There's no need to save anything, since only the VAE node is re-run, and unlike in PyramidFlow I see no issues with that here. The results are slightly different visually when I change the settings; the batch size changes nothing, or too little for me to notice. The tile size quickly blows up VRAM, so even 512x512 is too much.
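As a rough sanity check on why 512x512 tips over the limit while 480x480 just fits: decoder activation memory should grow roughly linearly with the tile voxel count. This is a back-of-envelope assumption on my part (the real decoder has fixed overheads too), but the numbers line up:

```python
# Back-of-envelope only: assumes decoder activation memory scales linearly
# with tile voxel count (width * height * frames); real overheads differ.
def relative_tile_memory(w, h, f, base=(480, 480, 9)):
    bw, bh, bf = base
    return (w * h * f) / (bw * bh * bf)

# 512x512x9 has ~14% more voxels than 480x480x9, so if the 480 tile
# already uses ~24 GB, the 512 tile would need ~27 GB and OOM on a
# 24 GB card.
print(round(relative_tile_memory(512, 512, 9), 3))  # 1.138
```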
If you use smaller tiles you can use a larger batch, and the overlap setting is mostly what helps with the seams.
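To illustrate why the overlap setting matters for seams: overlapping tiles can be cross-faded so each output sample is a weighted average of every tile that covers it, instead of a hard cut at the tile border. A minimal 1-D sketch of that idea (my own illustration, not the node's actual implementation, which blends spatial/temporal latent tiles):

```python
import numpy as np

def linear_ramp_weight(tile_len, overlap):
    """Weight profile for one tile: ramps up over the overlap at each end."""
    w = np.ones(tile_len)
    if overlap > 0:
        # strictly inside (0, 1) so edge samples never get zero total weight
        ramp = np.linspace(0.0, 1.0, overlap + 2)[1:-1]
        w[:overlap] = ramp
        w[-overlap:] = ramp[::-1]
    return w

def blend_tiles_1d(tiles, step, overlap, total_len):
    """Accumulate overlapping tiles with ramp weights, then normalize."""
    out = np.zeros(total_len)
    acc = np.zeros(total_len)
    for i, tile in enumerate(tiles):
        start = i * step
        w = linear_ramp_weight(len(tile), overlap)
        out[start:start + len(tile)] += tile * w
        acc[start:start + len(tile)] += w
    return out / acc
```

With zero overlap the weights degenerate to hard cuts, which is where visible seams come from; a larger overlap widens the cross-fade at the cost of decoding more redundant samples.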
First of all, thank you for your continuous effort in bringing the most recent models to CUI! It really helps me use them easily in my workflow.
I tried the different attention options. flash_attn works well, although setting it up is indeed a pain: the released binary didn't work in my Docker setup due to a missing C++ symbol, so I had to build it with pip, which is really slow and takes a lot of RAM during compilation with ninja. But in the end it worked. SDPA works as well; it's a bit slower but needs no extra dependencies. SageAttention is noticeably faster (around 13%), but the result looks broken.
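For context on why the backends should agree: SDPA, FlashAttention, and SageAttention all compute the same softmax attention. Flash fuses the computation, while Sage additionally quantizes Q and K to INT8, which is a plausible source of the broken output if the quantization interacts badly with this model. A toy numpy sketch of the idea (my own emulation for intuition, not Sage's actual kernel, which quantizes per-block with smoothing):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_ref(q, k, v):
    """Reference scaled dot-product attention; shapes (B, H, S, D)."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k.transpose(0, 1, 3, 2)) * scale
    return softmax(scores) @ v

def attention_int8_qk(q, k, v):
    """Toy emulation of INT8-quantized Q/K with a per-tensor scale, the
    kind of approximation Sage-style kernels make; the small per-score
    errors it introduces can compound over many layers and steps."""
    def quant(x):
        s = np.abs(x).max() / 127.0
        return np.round(x / s), s
    qi, qs = quant(q)
    ki, ks = quant(k)
    scale = qs * ks / np.sqrt(q.shape[-1])
    scores = (qi @ ki.transpose(0, 1, 3, 2)) * scale
    return softmax(scores) @ v
```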
Prompt:
A bustling summer day at a vibrant theme park, where the sun shines brightly in a cloudless sky. Families and friends roam the park, holding colorful balloons and enjoying ice cream cones. The roller coaster tracks wind high above, their sleek metal glinting in the sunlight.
Seed is 0; the rest of the parameters are default.

Sage:
https://github.com/user-attachments/assets/366c40d6-f2a4-4225-871d-e28550f93f2c
Flash:
https://github.com/user-attachments/assets/c0ea7b41-4548-48ba-a20b-6bffd6a285b6
There's also some grain that's not present in the official Genmo demo. Is that a tiled VAE issue, or something on my end?