kijai / ComfyUI-CogVideoXWrapper


Strange triton(windows) + SageAttention behavior #200

Closed Ratinod closed 2 weeks ago

Ratinod commented 2 weeks ago

I have a strange situation. After a lot of wasted time, I managed to install triton (Windows) and SageAttention (https://github.com/kijai/ComfyUI-CogVideoXWrapper/issues/150). Yes, it became faster, but the result turned into garbage (colored cubes), and with TORA the result is just black nothing. Maybe someone knows where the problem might be? There are no errors in the console, so it is not clear what is wrong.

Python version: 3.11.8
pytorch version: 2.4.0+cu124
xformers version: 0.0.27.post2
flash_attn-2.6.3+cu123torch2.4.0cxx11abiFALSE-cp311-cp311-win_amd64
triton-3.1.0-cp311-cp311-win_amd64

https://github.com/user-attachments/assets/24adfe8a-17b1-485e-bbd1-3fda346b69de

Ratinod commented 2 weeks ago

Bonus: how to automatically fit any image into a 720x480 frame, centered with black borders, for CogVideoX_5b_I2V: embed_image
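A minimal sketch of the geometry behind that: scale the input to fit 720x480 while preserving aspect ratio, then center it on a black canvas. The function name and the Pillow usage in the comments are my own illustration, not the workflow's actual code.

```python
def letterbox_geometry(w, h, target_w=720, target_h=480):
    """Return (new_w, new_h, x_off, y_off): the scaled size that fits
    inside target_w x target_h preserving aspect ratio, and the offsets
    that center it (the remaining area becomes the black border)."""
    scale = min(target_w / w, target_h / h)
    new_w, new_h = round(w * scale), round(h * scale)
    return new_w, new_h, (target_w - new_w) // 2, (target_h - new_h) // 2

# With Pillow (if available) the paste itself would look like:
#   canvas = Image.new("RGB", (720, 480))            # black background
#   nw, nh, x, y = letterbox_geometry(*img.size)
#   canvas.paste(img.resize((nw, nh)), (x, y))

print(letterbox_geometry(1920, 1080))  # (720, 405, 0, 37)
```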

kijai commented 2 weeks ago

Which GPU? I haven't had that exact issue, but I would highly recommend torch 2.5.1 especially for torch.compile. Also try with torch.compile and sage attention individually to see which is the problem.

Ratinod commented 2 weeks ago

> Which GPU? I haven't had that exact issue, but I would highly recommend torch 2.5.1 especially for torch.compile. Also try with torch.compile and sage attention individually to see which is the problem.

GPU: RTX 4070 ti super

I decided to update and ran update_comfyui_and_python_dependencies.bat. This installed torch 2.5.1, but in the end it broke a lot of things, and it was hard to come up with a more or less working combination of everything. I still couldn't get things like flash_attn to compile without errors, so I have to look for a Windows .whl, and naturally there is no Windows wheel for torch 2.5.1.

torch.compile

I tried it out of curiosity. It kept complaining that a file in the TEMP directory already exists, or something like that. I haven't tried it separately. Damn, so many variables. I really don't want to set up a separate ComfyUI instance (but it is probably inevitable)...

Ratinod commented 2 weeks ago

Tested torch.compile without sage attention. It works, but generation is faster without it.

Ratinod commented 2 weeks ago

torch.compile + sage attention works (if it's turned on from the start). It's not faster, and it gives the same colored cubes.

How long should compilation take when everything works correctly? torch.compile + sage attention compiles very quickly (<10 sec), and the console shows:

The library main.lib and the object main.exp are created.
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'q_kernel_per_block_int8' for 'sm_89'
ptxas info    : Function properties for q_kernel_per_block_int8
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 91 registers, 384 bytes cmem[0]
main.c
   The library main.lib and the object main.exp are created.
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'k_kernel_per_block_int8' for 'sm_89'
ptxas info    : Function properties for k_kernel_per_block_int8
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 48 registers, 384 bytes cmem[0]
ptxas info    : 11 bytes gmem, 8 bytes cmem[4]
ptxas info    : Compiling entry function '_attn_fwd' for 'sm_89'
ptxas info    : Function properties for _attn_fwd
    8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
ptxas info    : Used 255 registers, 8 bytes cumulative stack size, 460 bytes cmem[0], 8 bytes cmem[2]
main.c
   The library main.lib and the object main.exp are created.

torch.compile alone takes longer to compile (~1 min) and there is more text in the console.

wswszhys commented 2 weeks ago

Me too: Windows, torch 2.4.0, CUDA 12.1, triton 3.1, SageAttention, colored cubes, RTX 3060 12GB laptop, GGUF Fun model.

kijai commented 2 weeks ago

> > Which GPU? I haven't had that exact issue, but I would highly recommend torch 2.5.1 especially for torch.compile. Also try with torch.compile and sage attention individually to see which is the problem.
>
> GPU: RTX 4070 ti super
>
> I decided to update and ran update_comfyui_and_python_dependencies.bat. This installed torch 2.5.1, but in the end it broke a lot of things, and it was hard to come up with a more or less working combination of everything. I still couldn't get things like flash_attn to compile without errors, so I have to look for a Windows .whl, and naturally there is no Windows wheel for torch 2.5.1.
>
> torch.compile
>
> I tried it out of curiosity. It kept complaining that a file in the TEMP directory already exists, or something like that. I haven't tried it separately. Damn, so many variables. I really don't want to set up a separate ComfyUI instance (but it is probably inevitable)...

Flash_attn isn't really that useful: these nodes don't use it, and I can't think of anything that currently requires it. It is possible to install it on torch 2.5.1, but that takes a while; the torch update is far more useful and important in my opinion, especially for compile. The temp-file issue is a bug in torch.compile on Windows. There's a workaround fix that I applied in the mochi nodes, but I haven't done that here since I never ran into the bug myself; I'm not sure if it was fixed in 2.5.1, as it was still present in 2.5.0.

Not getting a speed increase might just be down to the GPU too; I have no experience compiling on anything but a 3090 and 4090 myself. Torch 2.5.0 definitely still made a huge difference, especially to compile times.

Ratinod commented 2 weeks ago

model: CogVideoX_5b_I2V_GGUF_Q4_0
Got a fresh https://github.com/comfyanonymous/ComfyUI/releases/tag/v0.2.6 (torch 2.5.0+cu124, Python 3.12.7)

pytorch 2.5.0+cu124 + triton 3.1 + sageattention + "compile:disabled" -> colored cubes
pytorch 2.5.1+cu124 + triton 3.1 + sageattention + "compile:disabled" -> colored cubes

pytorch version: 2.5.1+cu124 + triton 3.1 + "compile:torch" -> FileExistsError

FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\Users\\Username\\AppData\\Local\\Temp\\torchinductor_Username\\cache\\.6224.2484.tmp' -> 'C:\\Users\\Username\\AppData\\Local\\Temp\\torchinductor_Username\\cache\\18566357848df3845af69495a202d25f7e8827ce8df43c2d216c7cb41cd90baa'

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

FileExistsError fixed with https://github.com/pytorch/pytorch/issues/138211#issuecomment-2422975123

pytorch 2.5.1+cu124 + triton 3.1 + "compile:torch" -> works, but speed is the same
pytorch 2.5.1+cu124 + triton 3.1 + sageattention + "compile:torch" -> colored cubes
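For reference, the workaround in that PyTorch issue amounts to making inductor's cache write use `os.replace` (which silently overwrites an existing destination on both POSIX and Windows) instead of `pathlib.Path.rename` (which raises FileExistsError on Windows when the target already exists). A standalone sketch of the safe pattern, as my own helper rather than the actual torch code:

```python
import os
import tempfile

def write_atomic(path: str, data: bytes) -> None:
    """Write data to path atomically: write to a temp file in the same
    directory, then os.replace() it over the destination. os.replace
    overwrites an existing target on Windows too, which is exactly the
    case where Path.rename raises FileExistsError."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)  # atomic swap, tolerant of an existing file
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```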

Maybe everything breaks at the "CogVideo Decode" stage because of "enable_vae_tiling:true", but I can't check that (OOM).

Ratinod commented 2 weeks ago

Mochi colored cubes

https://github.com/user-attachments/assets/61cd40f9-c1f2-452a-9f5d-52cc81e692db

Ratinod commented 2 weeks ago

The solution to the problem has been found: https://github.com/woct0rdho/triton-windows/issues/3#issuecomment-2453138155

The problem occurs when using "ClipLoader (GGUF)" with "t5-v1_1-xxl-encoder-Q8_0.gguf" instead of "Load CLIP" with "t5xxl_fp8_e4m3fn.safetensors". The only question is why the GGUF clip doesn't work properly, whether it can be fixed, and on whose side the fix is required.


Tora works too.


Mochi works too.

wswszhys commented 2 weeks ago

> The solution to the problem has been found: woct0rdho/triton-windows#3 (comment)
>
> The problem occurs when using "ClipLoader (GGUF)" with "t5-v1_1-xxl-encoder-Q8_0.gguf" instead of "Load CLIP" with "t5xxl_fp8_e4m3fn.safetensors". The only question is why the GGUF clip doesn't work properly and whether it can be fixed (and on whose side the fix is needed).
>
> Tora works too.
>
> Mochi works too.

Right, I resolved it.

wswszhys commented 2 weeks ago

After putting the sageattention code at the beginning of the GGUF code, I resolved it. I'll try it on CogVideoX.
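A toy illustration (not actual ComfyUI, GGUF, or sageattention code) of why moving the sageattention patch earlier can change the outcome: when two extensions monkeypatch the same attention entry point, whichever applies last either wraps or clobbers the other, so import/apply order decides which kernels actually run.

```python
class Attention:
    """Stand-in for the attention entry point both extensions patch."""
    def forward(self, x):
        return "default"

def apply_sage_patch(cls):
    # replaces forward outright, like swapping in a sageattention kernel
    cls.forward = lambda self, x: "sage"

def apply_gguf_patch(cls):
    # wraps whatever forward is installed at patch time
    orig = cls.forward
    cls.forward = lambda self, x: f"gguf({orig(self, x)})"

# sage applied first -> the gguf wrapper sees it and keeps it
apply_sage_patch(Attention)
apply_gguf_patch(Attention)
print(Attention().forward(None))  # gguf(sage)

# reversed order -> the sage patch clobbers the gguf wrapper entirely
class Attention2:
    def forward(self, x):
        return "default"

apply_gguf_patch(Attention2)
apply_sage_patch(Attention2)
print(Attention2().forward(None))  # sage
```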