kijai / ComfyUI-CogVideoXWrapper


Strange triton(windows) + SageAttention behavior #200

Open Ratinod opened 6 hours ago

Ratinod commented 6 hours ago

I have a strange situation... After a lot of wasted time, I managed to install triton (Windows) and SageAttention (https://github.com/kijai/ComfyUI-CogVideoXWrapper/issues/150). Yes, it became faster... but the result ends up as garbage (colored cubes)... And with TORA the result is just black. Maybe someone knows where the problem might be? There are no errors in the console, so it's not clear what the problem is.

Python version: 3.11.8
pytorch version: 2.4.0+cu124
xformers version: 0.0.27.post2
flash_attn-2.6.3+cu123torch2.4.0cxx11abiFALSE-cp311-cp311-win_amd64
triton-3.1.0-cp311-cp311-win_amd64

https://github.com/user-attachments/assets/24adfe8a-17b1-485e-bbd1-3fda346b69de
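For anyone comparing environments, here is a small sketch to print the relevant versions and the GPU compute capability (sageattention itself may not expose a version attribute, so it's left out):

```python
import torch, triton, xformers

# Quick environment dump for comparison with the versions listed above.
print("torch    :", torch.__version__, "| CUDA", torch.version.cuda)
print("triton   :", triton.__version__)
print("xformers :", xformers.__version__)
major, minor = torch.cuda.get_device_capability(0)
print("GPU      :", torch.cuda.get_device_name(0), f"(sm_{major}{minor})")
```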

Ratinod commented 5 hours ago

Bonus: how to automatically embed any image into a 720x480 frame, centered with black borders, for CogVideoX_5b_I2V: embed_image
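The attached embed_image presumably shows the workflow; outside ComfyUI, the same letterboxing can be sketched with Pillow (the function name here is made up for illustration):

```python
from PIL import Image

def letterbox_to_720x480(path_in, path_out):
    """Fit an image inside 720x480, centered on a black background."""
    target_w, target_h = 720, 480
    img = Image.open(path_in).convert("RGB")
    # Scale so the image fits entirely inside the target box.
    scale = min(target_w / img.width, target_h / img.height)
    new_size = (round(img.width * scale), round(img.height * scale))
    img = img.resize(new_size, Image.LANCZOS)
    # Paste onto a black canvas, centered.
    canvas = Image.new("RGB", (target_w, target_h), (0, 0, 0))
    offset = ((target_w - new_size[0]) // 2, (target_h - new_size[1]) // 2)
    canvas.paste(img, offset)
    canvas.save(path_out)

letterbox_to_720x480("input.png", "input_720x480.png")
```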

kijai commented 4 hours ago

Which GPU? I haven't had that exact issue, but I would highly recommend torch 2.5.1 especially for torch.compile. Also try with torch.compile and sage attention individually to see which is the problem.
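One way to check sage attention in isolation, outside ComfyUI, is to compare its output against PyTorch's SDPA on random tensors. A minimal sketch, assuming the sageattention package exposes sageattn(q, k, v, is_causal=...) as in its README (the exact signature may vary between versions):

```python
import torch
import torch.nn.functional as F
from sageattention import sageattn  # signature may differ between versions

# Random fp16 tensors in (batch, heads, seq_len, head_dim) layout.
q, k, v = (torch.randn(1, 16, 1024, 64, dtype=torch.float16, device="cuda")
           for _ in range(3))

ref = F.scaled_dot_product_attention(q, k, v)  # PyTorch reference
out = sageattn(q, k, v, is_causal=False)       # SageAttention kernel

# If this difference is large (or NaN), the kernels are misbehaving on this
# triton/Windows setup, which would explain the colored cubes.
print("max abs diff:", (ref - out).abs().max().item())
```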

Ratinod commented 3 hours ago

> Which GPU? I haven't had that exact issue, but I would highly recommend torch 2.5.1 especially for torch.compile. Also try with torch.compile and sage attention individually to see which is the problem.

GPU: RTX 4070 ti super

I decided to update and ran update_comfyui_and_python_dependencies.bat, which installed torch 2.5.1. But in the end it broke a lot of things, and it was hard to come up with a more or less working combination of everything. Plus I still couldn't get some packages, for example "flash_attn", to compile without errors, so I have to look for a Windows .whl... and naturally there is no Windows whl for torch 2.5.1.

> torch.compile

I tried it out of curiosity. It complained a lot that a file in the TEMP directory already exists, or something like that. I haven't tried it separately. Damn, so many variables. I really don't want to set up a separate ComfyUI instance (but most likely it's inevitable)...
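The "already exists" complaint from torch.compile may come from a stale TorchInductor cache in the temp directory; a minimal sketch of clearing it, assuming TorchInductor's default torchinductor_<username> cache location:

```python
import getpass
import shutil
import tempfile
from pathlib import Path

# TorchInductor caches compiled artifacts under the system temp directory
# by default; wipe it so the next torch.compile run starts clean.
cache_dir = Path(tempfile.gettempdir()) / f"torchinductor_{getpass.getuser()}"
if cache_dir.exists():
    shutil.rmtree(cache_dir)
    print("removed", cache_dir)
```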

Ratinod commented 3 hours ago

Tested torch.compile without sage attention. It works, but generation is faster without it.

Ratinod commented 3 hours ago

torch.compile + sage attention runs (if it's enabled from the start), but it's not faster and gives the same colored cubes.

How long should compilation take when everything works correctly? torch.compile + sage attention compiles very quickly (<10 sec), and in the console:

The library main.lib and the object main.exp are created.
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'q_kernel_per_block_int8' for 'sm_89'
ptxas info    : Function properties for q_kernel_per_block_int8
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 91 registers, 384 bytes cmem[0]
main.c
   The library main.lib and the object main.exp are created.
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'k_kernel_per_block_int8' for 'sm_89'
ptxas info    : Function properties for k_kernel_per_block_int8
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 48 registers, 384 bytes cmem[0]
ptxas info    : 11 bytes gmem, 8 bytes cmem[4]
ptxas info    : Compiling entry function '_attn_fwd' for 'sm_89'
ptxas info    : Function properties for _attn_fwd
    8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
ptxas info    : Used 255 registers, 8 bytes cumulative stack size, 460 bytes cmem[0], 8 bytes cmem[2]
main.c
   The library main.lib and the object main.exp are created.

torch.compile alone takes longer to compile (~1 min), and there is more text in the console.
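To see what each run is actually compiling, one option (assuming a recent PyTorch) is to turn on compile-time logging, either via the TORCH_LOGS environment variable or from Python before the first compiled call:

```python
# Windows cmd equivalent:  set TORCH_LOGS=graph_breaks,output_code
import torch
import torch._dynamo

# Print graph breaks and the generated Triton/C++ kernels, so the run with
# sage attention patched in can be compared against the run without it.
torch._logging.set_logs(graph_breaks=True, output_code=True)
torch._dynamo.reset()  # drop cached graphs so everything recompiles

# Tiny stand-in module, just to trigger a compile and show the logging.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU()).cuda().half()
compiled = torch.compile(model)
out = compiled(torch.randn(8, 64, device="cuda", dtype=torch.float16))
```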

wswszhys commented 1 hour ago

Me too: Windows, torch 2.4.0, CUDA 12.1, triton 3.1, sageattention, colored cubes, RTX 3060 12 GB laptop, GGUF Fun model.