rhymes-ai / Allegro

Allegro is a powerful text-to-video model that generates high-quality videos up to 6 seconds at 15 FPS and 720p resolution from simple text input.
https://rhymes.ai/
Apache License 2.0

Is there a strict requirement for GPUs that support flash_attention? #17

Open feng20001022 opened 20 hours ago

feng20001022 commented 20 hours ago

Is there a strict requirement for GPUs that support flash_attention? I tried to test on a V100, but this GPU does not support flash_attention, resulting in the error "RuntimeError: No available kernel. Aborting execution."

/Allegro/allegro/models/transformers/block.py:824: UserWarning: Memory efficient kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:723.)
  hidden_states = F.scaled_dot_product_attention(
/Allegro/allegro/models/transformers/block.py:824: UserWarning: Memory Efficient attention has been runtime disabled. (Triggered internally at ../aten/src/ATen/native/transformers/sdp_utils_cpp.h:495.)
  hidden_states = F.scaled_dot_product_attention(
/Allegro/allegro/models/transformers/block.py:824: UserWarning: Flash attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:725.)
  hidden_states = F.scaled_dot_product_attention(
/Allegro/allegro/models/transformers/block.py:824: UserWarning: Flash attention only supports gpu architectures in the range [sm80, sm90]. Attempting to run on a sm 7.0 gpu. (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:201.)
  hidden_states = F.scaled_dot_product_attention(
/Allegro/allegro/models/transformers/block.py:824: UserWarning: CuDNN attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:727.)
  hidden_states = F.scaled_dot_product_attention(
/Allegro/allegro/models/transformers/block.py:824: UserWarning: The CuDNN backend needs to be enabled by setting the environment variable TORCH_CUDNN_SDPA_ENABLED=1 (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:496.)
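
The key warning is the sm range: PyTorch's flash-attention kernel only runs on GPUs in [sm80, sm90], while a V100 is sm 7.0. A minimal check (a sketch, not part of the Allegro repo) to see what your GPU reports:

```python
import torch

# Sketch, not part of the Allegro repo: report the GPU's compute capability.
# PyTorch's flash-attention kernel requires sm80-sm90; a V100 reports (7, 0),
# which is why the "No available kernel" error appears when flash is forced.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU compute capability: sm{major}{minor}")
print("Flash attention eligible:", (8, 0) <= (major, minor) <= (9, 0))
```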

feng20001022 commented 20 hours ago

I solved this problem by changing "with sdpa_kernel(SDPBackend.FLASH_ATTENTION)" (line 824 of Allegro/allegro/models/transformers/block.py) to "with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=True)", which ensures flash attention is disabled.
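
For reference, a self-contained sketch of that workaround (the query/key/value tensors below are dummies standing in for the ones built in block.py; on newer PyTorch releases torch.backends.cuda.sdp_kernel is deprecated in favor of torch.nn.attention.sdpa_kernel):

```python
import torch
import torch.nn.functional as F

# Dummy tensors standing in for the real query/key/value built in
# allegro/models/transformers/block.py.
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# Instead of forcing the flash backend via sdpa_kernel(SDPBackend.FLASH_ATTENTION),
# allow only the math and memory-efficient kernels, which sm 7.0 GPUs can run.
with torch.backends.cuda.sdp_kernel(
    enable_flash=False, enable_math=True, enable_mem_efficient=True
):
    hidden_states = F.scaled_dot_product_attention(q, k, v)

print(hidden_states.shape)  # torch.Size([1, 8, 128, 64])
```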

nightsnack commented 20 hours ago

No, there is not. Feel free to modify the attention processor.

feng20001022 commented 19 hours ago

A new problem: after changing the code as shown above, it now tries to allocate 560.82 GiB at inference time. And nothing changes even though enable_cpu_offload is set to True.

File "/Allegro/allegro/models/transformers/block.py", line 826, in call hidden_states = F.scaled_dot_product_attention( torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 560.82 GiB. GPU 0 has a total capacity of 31.74 GiB of which 26.35 GiB is free. Process 2048906 has 5.38 GiB memory in use. Of the allocated memory 4.81 GiB is allocated by PyTorch, and 218.90 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

nightsnack commented 19 hours ago

What? 560 GiB? It seems something weird is happening on the V100. I remember I tested xformers on an A100 and the memory cost remained the same. We don't have a V100, so I'm afraid there's nothing I can do about it, unfortunately.

feng20001022 commented 19 hours ago

I found the issue. The V100 does not support bfloat16 precision, but it doesn't throw an error; the underlying implementation probably falls back to some very expensive computation. After I switched to float16 precision, it ran successfully, using 6 GiB on a single GPU. However, generating a result takes about 4 hours, so I guess I need to use faster GPUs. :)

Grownz commented 16 hours ago

How do you switch precision modes?

feng20001022 commented 16 hours ago

> How do you switch precision modes?

Just change "dtype=torch.bfloat16" on line 13 of single_inference.py to "dtype=torch.float16".
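
A hedged variant of that change (the torch.cuda.is_bf16_supported() check is added for illustration and is not in single_inference.py): pick the dtype based on whether the GPU actually supports bfloat16, so a V100 falls back to float16 automatically.

```python
import torch

# Sketch: prefer bfloat16 when the GPU natively supports it, otherwise fall
# back to float16 (e.g. on V100, which is sm 7.0 and has no native bf16).
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
print("Using dtype:", dtype)
```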

Grownz commented 14 hours ago

Didn't work either way, but thank you anyway :)

Grownz commented 14 hours ago

> I solved this problem by changing "with sdpa_kernel(SDPBackend.FLASH_ATTENTION)" (line 824 of Allegro/allegro/models/transformers/block.py) to "with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=True)", which ensures flash attention is disabled.

This did work, but it is brutally slow (RTX 3090).