omerbt / TokenFlow

Official PyTorch implementation for "TokenFlow: Consistent Diffusion Features for Consistent Video Editing" presenting "TokenFlow" (ICLR 2024)
https://diffusion-tokenflow.github.io
MIT License

VRAM Usage relative to n_frames #20

Closed DBarker774 closed 10 months ago

DBarker774 commented 10 months ago

Hi,

I am relatively new to the AI space so apologies if I am missing key information.

I've noticed that I am able to fully process videos that are under a certain number of n_frames.

For example, I can successfully process a video with n_frames set to 30; however, the higher I set n_frames, the more VRAM is required.

Does this mean that all frames need to be loaded into VRAM at once? Is it true that more frames always require more VRAM, or are there optimizations needed for longer videos?

DBarker774 commented 10 months ago

I should note that I have successfully processed a video in batches of 30 frames; however, this introduces inconsistencies at the boundaries between batches.

DBarker774 commented 10 months ago

Below is an example output of a video cut into 3 batches of 24 frames. Note the inconsistencies, or jumps, between batches, which are very noticeable.

Running the full 450 frames of this video results in CUDA out of memory.

https://github.com/omerbt/TokenFlow/assets/27270800/c44581da-62dd-46e8-81ca-6ac2ca4217be

rakesh-reddy95 commented 10 months ago

Can you show the results after preprocessing?

DBarker774 commented 10 months ago

It's worth mentioning that I also tried processing the video on Google Colab with an A100 40GB and still ran out of memory.

I'm wondering if I am missing something when it comes to processing longer videos.

MichalGeyer commented 10 months ago

Hi there! Just to make sure -- the inconsistencies in your result come from treating the video as 3 different videos when running our method. You shouldn't see such inconsistencies if you ran our method on the full video.

In terms of memory, the main bottleneck is the computation of extended attention on the keyframes, which is a massive matrix multiplication. I think it can be lightened (at the expense of run time, though) by adding more for loops in place of batched matrix multiplication in this computation: https://github.com/omerbt/TokenFlow/blob/06f51a0d0c19bef88f0b9b521146b5b849fbfb76/tokenflow_utils.py#L168C13-L168C16 It's currently written such that above 96 frames it loops over the frames to compute the cross-frame attention of each keyframe, and it also loops over the different attention heads. This was designed for our resources; you can add another loop over the attention sequence-length dimension.

For reference, I was able to run the method on 200+ frames using 48G gpu mem.

Hope this helps!
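To illustrate the trade-off described above, here is a minimal sketch of chunking an attention computation over the query sequence dimension so that the full (seq_len x seq_len) score matrix is never materialized at once. This is not TokenFlow's actual code; the function name and tensor shapes are illustrative assumptions.

```python
import torch

def chunked_attention(q, k, v, chunk_size=1024):
    """Compute softmax(q @ k^T / sqrt(d)) @ v in chunks of query rows.

    q, k, v: tensors of shape (heads, seq_len, dim). Peak memory for the
    score matrix drops from O(seq_len^2) to O(chunk_size * seq_len),
    at the cost of a Python-level loop (slower, but fits in less VRAM).
    """
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for start in range(0, q.shape[-2], chunk_size):
        q_chunk = q[..., start:start + chunk_size, :]
        # (heads, chunk, seq_len) -- only one chunk of scores lives at a time
        scores = (q_chunk @ k.transpose(-2, -1)) * scale
        out[..., start:start + chunk_size, :] = scores.softmax(dim=-1) @ v
    return out
```

Smaller `chunk_size` lowers peak memory further but adds more loop iterations, which mirrors the run-time-for-memory trade-off mentioned above.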

DBarker774 commented 10 months ago

> Hi there! Just to make sure -- the inconsistencies in your result come from treating the video as 3 different videos when running our method. You shouldn't see such inconsistencies if you ran our method on the full video.
>
> In terms of memory, the main bottleneck is the computation of extended attention on the keyframes, which is a massive matrix multiplication. I think it can be lightened (at the expense of run time, though) by adding more for loops in place of batched matrix multiplication in this computation: https://github.com/omerbt/TokenFlow/blob/06f51a0d0c19bef88f0b9b521146b5b849fbfb76/tokenflow_utils.py#L168C13-L168C16 It's currently written such that above 96 frames it loops over the frames to compute the cross-frame attention of each keyframe, and it also loops over the different attention heads. This was designed for our resources; you can add another loop over the attention sequence-length dimension.
>
> For reference, I was able to run the method on 200+ frames using 48G gpu mem.
>
> Hope this helps!

Thank you for such a detailed reply. I am somewhat of a beginner, but what you have mentioned makes sense.

Your clarification is completely correct. This is not an issue with the consistency of your method at all; rather, the inconsistencies were introduced by my workaround to keep VRAM under control.

Unfortunately, I do not have the skill set or know-how to make the for-loop changes you have suggested.

Just curious, how are you able to have a card with 48GB of memory?

I tried provisioning an A100 80GB from Google but was denied as I'm not a business, ahaha.

Big fan of your method and would love to put this into practice on higher-resolution, longer videos.

DBarker774 commented 10 months ago

Closing this as it has largely been answered.