I should note that I have been able to process a video in batches of 30 frames, but this introduces inconsistencies that grow with the number of batches.
Below is an example output of a video cut into 3 batches of 24 frames. Note the jumps between batches, which are very noticeable.
Running the full 450 frames of this video results in a CUDA out-of-memory error.
https://github.com/omerbt/TokenFlow/assets/27270800/c44581da-62dd-46e8-81ca-6ac2ca4217be
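Roughly, the batching workaround I'm using looks like the sketch below. `run_tokenflow` is just a stand-in for however the pipeline is actually invoked, not a real function from this repo; each chunk is edited with no knowledge of the others, which is where the jumps at chunk boundaries come from.

```python
import numpy as np

def process_in_batches(frames, run_tokenflow, batch_size=24):
    """Run the pipeline independently on consecutive chunks of frames.

    `run_tokenflow` is a placeholder for the actual editing call; because
    each chunk is processed on its own, the results are only consistent
    within a chunk, not across chunk boundaries.
    """
    outputs = []
    for start in range(0, len(frames), batch_size):
        chunk = frames[start:start + batch_size]
        outputs.append(run_tokenflow(chunk))  # placeholder call
    return np.concatenate(outputs, axis=0)
```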
Can you show the results after preprocessing?
It's worth mentioning that I also tried processing the video on Google Colab with an A100 40GB and still ran out of memory.
I'm wondering if I am missing something when it comes to processing longer videos.
Hi there! Just to make sure -- the inconsistencies in your result come from treating the video as 3 different videos when running our method. You shouldn't see such inconsistencies if you run our method on the full video.
In terms of memory, the main bottleneck is the computation of extended attention on the keyframes, which is a massive matrix multiplication. I think it can be lightened (at the expense of run time) by adding more for loops instead of batched matrix multiplication in this computation: https://github.com/omerbt/TokenFlow/blob/06f51a0d0c19bef88f0b9b521146b5b849fbfb76/tokenflow_utils.py#L168C13-L168C16 It's currently written such that above 96 frames it loops over the frames to compute the cross-frame attention of each keyframe, and it also loops over the different attention heads. This was designed for our resources; you can add another loop over the attention sequence-length dimension.
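For illustration, here is a minimal sketch of what chunking over the sequence length could look like, so the full score matrix is never materialized at once. The tensor shapes, names, and chunk size are assumptions, not the repository's actual code; you would adapt this to the tensors in the linked extended-attention computation.

```python
import torch

def chunked_attention(q, k, v, chunk_size=1024):
    """Memory-lighter attention: loop over chunks of the query sequence
    instead of building the full (seq_q x seq_k) score matrix in one go.

    Assumed shapes: q is (heads, seq_q, dim), k and v are (heads, seq_k, dim).
    """
    scale = q.shape[-1] ** -0.5
    out = []
    for start in range(0, q.shape[1], chunk_size):
        q_chunk = q[:, start:start + chunk_size]                # (heads, c, dim)
        scores = torch.bmm(q_chunk, k.transpose(1, 2)) * scale  # (heads, c, seq_k)
        attn = scores.softmax(dim=-1)
        out.append(torch.bmm(attn, v))                          # (heads, c, dim)
    return torch.cat(out, dim=1)
```

Smaller chunk sizes lower the peak memory further but add more loop iterations, so it is a direct memory-for-runtime trade-off.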
For reference, I was able to run the method on 200+ frames using 48G gpu mem.
Hope this helps!
Thank you for such a detailed reply. I am somewhat of a beginner but what you have mentioned makes sense.
Your clarification is completely correct. This is not an issue with the consistency of your method at all; the inconsistencies were introduced by my workaround to keep VRAM under control.
Unfortunately I do not have the skillset or know-how to make the updated for loops as you have suggested.
Just curious, how are you able to get a card with 48 GB of memory?
I tried provisioning an A100 80GB from Google but was denied as I'm not a business, ahaha.
Big fan of your method and would love to put it into practice on longer, higher-resolution videos.
Closing this as it has largely been answered.
Hi,
I am relatively new to the AI space so apologies if I am missing key information.
I've noticed that I am able to fully process videos only below a certain n_frames value.
For example, I can successfully process a video with n_frames set to 30, but the more I increase n_frames, the more VRAM is required.
Does this mean that all frames need to be loaded into VRAM? Is it true that more frames always require more VRAM, or are there optimizations needed for longer videos?
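To make the scaling concrete, here is a rough back-of-the-envelope estimate of why memory grows with n_frames, assuming each frame's queries attend over the concatenated tokens of all keyframes. The token count, head count, and fp16 storage below are illustrative guesses, not values measured from TokenFlow.

```python
def attn_scores_gib(n_frames, tokens_per_frame=4096, heads=8, bytes_per_el=2):
    """Size of one frame's attention score matrix against all keyframes' keys,
    i.e. a (heads, tokens_per_frame, n_frames * tokens_per_frame) tensor."""
    elements = heads * tokens_per_frame * (n_frames * tokens_per_frame)
    return elements * bytes_per_el / 1024**3

for n in (8, 24, 48, 96):
    print(f"{n:3d} frames -> ~{attn_scores_gib(n):.1f} GiB for the score matrix alone")
```

Under these assumptions the score matrix for a single frame already grows linearly with n_frames (and doing all frames at once grows quadratically), which matches the observation that raising n_frames quickly exhausts VRAM.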