omerbt / TokenFlow

Official PyTorch implementation of "TokenFlow: Consistent Diffusion Features for Consistent Video Editing" (ICLR 2024)
https://diffusion-tokenflow.github.io
MIT License
1.52k stars, 134 forks

Required GPU memory depends on the video length. #31

Open ysig opened 10 months ago

ysig commented 10 months ago

I've managed to run run_tokenflow_pnp.py on a short excerpt of my video (5 s), and it looks really cool, but when I run it on the full video (5 min) it crashes with a CUDA OOM error, even when I drop the batch size down to 1.

This dependence of memory on video length, which is probably caused by the extended attention, seems like a major limitation of the method, and it is highlighted neither in the discussion section nor anywhere else in the paper (as far as I can tell).

Is it possible to offload part of the attention computation to the CPU so that the number of frames is not a bottleneck?
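
To make the scaling concern concrete, here is a rough back-of-the-envelope sketch rather than a measurement of TokenFlow itself. It assumes the worst case, where every frame attends to every other frame and the full score tensor is materialized at once. The function name and the default values (8 heads, fp16, 4096 tokens per frame) are illustrative assumptions, and the real code attends over a subset of frames, so actual numbers will differ.

```python
def extended_attention_bytes(num_frames, tokens_per_frame, num_heads=8, bytes_per_element=2):
    """Estimate the size of one fully materialized attention-score tensor.

    Assumption (not taken from the TokenFlow code): all frames are attended
    to jointly, so queries and keys both span num_frames * tokens_per_frame
    tokens and the score tensor grows quadratically with the frame count.
    """
    total_tokens = num_frames * tokens_per_frame
    return num_heads * total_tokens * total_tokens * bytes_per_element


if __name__ == "__main__":
    # 4096 tokens per frame is a hypothetical 64x64 latent resolution.
    for frames in (1, 2, 4, 8, 16):
        gib = extended_attention_bytes(frames, tokens_per_frame=4096) / 2**30
        print(f"{frames:>3} frames -> {gib:8.2f} GiB per attention layer")
```

Under these assumptions, doubling the number of jointly attended frames quadruples the size of the score tensor, which would explain why lowering the batch size alone does not help.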

eps696 commented 9 months ago

That's exactly what I did here (in a way): https://github.com/omerbt/TokenFlow/issues/32. It handled longer sequences but not unlimited ones, because all of the attention data still has to be fed back to the GPU before the latent denoising step, and I didn't manage to feed it in batches at that step.
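
For readers wondering what "feeding attention data in batches" could look like, below is a minimal sketch of one generic technique: keys and values stay on the CPU and are streamed to the compute device in chunks, with an online (running) softmax combining the partial results. This is not the approach from issue #32 and not TokenFlow's code. The function name `streamed_attention` and its parameters are hypothetical, and the sketch covers a single attention head only.

```python
import torch


def streamed_attention(q, k_cpu, v_cpu, chunk_size=1024):
    """Compute softmax(q @ k.T / sqrt(d)) @ v with k and v streamed in chunks.

    q:      (num_queries, d) tensor on the compute device.
    k_cpu:  (num_keys, d) tensor kept off the device (e.g. on the CPU).
    v_cpu:  (num_keys, d_v) tensor kept off the device.
    Only one chunk of keys/values lives on the device at any time.
    """
    device = q.device
    scale = q.shape[-1] ** -0.5
    num_queries = q.shape[0]

    # Running statistics for a numerically stable online softmax.
    running_max = torch.full((num_queries, 1), float("-inf"), device=device, dtype=q.dtype)
    running_sum = torch.zeros((num_queries, 1), device=device, dtype=q.dtype)
    acc = torch.zeros((num_queries, v_cpu.shape[-1]), device=device, dtype=q.dtype)

    for start in range(0, k_cpu.shape[0], chunk_size):
        k = k_cpu[start:start + chunk_size].to(device)
        v = v_cpu[start:start + chunk_size].to(device)

        scores = (q @ k.T) * scale  # (num_queries, chunk)
        chunk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(running_max, chunk_max)

        # Rescale what was accumulated so far to the new maximum.
        correction = torch.exp(running_max - new_max)
        weights = torch.exp(scores - new_max)

        running_sum = running_sum * correction + weights.sum(dim=-1, keepdim=True)
        acc = acc * correction + weights @ v
        running_max = new_max

    return acc / running_sum
```

The result matches ordinary softmax attention up to floating-point error, while peak device memory for the scores scales with `chunk_size` instead of the total number of keys. The cost is extra host-to-device transfer time per chunk, and the sketch says nothing about whether the denoising step mentioned above can be restructured to consume its inputs this way.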