Open · ysig opened this issue 1 year ago
That's exactly what I did here https://github.com/omerbt/TokenFlow/issues/32 (in a way). It handled longer sequences, but not unlimited ones: the extended-attention data for all frames still has to be moved back to the GPU before the latent-denoising step (and I didn't manage to make that step consumable in batches).
I've managed to run `run_tokenflow_pnp.py` on a small excerpt of my video (5 s), and the result looks really cool. But when I run it on the full video (5 min), it crashes with a CUDA OOM error even when I drop the batch size down to 1. This scaling of memory with video length, presumably caused by the extended attention, seems like a major limitation of the method, and it is not highlighted in the discussion section or anywhere else in the paper (as far as I can tell).
Is it possible to offload part of the attention computation to the CPU so that the number of frames is not a bottleneck?
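A minimal sketch of that CPU-offload idea (this is not TokenFlow's actual code — the function name, shapes, and chunking scheme are illustrative assumptions): keep the cross-frame keys and values on the CPU, stream one chunk of frames at a time to the compute device, and combine chunks with a running log-sum-exp so the result matches full attention exactly while peak GPU memory depends only on the chunk size, not the number of frames.

```python
import torch

def chunked_cross_frame_attention(q, k, v, chunk=4):
    """Attention over keys/values kept on the CPU (hypothetical sketch).

    q: (tokens, dim) tensor on the compute device
    k, v: (frames, tokens, dim) tensors resident on the CPU
    chunk: number of frames moved to the device per step
    """
    device = q.device
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    # Running log-sum-exp of attention scores seen so far, so partial
    # softmax results over chunks can be merged exactly.
    lse = torch.full((q.shape[0], 1), float("-inf"), device=device)
    for i in range(0, k.shape[0], chunk):
        # Move only this chunk of frames onto the device.
        kc = k[i:i + chunk].reshape(-1, k.shape[-1]).to(device)
        vc = v[i:i + chunk].reshape(-1, v.shape[-1]).to(device)
        scores = (q @ kc.T) * scale                    # (tokens, chunk*tokens)
        chunk_lse = scores.logsumexp(-1, keepdim=True)
        new_lse = torch.logaddexp(lse, chunk_lse)
        # Rescale the previous accumulator and add this chunk's contribution.
        out = out * (lse - new_lse).exp() + (scores - new_lse).exp() @ vc
        lse = new_lse
    return out
```

This trades GPU memory for host-to-device transfer time, so it would be slower, but the streaming-softmax merge makes the output numerically equivalent to attending over all frames at once.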