lzhangbj / ASVA

[ECCV 2024 Oral] Audio-Synchronized Visual Animation
https://lzhangbj.github.io/projects/asva/asva.html

Longer Video Generation #3

Closed LeoniusChen closed 1 month ago

LeoniusChen commented 1 month ago

Great work! I have been testing the model and noticed that audio clips longer than 2 seconds are truncated, so the generated videos are still only 12 frames long (2 seconds × 6 frames per second), consistent with the 12-frame training setup.

Is it possible, at inference time, to generate videos that match the length of the audio, i.e., longer than 12 frames?

lzhangbj commented 1 month ago

We did try autoregressive generation. Specifically, you first generate 12 frames for a 2-second audio chunk, then use the last generated frame as the conditioning image, together with the next 2-second audio chunk, to generate again. By design, you can iteratively generate longer videos conditioned on longer audio in this way.
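For concreteness, here is a minimal sketch of that loop. `generate_clip` is a hypothetical stand-in for the repo's 2-second inference call (not an actual function in this codebase), and the audio is assumed to be a 1-D waveform tensor sampled at `sr` Hz:

```python
import torch

def generate_long_video(first_frame, audio, sr, generate_clip, clip_seconds=2):
    # Hypothetical sketch: `generate_clip(cond_frame, audio_chunk)` is assumed
    # to return a [12, H, W, C] tensor of frames for a 2-second audio chunk.
    chunk_len = clip_seconds * sr
    cond_frame = first_frame
    all_frames = []
    # Walk over the audio in 2-second chunks, re-conditioning on the
    # last frame of the previous generation each time.
    for start in range(0, audio.shape[-1], chunk_len):
        chunk = audio[..., start:start + chunk_len]
        if chunk.shape[-1] < chunk_len:
            break  # drop a trailing partial chunk (or pad it, if preferred)
        frames = generate_clip(cond_frame, chunk)
        all_frames.append(frames)
        cond_frame = frames[-1]
    return torch.cat(all_frames, dim=0)
```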

However, like many other autoregressive methods, this leads to error accumulation: since generated frames are usually of lower quality than ground-truth frames, the error propagates into each subsequent generation, and so on.

In practice, we did try this approach for longer video generation. As you can see below, while the generated content stays well synchronized with the input audio, image quality degrades due to error accumulation:

https://github.com/user-attachments/assets/f9b5e7a8-99a4-48d6-8bda-d44997ba0a49

Hope this helps!

LeoniusChen commented 1 month ago

Nice! It would be incredibly helpful if you could release the code for iterative generation and expose it as an option when generating videos. This would greatly facilitate future work by allowing easier comparison.

lzhangbj commented 1 month ago

Hi Leonius,

Very sorry, I am currently overburdened by another project and do not have time to add more features to the code unless really necessary. The released codebase has been cleaned up, but the video above was produced with our previous, uncleaned code; the two are not quite compatible.

To implement this feature, you can divide the audio into 2-second chunks, generate once per chunk, and extract the last frame of each generated clip with cv2 or torchvision.io to condition the next generation (a rough sketch is below). This is not too complicated if you need a quantitative comparison.
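As a minimal illustration of the last-frame extraction step, either library works; the file path here is just an example:

```python
import cv2
import torchvision.io as tvio

# Option 1: torchvision.io -- frames come back as a [T, H, W, C] uint8 tensor.
frames, _, _ = tvio.read_video("generated_clip.mp4", pts_unit="sec")
last_frame = frames[-1]  # condition the next 2-second generation on this

# Option 2: cv2 -- seek to the final frame index before reading.
cap = cv2.VideoCapture("generated_clip.mp4")
n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.set(cv2.CAP_PROP_POS_FRAMES, n_frames - 1)
ok, last_frame_bgr = cap.read()  # BGR uint8 array; convert to RGB if needed
cap.release()
```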

Hope you can understand. Thank you!

LeoniusChen commented 1 month ago

Thanks for the reply. I understand the constraints. Have a nice day!