We did try autoregressive generation. Specifically, you first generate 12 frames for a 2-second audio clip, then use the last generated frame as input, together with the next 2 seconds of audio, to generate again. Repeating this step lets you iteratively generate longer videos conditioned on longer audio.
However, as with many other autoregressive methods, this leads to error accumulation: since generated frames are usually of lower quality than ground-truth frames, the error carries over into the next generation, and so on.
In practice, we did try this approach for longer video generation. As you can see below, while the generated video content stays well synchronized with the input audio, the image quality degrades over time due to error accumulation:
https://github.com/user-attachments/assets/f9b5e7a8-99a4-48d6-8bda-d44997ba0a49
Hope this helps!
Nice! It would be incredibly helpful if you could release the code for iterative generation and add an option for it when generating videos. This would greatly facilitate future work and make comparisons easier.
Hi Leonius,
Very sorry, but I am currently overburdened by another project and do not have time to add more features to the code unless really necessary. The currently released codebase has been cleaned up, but the video above was produced with our earlier, uncleaned code, and the two are not quite compatible.
To implement this feature, you can divide the audio into multiple 2-second chunks, run generation once per chunk, and extract the last frame of each generated clip using cv2 or torchvision.io to use as the input for the next chunk. This is not too complicated if you need it for quantitative comparison.
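A minimal sketch of that loop, in case it helps. This is not the code that produced the video above. `generate_clip` is a hypothetical stand-in for whatever inference call the released codebase exposes, and dropping a trailing chunk shorter than 2 seconds is an assumption of the sketch:

```python
from typing import Callable, List, Sequence, Tuple

CHUNK_SECONDS = 2.0  # clip length mentioned in this thread


def chunk_bounds(num_samples: int, sample_rate: int,
                 chunk_seconds: float = CHUNK_SECONDS) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of consecutive full-length chunks.

    A trailing partial chunk is dropped (an assumption of this sketch).
    """
    chunk_len = int(round(chunk_seconds * sample_rate))
    if chunk_len <= 0:
        raise ValueError("chunk length must be positive")
    return [(start, start + chunk_len)
            for start in range(0, num_samples - chunk_len + 1, chunk_len)]


def generate_long_video(audio: Sequence, sample_rate: int, first_frame,
                        generate_clip: Callable) -> list:
    """Chain clip generations, seeding each one with the previous last frame.

    `generate_clip(frame, audio_chunk)` is hypothetical: it stands in for the
    model's inference call and must return a non-empty list of frames.
    """
    frames = []
    cond_frame = first_frame
    for start, end in chunk_bounds(len(audio), sample_rate):
        clip = generate_clip(cond_frame, audio[start:end])
        frames.extend(clip)
        cond_frame = clip[-1]  # last generated frame conditions the next clip
    return frames
```

If your generation step writes video files instead of returning frames, the `clip[-1]` line is where you would read back the last frame with cv2 or torchvision.io. Also check whether each generated clip repeats its conditioning frame as its first frame; if it does, you may want to skip that frame when concatenating.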
Hope you can understand. Thank you!
Thanks for the reply. I understand the constraints. Have a nice day!
Great work! I have been testing the model and noticed that audio clips longer than 2 seconds are truncated, so the generated videos are still only 12 frames (2 seconds * 6 frames per second), consistent with the training setup of using 12 frames.
I want to ask whether it is possible at inference time to generate videos that match the length of the audio, i.e. videos longer than 12 frames?
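For a sense of scale, here is the frame arithmetic spelled out as a small sketch. It uses only the numbers quoted above (2-second clips at 6 frames per second); the helper names are made up for illustration:

```python
import math

FPS = 6            # frames per second, as quoted above
CLIP_SECONDS = 2   # audio length consumed per generation, as quoted above


def frames_for(audio_seconds: float) -> int:
    """Number of video frames needed to cover the audio at FPS."""
    return round(audio_seconds * FPS)


def clips_needed(audio_seconds: float) -> int:
    """Number of 2-second generations needed to cover the audio."""
    return math.ceil(audio_seconds / CLIP_SECONDS)


print(frames_for(2), clips_needed(2))
print(frames_for(10), clips_needed(10))
```

So a 10-second audio clip would need 60 frames, i.e. five 12-frame generations rather than one.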