Model/Pipeline/Scheduler description

Video-to-Audio (V2A) models have recently gained attention for generating audio directly from silent videos, particularly in video/film production. However, previous V2A methods suffer from limited generation quality in terms of temporal synchronization and audio-visual relevance. The authors present Diff-Foley, a synchronized Video-to-Audio synthesis method based on a latent diffusion model (LDM) that generates high-quality audio with improved synchronization and audio-visual relevance. They first adopt contrastive audio-visual pretraining (CAVP) to learn temporally and semantically aligned features (a rough sketch of the objective is given below), then train an LDM conditioned on CAVP-aligned visual features in a spectrogram latent space. The CAVP-aligned features enable the LDM to capture subtler audio-visual correlations via its cross-attention layers. The authors further improve sample quality significantly by combining classifier-free guidance with CAVP discriminator guidance (see the sketch at the end of this issue). Diff-Foley achieves state-of-the-art V2A performance on a current large-scale V2A dataset, and the authors demonstrate its practical applicability and generalization capabilities via downstream finetuning.
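For reference, here is a minimal sketch of the semantic-alignment half of a CAVP-style contrastive objective (symmetric InfoNCE over paired video/audio features). The encoders and the temporal-contrast term are omitted; all names here are placeholders, not the authors' implementation:

```python
# Minimal sketch of a CAVP-style contrastive loss. Assumes `video_emb` and
# `audio_emb` are pooled, temporally aligned segment features produced by
# hypothetical video/audio encoders (not shown).
import torch
import torch.nn.functional as F


def cavp_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (video, audio) embeddings.

    video_emb, audio_emb: tensors of shape (batch, dim); row i of each
    tensor comes from the same clip, so the diagonal is the positive pair.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = video_emb @ audio_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2a = F.cross_entropy(logits, targets)    # video -> audio direction
    loss_a2v = F.cross_entropy(logits.t(), targets)  # audio -> video direction
    return 0.5 * (loss_v2a + loss_a2v)


# Smoke test with random stand-in features:
v = torch.randn(8, 512)
a = torch.randn(8, 512)
print(cavp_contrastive_loss(v, a))
```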
Open source status
[X] The model implementation is available
[X] The model weights are available (Only relevant if addition is not a scheduler).
Provide useful links for the implementation
Main Author: @luosiallen
Code: https://github.com/luosiallen/Diff-Foley
Paper: https://arxiv.org/pdf/2306.17203.pdf
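For completeness, a rough sketch of the double-guidance idea described above: classifier-free guidance combined with a gradient from an audio-visual alignment classifier. `unet` and `align_classifier` are hypothetical stand-ins, not the released Diff-Foley API, and the noise-schedule factor of classical classifier guidance is folded into `cls_scale` for brevity:

```python
# Hedged sketch: combine classifier-free guidance (CFG) with alignment
# ("discriminator") guidance at a single denoising step.
import torch


def guided_noise_pred(unet, align_classifier, latents, t, visual_feat,
                      null_feat, cfg_scale=4.5, cls_scale=50.0):
    # CFG: blend conditional and unconditional noise predictions.
    eps_cond = unet(latents, t, visual_feat)
    eps_uncond = unet(latents, t, null_feat)
    eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)

    # Alignment guidance: steer latents toward higher predicted
    # audio-visual alignment by following the classifier's gradient.
    x = latents.detach().requires_grad_(True)
    score = align_classifier(x, t, visual_feat).sum()
    grad = torch.autograd.grad(score, x)[0]
    return eps - cls_scale * grad


# Smoke test with toy stand-ins for the networks:
unet = lambda x, t, c: torch.randn_like(x)
clf = lambda x, t, c: (x * c.mean()).sum(dim=(1, 2, 3))
z = torch.randn(2, 4, 32, 32)
cond = torch.randn(2, 77, 512)
out = guided_noise_pred(unet, clf, z, torch.tensor([10]), cond,
                        torch.zeros_like(cond))
print(out.shape)  # torch.Size([2, 4, 32, 32])
```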