Model/Pipeline/Scheduler description

Video-to-Audio (V2A) models have recently gained attention for generating audio directly from silent videos, particularly in video/film production. However, previous V2A methods suffer from limited generation quality in terms of temporal synchronization and audio-visual relevance. The authors present Diff-Foley, a synchronized Video-to-Audio synthesis method based on a latent diffusion model (LDM) that generates high-quality audio with improved synchronization and audio-visual relevance. They first adopt contrastive audio-visual pretraining (CAVP) to learn temporally and semantically aligned features (a rough sketch of the objective is given below), then train an LDM conditioned on CAVP-aligned visual features in a spectrogram latent space. The CAVP-aligned features enable the LDM to capture subtler audio-visual correlations via its cross-attention layers. The authors further improve sample quality significantly by combining classifier-free guidance with CAVP discriminator guidance (see the sketch at the end of this issue). Diff-Foley achieves state-of-the-art V2A performance on a current large-scale V2A dataset, and the authors demonstrate its practical applicability and generalization capabilities via downstream finetuning.
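For reference, here is a minimal sketch of the semantic-alignment half of a CAVP-style contrastive objective (symmetric InfoNCE over paired video/audio features). The encoders and the temporal-contrast term are omitted; all names here are placeholders, not the authors' implementation:

```python
# Minimal sketch of a CAVP-style contrastive loss. Assumes `video_emb` and
# `audio_emb` are pooled, temporally aligned segment features produced by
# hypothetical video/audio encoders (not shown).
import torch
import torch.nn.functional as F


def cavp_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (video, audio) embeddings.

    video_emb, audio_emb: tensors of shape (batch, dim); row i of each
    tensor comes from the same clip, so the diagonal is the positive pair.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = video_emb @ audio_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2a = F.cross_entropy(logits, targets)    # video -> audio direction
    loss_a2v = F.cross_entropy(logits.t(), targets)  # audio -> video direction
    return 0.5 * (loss_v2a + loss_a2v)


# Smoke test with random stand-in features:
v = torch.randn(8, 512)
a = torch.randn(8, 512)
print(cavp_contrastive_loss(v, a))
```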
Open source status
[X] The model implementation is available
[X] The model weights are available (Only relevant if addition is not a scheduler).
Provide useful links for the implementation
Main Author: @luosiallen
Code: https://github.com/luosiallen/Diff-Foley
Paper: https://arxiv.org/pdf/2306.17203.pdf
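For completeness, a rough sketch of the double-guidance idea described above: classifier-free guidance combined with a gradient from an audio-visual alignment classifier. `unet` and `align_classifier` are hypothetical stand-ins, not the released Diff-Foley API, and the noise-schedule factor of classical classifier guidance is folded into `cls_scale` for brevity:

```python
# Hedged sketch: combine classifier-free guidance (CFG) with alignment
# ("discriminator") guidance at a single denoising step.
import torch


def guided_noise_pred(unet, align_classifier, latents, t, visual_feat,
                      null_feat, cfg_scale=4.5, cls_scale=50.0):
    # CFG: blend conditional and unconditional noise predictions.
    eps_cond = unet(latents, t, visual_feat)
    eps_uncond = unet(latents, t, null_feat)
    eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)

    # Alignment guidance: steer latents toward higher predicted
    # audio-visual alignment by following the classifier's gradient.
    x = latents.detach().requires_grad_(True)
    score = align_classifier(x, t, visual_feat).sum()
    grad = torch.autograd.grad(score, x)[0]
    return eps - cls_scale * grad


# Smoke test with toy stand-ins for the networks:
unet = lambda x, t, c: torch.randn_like(x)
clf = lambda x, t, c: (x * c.mean()).sum(dim=(1, 2, 3))
z = torch.randn(2, 4, 32, 32)
cond = torch.randn(2, 77, 512)
out = guided_noise_pred(unet, clf, z, torch.tensor([10]), cond,
                        torch.zeros_like(cond))
print(out.shape)  # torch.Size([2, 4, 32, 32])
```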