huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

DIFF-FOLEY: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models #5760


clarencechen commented 8 months ago

Model/Pipeline/Scheduler description

Video-to-Audio (V2A) models have recently gained attention for generating audio directly from silent videos, particularly in video/film production. However, previous V2A methods suffer from limited generation quality in terms of temporal synchronization and audio-visual relevance. The authors present DIFF-FOLEY, a synchronized Video-to-Audio synthesis method based on a latent diffusion model (LDM) that generates high-quality audio with improved synchronization and audio-visual relevance. They adopt contrastive audio-visual pretraining (CAVP) to learn temporally and semantically aligned features, then train an LDM conditioned on the CAVP-aligned visual features in a spectrogram latent space. These aligned features enable the LDM to capture subtler audio-visual correlations via cross-attention layers. The authors further improve sample quality by combining classifier-free guidance with CAVP discriminator guidance. DIFF-FOLEY achieves state-of-the-art V2A performance on a current large-scale V2A dataset, and the authors demonstrate its practical applicability and generalization capabilities via downstream finetuning.
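
For readers unfamiliar with the CAVP step, a minimal sketch of the cross-modal alignment term is shown below. This is an illustrative InfoNCE-style loss between pooled video and audio embeddings, not the authors' exact implementation (the real CAVP objective also includes a temporal contrast term, and the encoder details are omitted here):

```python
import torch
import torch.nn.functional as F

def cavp_contrastive_loss(video_feats, audio_feats, temperature=0.07):
    """InfoNCE-style loss aligning pooled video and audio clip embeddings.

    video_feats, audio_feats: (batch, dim) tensors from the respective
    encoders. Only the semantic (cross-modal) alignment term is sketched.
    """
    v = F.normalize(video_feats, dim=-1)
    a = F.normalize(audio_feats, dim=-1)
    logits = v @ a.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # symmetric cross-entropy over video->audio and audio->video directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```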

Open source status

Provide useful links for the implementation

- Main author: @luosiallen
- Code: https://github.com/luosiallen/Diff-Foley
- Paper: https://arxiv.org/pdf/2306.17203.pdf

patrickvonplaten commented 8 months ago

Would be great to add it as a community pipeline: https://github.com/huggingface/diffusers/tree/main/examples/community
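
For context, community pipelines live under `examples/community` and are loaded by passing `custom_pipeline` to `DiffusionPipeline.from_pretrained`. A rough sketch of how a hypothetical `diff_foley` community pipeline might be used follows; the pipeline name, checkpoint id, and call signature are assumptions for illustration, not an existing API:

```python
import torch
from diffusers import DiffusionPipeline

# Hypothetical usage: once a "diff_foley" pipeline exists under
# examples/community, it could be loaded by name via `custom_pipeline`.
# The checkpoint id below is a placeholder.
pipe = DiffusionPipeline.from_pretrained(
    "path/to/diff-foley-checkpoint",   # placeholder model id
    custom_pipeline="diff_foley",      # placeholder community pipeline name
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# video_features would come from the CAVP visual encoder; the call
# signature is an assumption about how such a pipeline might be exposed.
# audio = pipe(video_features, guidance_scale=4.5).audios[0]
```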

luosiallen commented 8 months ago

Thanks Patrick! @patrickvonplaten