Hi, thanks for the inspiring work. I have a simple question about the pipeline: why do you choose to train the LDM conditioned on visual features from CAVP rather than on audio features? Since the two are supposed to be aligned, conditioning on audio features would enable unsupervised training similar to AudioLDM. Could you please offer some insights on this? Thank you so much!