luosiallen / Diff-Foley

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models

About the generalization ability #4

Closed auzxb closed 10 months ago

auzxb commented 10 months ago

Thanks for your work and the high-quality code. I've tried testing on my own silent videos, but found that many examples failed to generate sound, or the sound was very weak. Have you encountered a similar situation? Is it necessary to fine-tune on specific scenarios to improve generalization?

luosiallen commented 10 months ago

What types of video are you using?

auzxb commented 10 months ago

> What types of video are you using?

Some are videos from YouTube, and some are videos I shot myself, such as applauding in front of the camera or turning on the kitchen faucet. In practical evaluation, it is normal for the training set not to cover many scenes. To get better results, could you provide some guidance, e.g., on the type of video or the video parameters? Thank you so much.

luosiallen commented 10 months ago

You can refer to the categories in the VGGSound dataset. It is currently the largest V2A dataset (though it still contains only ~200k audio-video pairs), and Diff-Foley is trained on it. For better generalization, I suggest setting double guidance to False (since that classifier was also trained on VGGSound) and using a larger CFG scale such as 7.5. This might lead to better performance.
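
Concretely, these two settings affect the guided sampling step roughly as follows. This is a minimal sketch of standard classifier-free guidance plus optional classifier ("double") guidance; `model`, `classifier`, and all argument names are illustrative stand-ins, not the repo's actual API:

```python
import torch

# Sketch of one guided noise prediction with classifier-free guidance (CFG)
# and optional classifier ("double") guidance. `model` and `classifier` are
# hypothetical stand-ins; only the guidance arithmetic is standard.

def guided_eps(model, x_t, t, video_cond, null_cond,
               cfg_scale=7.5, classifier=None, cls_scale=0.0):
    eps_cond = model(x_t, t, video_cond)    # video-conditioned prediction
    eps_uncond = model(x_t, t, null_cond)   # unconditional prediction
    # Classifier-free guidance: extrapolate toward the conditional branch.
    eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)
    if classifier is not None and cls_scale > 0:
        # "Double guidance": also follow the gradient of an audio-video
        # alignment classifier's log-probability w.r.t. the noisy latent.
        # (Any noise-schedule factor is assumed folded into cls_scale.)
        with torch.enable_grad():
            x_in = x_t.detach().requires_grad_(True)
            logits = classifier(x_in, t, video_cond)
            logp = logits.log_softmax(dim=-1)[..., 1].sum()
            grad = torch.autograd.grad(logp, x_in)[0]
        eps = eps - cls_scale * grad
    return eps

# The suggestion above corresponds to classifier=None (double guidance off)
# with cfg_scale=7.5.
```

Dropping the classifier term removes the VGGSound-specific bias, while the larger CFG scale compensates by leaning harder on the video conditioning.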