Hi, thanks for the inspiring work. I have a simple question about the pipeline: why do you choose to train the LDM conditioned on visual features from CAVP rather than on audio features? Since the two are supposed to be aligned, conditioning on audio features would enable unsupervised training similar to AudioLDM. Could you please offer some insights on this? Thank you so much!