luosiallen / Diff-Foley

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
Apache License 2.0

Is the checkpoint of "Diff-Foley/diff_foley/modules/cond_stage/video_feat_encoder.py" essential?? #26

Open HUIZ-A opened 3 months ago

HUIZ-A commented 3 months ago

Thanks for your work! I'm confused about the video_feat_encoder class, which is used in "Diff-Foley/evaluation/config/eval_classifier.yaml" for evaluation. This encoder is an nn.Module that applies nn.Linear to change the tensor shape. Has video_feat_encoder been trained, or are its embedding parameters not really necessary, so that I can just use the randomly initialized ones?

luosiallen commented 3 months ago

Yes. It has been trained. You should use the pretrained weight.

HUIZ-A commented 3 months ago

Is the pretrained weight provided? It does not seem to have been uploaded to https://huggingface.co/SimianLuo/Diff-Foley/tree/main/diff_foley_ckpt.

HUIZ-A commented 3 months ago

@luosiallen

I think I've figured it out: eval_classifier.ckpt includes the parameters of the classifier backbone and video_feat_encoder, but not those of the first stage model. Is that right?

In "Diff-Foley/diff_foley/modules/double_guidance/alignment_classifier_metric.py", the first_stage_ckpt is loaded separately:

```python
class Alignment_Classifier_metric(pl.LightningModule):

    def __init__(self,
                 classifier_config,
                 first_stage_config,
                 cond_stage_config,
                 monitor,
                 first_stage_ckpt=None,
                 first_stage_key="spec",
                 scale_factor=1.0,
                 timesteps=2,
                 given_betas=None,
                 beta_schedule="linear",
                 linear_start=1e-4,
                 linear_end=2e-2,
                 cosine_s=8e-3,
                 v_posterior=0.,
                 parameterization="eps",
                 *args, **kwargs):

        super().__init__()

        # Build the first stage model, then load its weights from a
        # separate checkpoint if one is given.
        self.instantiate_first_stage(first_stage_config)
        self.first_stage_ckpt = first_stage_ckpt
        if self.first_stage_ckpt is not None:
            self.init_first_from_ckpt(self.first_stage_ckpt)
```

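
One way to check which sub-modules a checkpoint actually covers is to group its parameter names by their top-level prefix. The following is only a minimal sketch: the helper `count_by_prefix` and all key names in `example_keys` (including the `first_stage_model.` prefix) are hypothetical and chosen for illustration, so the real names in eval_classifier.ckpt may differ.

```python
from collections import Counter
from typing import Iterable


def count_by_prefix(keys: Iterable[str], depth: int = 1) -> Counter:
    """Count parameter names by their first `depth` dot-separated components."""
    counts = Counter()
    for key in keys:
        prefix = ".".join(key.split(".")[:depth])
        counts[prefix] += 1
    return counts


# Hypothetical key names, for illustration only; the real checkpoint may differ.
example_keys = [
    "classifier.backbone.0.weight",
    "classifier.backbone.0.bias",
    "cond_stage_model.embedder.weight",
    "cond_stage_model.embedder.bias",
]

summary = count_by_prefix(example_keys)
for prefix, n in sorted(summary.items()):
    print(f"{prefix}: {n} tensors")

# Assumed prefix for the first stage model; adjust it to the actual attribute name.
has_first_stage = any(k.startswith("first_stage_model.") for k in example_keys)
print("first stage weights present:", has_first_stage)
```

With PyTorch installed, the keys could presumably be read with `torch.load(path, map_location="cpu")`; PyTorch Lightning checkpoints usually keep the weights under the `"state_dict"` entry, though that should be confirmed for this particular file.
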
Angelalilyer commented 2 months ago

Hello! May I ask what results you got from your evaluation? When I use "video_feat_encoder.py", almost all values are close to 0. Without it, the accuracy when evaluating on VGGSound is 0.16. Obviously, this is not correct. T_T

HUIZ-A commented 2 days ago

@Angelalilyer My accuracy is about 0.8, using my variant model and a 100-sample evaluation subset.
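
When an evaluation returns values near 0, as reported earlier in this thread, one possible cause worth ruling out is that some module was left at its random initialization because its parameters were not found in the checkpoint. The sketch below compares two sets of parameter names to surface such mismatches. It is an assumption-laden illustration, not a diagnosis of this issue: `diff_keys` is a hypothetical helper and every key name is made up.

```python
from typing import Iterable, List, Tuple


def diff_keys(model_keys: Iterable[str],
              ckpt_keys: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) parameter names.

    missing:    expected by the model but absent from the checkpoint,
                so those parameters would keep their initial values.
    unexpected: present in the checkpoint but not used by the model.
    """
    model_set, ckpt_set = set(model_keys), set(ckpt_keys)
    missing = sorted(model_set - ckpt_set)
    unexpected = sorted(ckpt_set - model_set)
    return missing, unexpected


# Hypothetical names, for illustration only.
model_keys = [
    "classifier.head.weight",
    "cond_stage_model.embedder.weight",
    "cond_stage_model.embedder.bias",
]
ckpt_keys = [
    "classifier.head.weight",
    "video_feat_encoder.embedder.weight",
    "video_feat_encoder.embedder.bias",
]

missing, unexpected = diff_keys(model_keys, ckpt_keys)
print("missing from checkpoint:", missing)
print("unexpected in checkpoint:", unexpected)
```

In PyTorch, `module.load_state_dict(state_dict, strict=False)` returns an object with `missing_keys` and `unexpected_keys` fields that carry the same information, so printing that return value is a quick first check before comparing accuracy numbers.
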