luosiallen / Diff-Foley

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
Apache License 2.0

Why is Mel hop_len different for preprocess and training? #14

Closed cyanbx closed 3 months ago

cyanbx commented 6 months ago

Hi, thanks for sharing your great work. I'm a little confused about the mel hop length, which is 250 in data_preprocess but 256 in the dataset used for training. However, when I change the hop_len parameter of audio_video_spec_fullset_Dataset to 256, I get the following error in the diffusion forward pass:

2024-03-14 21:54:13.900 File "Diff-Foley/training/stage2_ldm/adm/modules/diffusionmodules/openai_unetmodel.py", line 736, in forward
2024-03-14 21:54:13.900 h = th.cat([h, hs.pop()], dim=1)
2024-03-14 21:54:13.900 RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 18 but got size 17 for tensor number 1 in the list.

Any help with it? Thanks a lot.
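For context, a shape error of this kind typically appears when a U-Net halves a feature map on the way down and doubles it on the way up, so an odd size no longer matches its stored skip connection. The sketch below reproduces that arithmetic in plain Python. It assumes a stride-2 convolution with kernel 3 and padding 1 for downsampling and 2x upsampling; these layer settings and the function names are illustrative assumptions, not confirmed details of openai_unetmodel.py.

```python
from typing import Optional, Tuple


def down(n: int) -> int:
    # Assumed downsampling: stride-2 conv, kernel 3, padding 1.
    # Output size is floor((n + 2 - 3) / 2) + 1, i.e. ceil(n / 2).
    return (n - 1) // 2 + 1


def up(n: int) -> int:
    # Assumed upsampling: size is doubled.
    return 2 * n


def first_skip_mismatch(size: int, levels: int) -> Optional[Tuple[int, int]]:
    """Return (upsampled_size, skip_size) at the first mismatch, else None.

    Hypothetical helper: it tracks one spatial dimension only.
    """
    skips = []
    h = size
    for _ in range(levels):
        skips.append(h)  # size saved for the skip connection
        h = down(h)
    for _ in range(levels):
        h = up(h)
        skip = skips.pop()
        if h != skip:
            return (h, skip)
    return None


print(first_skip_mismatch(17, 1))  # (18, 17)
print(first_skip_mismatch(16, 3))  # None
```

Under these assumptions, a dimension of 17 is halved to 9 and doubled back to 18, which matches the "Expected size 18 but got size 17" message; a dimension divisible by 2 raised to the number of levels passes through cleanly.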

kxgong commented 5 months ago

I also ran into a similar problem during training.

Diff-Foley/training/stage2_ldm/adm/modules/diffusionmodules/openai_unetmodel.py", line 744, in forward
    h = th.cat([h, hs.pop()], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 4 but got size 3 for tensor number 1 in the list.

luosiallen commented 3 months ago

Hey, thanks for mentioning this. For Stage 2 training and inference, we use a hop_len of 256. For Stage 1 training and inference, we use 250. This is for the purpose of temporal alignment.
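To make the difference between the two hop lengths concrete, here is a rough sketch of the frame arithmetic. The 16 kHz sample rate and 8-second clip length are assumptions chosen for illustration, the frame count ignores any centering or padding a real mel transform may apply, and mel_frames is a hypothetical helper rather than a function from the repository.

```python
SAMPLE_RATE = 16_000  # assumed sample rate in Hz
CLIP_SECONDS = 8      # assumed clip length in seconds


def mel_frames(num_samples: int, hop_len: int) -> int:
    # Approximate number of mel frames: one frame per hop.
    return num_samples // hop_len


num_samples = SAMPLE_RATE * CLIP_SECONDS
for hop_len in (250, 256):
    frames_per_second = SAMPLE_RATE / hop_len
    total = mel_frames(num_samples, hop_len)
    print(f"hop_len={hop_len}: {frames_per_second} frames/s, {total} frames per clip")
```

Under these assumptions the two settings yield different spectrogram lengths for the same audio, so a spectrogram computed with one hop length and cropped or indexed with the other will not have the size the model expects.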