huawei-noah / Speech-Backbones

This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab.

How is `out_size` in `params` determined #16

Closed. cantabile-kwok closed this issue 2 years ago.

cantabile-kwok commented 2 years ago

Hi, I am modifying the code for my own purposes. I noticed that here: https://github.com/huawei-noah/Speech-Backbones/blob/b82fdd546d9d977573c8557f242b06a0770ece8e/Grad-TTS/params.py#L53 the argument is hard-coded, and I guess 22050 and 256 are the sampling rate and frame shift (hop length) used for LJSpeech, right? If so, should I change this value when working with a different dataset?

ivanvovk commented 2 years ago

Hi, @cantabile-kwok! Yes, you got everything right. The `out_size` parameter just controls the length of the mel-spectrogram segment used to train the diffusion decoder. In our case, for LJSpeech, we used 2-second segments. In mel-spectrogram resolution this corresponds to $2 \times (\text{sampling rate} = 22050) \,//\, (\text{hop length} = 256)$ frames. We introduced this parameter to better fit GPU memory constraints. So, if you want to train Grad-TTS on your own dataset with $N$-second segments, you should set `out_size` to $N \times (\text{your sampling rate}) \,//\, (\text{your hop length})$. The `fix_len_compatibility` function then adjusts this length to a valid value (so that feature-map resolutions match during upsampling/downsampling in the U-Net).
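For illustration, here is a minimal sketch of how `out_size` could be computed for a different dataset. The dataset values (16 kHz sampling rate, hop length 200) are hypothetical, and `fix_len_compatibility` is written here under the assumption that it simply rounds the frame count up to a multiple of the U-Net's total downsampling factor, as described above; check the repo's own implementation before relying on this.

```python
def fix_len_compatibility(length, num_downsamplings_in_unet=2):
    # Assumption: round the segment length up to the nearest multiple of
    # 2**num_downsamplings_in_unet so feature-map resolutions stay consistent
    # across the U-Net's downsampling/upsampling stages.
    while length % (2 ** num_downsamplings_in_unet) != 0:
        length += 1
    return length

# Hypothetical dataset parameters (not from the repo):
sampling_rate = 16000    # Hz
hop_length = 200         # samples per mel frame
segment_seconds = 2      # training segment length in seconds

out_size = fix_len_compatibility(segment_seconds * sampling_rate // hop_length)
print(out_size)  # -> 160 frames for these example values
```

With the repo's LJSpeech values (22050 Hz, hop length 256), the same formula gives `2 * 22050 // 256 = 172` frames, which is already divisible by 4 and so passes through unchanged.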

cantabile-kwok commented 2 years ago

@ivanvovk Yeah, I see. Thanks for the detailed reply!