open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License
4.41k stars, 373 forks

[BUG]-NaturalSpeech2 data preprocess & pitch loss #148

Closed zyy-fc closed 5 months ago

zyy-fc commented 6 months ago

Why skip silence when extracting durations?

Since there is no data preprocessing script for NaturalSpeech2, I reviewed the LibriTTS processing script and found that, when extracting durations, it filters out the durations and phones that correspond to silence. However, this may cause the number of frames implied by the durations to no longer match the number of frames in the speech codes extracted by the audio codec. How can this issue be addressed?
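The mismatch described above can be illustrated with a small check. This is only a sketch under stated assumptions: the phone labels, the silence markers, and the frame counts are made up for illustration and are not taken from the Amphion preprocessing scripts.

```python
# Hypothetical alignment as (phone, duration in frames). Values are illustrative only.
alignment = [
    ("sil", 12), ("HH", 5), ("AH", 7), ("sil", 3),
    ("L", 6), ("OW", 9), ("sil", 10),
]

# Assumed silence markers; the real script may use different labels.
SILENCE_LABELS = {"sil", "sp", ""}


def drop_silence(items):
    """Remove silence phones together with their durations."""
    return [(phone, dur) for phone, dur in items if phone not in SILENCE_LABELS]


def total_frames(items):
    """Sum of durations, which has to equal the codec frame count for alignment."""
    return sum(dur for _, dur in items)


codec_frames = total_frames(alignment)               # frames covering the full audio
kept_frames = total_frames(drop_silence(alignment))  # frames left after filtering
missing_frames = codec_frames - kept_frames

print(codec_frames, kept_frames, missing_frames)
```

If the silence entries are dropped, the remaining durations cover fewer frames than the codec produces for the same audio, which is the misalignment the question asks about.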

Should hop_size be the same for pitch and the audio codec?

When extracting pitch with "parselmouth", should hop_size be the same as the one used by the audio codec?

pitch_target_log is "-inf" when pitch_target==0 in ns2_loss.py

pitch_target is the quantized ground-truth pitch, so it will inevitably contain zeros (for silence). Applying "torch.log(pitch_target)" directly to those zeros gives a non-finite value, since the log of 0 is undefined. Is there any special design for the pitch loss?

HeCheng0625 commented 6 months ago

Hi, the hop size used when extracting pitch should be the same as the one used by the audio codec. As for pitch prediction, we have also noticed this problem; you can predict torch.log(pitch_target + 1) instead. We will fix the problem soon.
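A minimal sketch of the suggested log(pitch_target + 1) target follows. It uses plain Python `math` to stay dependency-free; in PyTorch the equivalent would be `torch.log(pitch_target + 1)` or `torch.log1p(pitch_target)`. The function names and the sample pitch values are hypothetical and not taken from ns2_loss.py.

```python
import math


def log_pitch_target(pitch_values):
    """Map each pitch value p to log(p + 1), so that p == 0 (silence) maps to 0.0."""
    return [math.log1p(p) for p in pitch_values]


def invert_log_pitch(log_values):
    """Undo the transform: exp(v) - 1 recovers the original pitch value."""
    return [math.expm1(v) for v in log_values]


# Illustrative values only: zeros stand for silent frames.
pitch_target = [0.0, 110.0, 220.0, 0.0]

log_target = log_pitch_target(pitch_target)
recovered = invert_log_pitch(log_target)

print(log_target)
print(recovered)
```

With the +1 offset every target stays finite, and a predicted value can be mapped back to a pitch with exp(v) - 1 at inference time.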

zyy-fc commented 6 months ago

Thank you!

If the hop size should be the same, the config in gitlab may not be right.
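To see why the two hop sizes need to agree, the frame counts can be compared directly. The sketch below is an assumption-laden illustration: the sample rate, both hop sizes, and the ceiling-based frame convention are hypothetical and are not read from the Amphion config. A pitch extractor that takes a time step in seconds (as parselmouth's `to_pitch` does) would be given hop_size / sample_rate.

```python
import math


def num_frames(num_samples, hop_size):
    """Frame count for one frame per hop (ceiling convention, assumed here)."""
    return math.ceil(num_samples / hop_size)


def pitch_time_step(hop_size, sample_rate):
    """Time step in seconds that corresponds to a hop size given in samples."""
    return hop_size / sample_rate


# Hypothetical settings, for illustration only.
sample_rate = 16000
codec_hop = 200          # assumed hop size of the audio codec
pitch_hop = 256          # assumed (mismatched) hop size for pitch extraction
num_samples = 3 * sample_rate  # a three-second utterance

codec_frame_count = num_frames(num_samples, codec_hop)
pitch_frame_count = num_frames(num_samples, pitch_hop)
matching_step = pitch_time_step(codec_hop, sample_rate)

print(codec_frame_count, pitch_frame_count, matching_step)
```

With different hop sizes the pitch sequence and the codec sequence have different lengths for the same audio, so a frame-level pitch loss cannot be computed without resampling one of them.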