zyy-fc commented 6 months ago

Why skip silence when extracting duration?

Since there is no data preprocessing script for NaturalSpeech2, I reviewed the LibriTTS processing script and found that, when extracting durations, it filters out the durations and phones that correspond to silence. However, this can leave the total number of frames covered by the durations out of step with the num_frames of the codec features extracted by the audio codec. How can this issue be addressed?
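To make the mismatch concrete, here is a minimal sketch with a made-up alignment. The phone labels, the silence markers ("sil", "sp"), and the frame counts are all hypothetical and are not taken from the LibriTTS script; the point is only that the durations no longer sum to the codec frame count once silence is dropped.

```python
# Hypothetical forced alignment: (phone, duration in frames).
# "sil" marks silence; none of these values come from a real dataset.
alignment = [
    ("sil", 12), ("HH", 5), ("AH", 7), ("sil", 3),
    ("L", 6), ("OW", 9), ("sil", 10),
]


def total_frames(items):
    """Sum the per-phone durations, in frames."""
    return sum(duration for _, duration in items)


def drop_silence(items, silence_labels=("sil", "sp")):
    """Filter out silence entries, as described in the question."""
    return [(p, d) for p, d in items if p not in silence_labels]


# The codec encodes the whole utterance, silence included.
codec_frames = total_frames(alignment)

# After dropping silence, the durations cover fewer frames.
kept_frames = total_frames(drop_silence(alignment))
missing = codec_frames - kept_frames

print(f"codec frames: {codec_frames}, duration frames: {kept_frames}, gap: {missing}")
```

A check like `sum(durations) == num_codec_frames` in the preprocessing step would surface this gap early.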

Should hop_size be the same for pitch and the audio codec?

When extracting pitch with "parselmouth", should hop_size be the same as the one used in the audio codec?
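A rough illustration of why the two hop sizes matter: with different hops, the pitch track and the codec features end up with different frame counts for the same audio. The sample rate, hop sizes, and the simple ceil-based framing below are assumptions for illustration only; the real frame counts depend on how the codec and the pitch extractor pad and window the signal.

```python
import math


def num_frames(num_samples, hop_size):
    """Frame count assuming one frame per hop, with the last partial hop kept.

    This framing rule is an assumption; actual extractors may differ by a
    frame or two depending on padding and window length.
    """
    return math.ceil(num_samples / hop_size)


sample_rate = 16000            # hypothetical sample rate
num_samples = 3 * sample_rate  # a 3-second utterance
codec_hop = 200                # hypothetical codec hop size, in samples
other_hop = 256                # a different hop size, for comparison

codec_frames = num_frames(num_samples, codec_hop)
pitch_frames_mismatched = num_frames(num_samples, other_hop)
pitch_frames_matched = num_frames(num_samples, codec_hop)

# Pitch extractors that take a time step in seconds (for example the
# time_step argument of parselmouth's Sound.to_pitch) would need the
# codec hop converted from samples to seconds.
time_step = codec_hop / sample_rate

print(codec_frames, pitch_frames_mismatched, pitch_frames_matched, time_step)
```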

pitch_target_log is "inf" when pitch_target == 0 in ns2_loss.py

pitch_target is the quantized ground-truth pitch, so it will inevitably contain zeros (for silence). Calling "torch.log(pitch_target)" directly yields -inf for those frames. Is there any special design for the pitch loss?

HeCheng0625 commented 6 months ago

Hi, the hop size used when extracting pitch should be the same as the one used in the audio codec. As for the pitch prediction problem, we have run into it as well; you can predict torch.log(pitch_target + 1) instead. We will fix the problem soon.
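A small sketch of the log(pitch + 1) workaround, written with the standard library so it runs without PyTorch. The pitch values are hypothetical. On tensors the equivalent would be torch.log(pitch_target + 1) or torch.log1p(pitch_target); this is not the actual code in ns2_loss.py.

```python
import math

# Hypothetical per-frame pitch targets in Hz; 0.0 marks silent/unvoiced frames.
pitch_target = [0.0, 110.0, 220.0, 0.0]


def log_pitch_plus_one(values):
    """Compute log(pitch + 1), which maps 0 to 0 instead of -inf.

    Note: math.log(0.0) raises ValueError, whereas torch.log on a zero
    tensor element returns -inf. Either way, the +1 shift avoids it.
    """
    return [math.log1p(v) for v in values]


def invert_log_pitch(values):
    """Recover pitch from log(pitch + 1) via exp(x) - 1."""
    return [math.expm1(v) for v in values]


pitch_target_log = log_pitch_plus_one(pitch_target)
recovered = invert_log_pitch(pitch_target_log)

print(pitch_target_log)
print(recovered)
```

If the model predicts in this shifted log domain, predictions need the inverse transform (exp(x) - 1) to get back to the original pitch scale.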

zyy-fc commented 6 months ago

Thank you!

If the hop sizes should be the same, the config on gitlab may not be right.

open-mmlab / Amphion

[BUG]-NaturalSpeech2 data preprocess & pitch loss #148
