Closed zyy-fc closed 5 months ago
Hi, the hop size used when extracting pitch should be the same as the one used by the audio codec. We have also run into the pitch prediction problem; as a workaround, you can predict torch.log(pitch_target+1) instead. We will fix this soon.
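A minimal sketch of the log(pitch_target+1) workaround mentioned above, written in plain Python so it runs without dependencies. In training code the equivalent call would be torch.log1p(pitch_target). The helper names `to_log_pitch` and `from_log_pitch` and the sample pitch values are hypothetical, not taken from the repository.

```python
import math

def to_log_pitch(pitch_values):
    """Map pitch values (0 for silent or unvoiced frames) to log(pitch + 1).

    log1p(0) is exactly 0, so silent frames stay finite.
    """
    return [math.log1p(p) for p in pitch_values]

def from_log_pitch(log_pitch_values):
    """Invert the transform with exp(x) - 1 to recover the pitch values."""
    return [math.expm1(x) for x in log_pitch_values]

# Hypothetical pitch track: silence, two voiced frames, silence.
pitch = [0.0, 110.0, 220.0, 0.0]
log_pitch = to_log_pitch(pitch)
recovered = from_log_pitch(log_pitch)
```

With this transform the regression target is 0 for silent frames instead of an infinite value, and predictions can be mapped back to the original pitch scale with the inverse.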
Thank you!
If the hop size should be the same, the config in GitLab may not be right.
Why skip silence when extracting duration?
Since there is no data preprocessing script for NaturalSpeech2, I reviewed the LibriTTS processing script and found that, when extracting duration, it filters out the durations and phones corresponding to silence. However, this may cause a mismatch in num_frames between the durations and the codec frames extracted by the audio codec. How can this issue be addressed?
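One possible way to keep the two in step, sketched below, is to drop only leading and trailing silence and to slice the codec frames by the same frame range, so that the remaining durations still sum to the number of codec frames. This is an assumption for illustration, not a method confirmed by the maintainers. The silence labels, the helper name `trim_edge_silence`, and the toy data are all hypothetical.

```python
# Assumed silence labels; the real label set depends on the aligner used.
SILENCE = {"sil", "sp", "spn", ""}

def trim_edge_silence(phones, durations):
    """Drop leading and trailing silence phones; keep internal pauses.

    Returns the kept phones, their durations (in frames), and the
    (start, end) frame range they cover in the original utterance.
    """
    start = 0
    while start < len(phones) and phones[start] in SILENCE:
        start += 1
    end = len(phones)
    while end > start and phones[end - 1] in SILENCE:
        end -= 1
    lead_frames = sum(durations[:start])
    kept_frames = sum(durations[start:end])
    return phones[start:end], durations[start:end], (lead_frames, lead_frames + kept_frames)

# Toy utterance: durations are in frames and sum to 30.
phones = ["sil", "HH", "AH", "sp", "L", "OW", "sil"]
durations = [5, 3, 4, 2, 3, 6, 7]
codec_frames = list(range(30))  # stand-in for 30 codec frames

kept_phones, kept_durs, (lo, hi) = trim_edge_silence(phones, durations)
codec_trimmed = codec_frames[lo:hi]  # slice codec frames to the same range
```

After trimming, `sum(kept_durs)` equals `len(codec_trimmed)`, which is the invariant the question is about. Internal pauses are kept as ordinary phones so no frames in the middle of the utterance go unaccounted for.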
Should hop_size for pitch and audio codec be the same?
When using "parselmouth" to extract pitch, should hop_size be the same as the one used by the audio codec?
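To make the relationship concrete, here is a small sketch of how a shared hop size translates into a pitch time step and an expected frame count. parselmouth's `Sound.to_pitch` takes its `time_step` in seconds, so a hop size in samples has to be divided by the sample rate. The sample rate, hop size, ceiling-division frame count, and helper names below are assumptions for illustration, not values from the repository config.

```python
def pitch_time_step(hop_size, sample_rate):
    """Convert a hop size in samples to a time step in seconds."""
    return hop_size / sample_rate

def num_codec_frames(num_samples, hop_size):
    """Expected frame count, assuming one frame per hop (ceiling division)."""
    return -(-num_samples // hop_size)

def fit_to_length(values, target_len, pad_value=0.0):
    """Trim or zero-pad a pitch track so it has exactly target_len frames."""
    if len(values) >= target_len:
        return values[:target_len]
    return values + [pad_value] * (target_len - len(values))

# Hypothetical settings: 16 kHz audio, hop of 200 samples, 1 second of audio.
sample_rate = 16000
hop_size = 200
time_step = pitch_time_step(hop_size, sample_rate)
target_len = num_codec_frames(16000, hop_size)

# A pitch tracker may return slightly fewer frames than the codec
# because of its analysis window; pad or trim to match.
pitch_track = fit_to_length([100.0] * 77, target_len)
```

Even with matching hop sizes, the pitch track and the codec output can differ by a few frames at the edges, so a final trim or pad step like `fit_to_length` is a common safeguard.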
pitch_target_log is "-inf" when pitch_target==0 in ns2_loss.py
pitch_target is the quantized ground-truth pitch value, so it will inevitably contain zeros (for silence). Directly using "torch.log(pitch_target)" yields -inf for those frames, which makes the loss unusable. Is there any special design for the pitch loss?