Closed wizardk closed 2 years ago
The main problem was that tiny values (e.g., 3.35227e-09) appeared in the F0 contour. When such values are replaced by 0, real-time synthesis produces the speech correctly.
Offline synthesis has a safeguard that avoids this error. Real-time synthesis does not, because we prioritize processing speed. If needed, please add your own safeguard based on the expected lowest F0 of the input speech; in the general case, 40 Hz is one reasonable threshold.
Thanks for your help. One more question: why choose 40 Hz? Would the corresponding code look like this?
coarse_f0[i + synth->handoff] = f0[i] < 40 ? 0.0 : f0[i];
coarse_vuv[i + synth->handoff] = coarse_f0[i + synth->handoff] == 0.0 ? 0.0 : 1.0;
Evaluations of F0 estimators generally use the frequency range from 40 to 800 Hz. However, since this is only a rough standard, you can set the value based on the speech you expect as input. For example, you can choose around 100 Hz when the input is only female speech.
Your code looks good. However, I don't recommend modifying synthesisrealtime.cpp. If you get the F0 from another function, I recommend modifying that function instead.
OK, thanks a lot.
@mmorise Hi, I have another question. The audio generated by my acoustic model and the WORLD vocoder is not very clear (a little muffled). I called code_spectral_envelope(sp, 16000, 60) and code_aperiodicity(ap, 16000) to reduce the dimensionality of the acoustic model's output. How can I improve the audio quality? Which factor has the greatest impact on sound quality: F0, SP, or AP? Thank you in advance.
Generally, timbre degradation comes from the spectral envelope. WORLD can synthesize natural-sounding speech from speech parameters estimated from a waveform when there is no estimation error, and I think the number of dimensions you used for the spectral envelope is enough for high-quality synthesis. It isn't easy to say how to improve speech quality in general (e.g., the amount of training data may not be enough, and unexpected factors often degrade the sound quality of WORLD).
Hi, when using this set of data, the non-real-time results are correct, but the real-time results are wrong.
fs = 16000
frame_period = 5
fft_size = 1024
buffer_size = 64
number_of_pointers = 100
f0.txt sp.txt ap.txt