mmorise / World

A high-quality speech analysis, manipulation and synthesis system
http://www.kisc.meiji.ac.jp/~mmorise/world/english

realtime bug #132

Closed: wizardk closed this issue 2 years ago

wizardk commented 2 years ago

Hi, with this set of data, the non-real-time results are correct, but the real-time results are wrong.

    fs = 16000
    frame_period = 5
    fft_size = 1024
    buffer_size = 64
    number_of_pointers = 100

f0.txt sp.txt ap.txt

mmorise commented 2 years ago

The main problem was the extremely small values (e.g., 3.35227e-09) in the F0 contour. When such values are replaced by 0, real-time synthesis can synthesize the speech.

The offline synthesis has a safeguard to avoid this error. Real-time synthesis does not, because we prioritize processing speed. If needed, please add your own safeguard based on the expected lowest F0 of the input speech. In the general case, 40 Hz is one reasonable threshold.

wizardk commented 2 years ago

Thanks for your help. One more question: why choose 40 Hz? Would the corresponding code look like this?

    coarse_f0[i + synth->handoff] = f0[i] < 40 ? 0.0 : f0[i];
    coarse_vuv[i + synth->handoff] = coarse_f0[i + synth->handoff] == 0.0 ? 0.0 : 1.0;

mmorise commented 2 years ago

Evaluations of F0 estimators generally use the frequency range from 40 to 800 Hz. Since this is only a rough standard, you can set the threshold based on the F0 range you expect. For example, you can choose around 100 Hz when the input signal contains only female speech.

Your code seems good. However, I don't recommend modifying synthesisrealtime.cpp. If you get the F0 from another function, I recommend modifying that function instead.
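As a rough illustration of that advice, the safeguard can be applied to the F0 contour before it is handed to the real-time synthesizer, so synthesisrealtime.cpp stays untouched. The sketch below is only an assumption about how this might look: `ApplyF0Floor` is a hypothetical helper, not part of WORLD, and the default of 40 Hz is the rough threshold mentioned above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of WORLD): mark frames whose F0 is nonzero
// but below f0_floor as unvoiced by setting them to 0.
// Returns the number of frames that were changed.
std::size_t ApplyF0Floor(std::vector<double> &f0, double f0_floor = 40.0) {
  std::size_t changed = 0;
  for (double &value : f0) {
    // Tiny values such as 3.35227e-09 are not plausible pitch values.
    // The negated comparison also catches negative values and NaN.
    if (value != 0.0 && !(value >= f0_floor)) {
      value = 0.0;
      ++changed;
    }
  }
  return changed;
}
```

One possible usage is to call `ApplyF0Floor(f0)` on the output of your F0 estimator or acoustic model, and then pass the cleaned contour to the real-time synthesis as usual. A higher floor (for example, `ApplyF0Floor(f0, 100.0)`) may suit input that contains only female speech.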

wizardk commented 2 years ago

OK, thanks a lot.

wizardk commented 2 years ago

@mmorise Hi, I have another question. The audio generated by my acoustic model and the WORLD vocoder is not very clear (a little stuffy). I called code_spectral_envelope(sp, 16000, 60) and code_aperiodicity(ap, 16000) to reduce the dimensionality of the acoustic model output. How can I improve the audio quality? Which factor has the greatest impact on sound quality: F0, SP, or AP? Thank you in advance.

mmorise commented 2 years ago

Generally, timbre degradation comes from the spectral envelope. WORLD can synthesize natural speech from speech parameters estimated from a waveform when there is no estimation error. I think the number of dimensions you use for the spectral envelope is enough for high-quality synthesis. It isn't easy to say how to improve the speech quality: for example, the amount of data may not be enough, and unexpected factors often degrade the sound quality of WORLD.