mmorise / World

A high-quality speech analysis, manipulation and synthesis system
http://www.kisc.meiji.ac.jp/~mmorise/world/english

Speaker-specific improvements? #29

Closed dreamk73 closed 7 years ago

dreamk73 commented 7 years ago

How do you tweak the WORLD parameters to improve feature extraction for specific voices? I am using WORLD within Merlin to build TTS voices. I have trained models for a US English male voice, a US English female voice, and a Dutch female voice. The results for the male voice sound pretty good. The US English female voice sounded a bit buzzy in places, but still not bad. I replaced the LF0 track with one computed by the YAAPT algorithm (see https://github.com/bjbschmitt/AMFM_decompy), which gave slightly better quality and fewer VUV errors in Merlin. But the Dutch female voice sounds pretty bad, even when just performing copy synthesis.

mmorise commented 7 years ago

There are several possible causes of the deterioration in sound quality: (1) the speech was recorded in a low-SNR environment; (2) the voiced sections also contain a breathy component, which amounts to the same problem as (1); (3) the vocal cord vibration is not periodic. If the main cause is SNR, you can improve the performance by limiting the F0 search range to [f0_floor, f0_ceil]. On the other hand, you cannot improve it if the cause is (3): a vocoder-based system cannot analyze such speech because it assumes that voiced speech is periodic.
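For reference, a minimal copy-synthesis sketch that constrains the F0 search range, using the pyworld Python wrapper around WORLD (whichever interface you use should expose the same parameters; the floor and ceiling values below are only illustrative and must be tuned per speaker):

```python
# Copy-synthesis sketch with a restricted F0 search range (pyworld wrapper).
# f0_floor / f0_ceil values are illustrative, not recommendations.
import numpy as np
import pyworld as pw
import soundfile as sf

x, fs = sf.read("f1_dutch_ex.wav")          # pyworld expects a float64 mono signal
x = np.ascontiguousarray(x, dtype=np.float64)

f0_floor, f0_ceil = 100.0, 500.0            # e.g. a plausible range for a female voice

f0, t = pw.dio(x, fs, f0_floor=f0_floor, f0_ceil=f0_ceil, frame_period=5.0)
f0 = pw.stonemask(x, f0, t, fs)             # refine the coarse DIO estimate
sp = pw.cheaptrick(x, f0, t, fs)            # spectral envelope
ap = pw.d4c(x, f0, t, fs)                   # aperiodicity

y = pw.synthesize(f0, sp, ap, fs, frame_period=5.0)
sf.write("f1_dutch_ex_resynth.wav", y, fs)
```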

If possible, please attach a speech example. I will verify the analysis result.

dreamk73 commented 7 years ago

These are all high-quality studio recordings. It could be due to (2) and (3) in some sections, but I think the vowels and sonorants should still sound better than they do. The file in question in the zip file is f1_dutch_ex.wav. I have also included examples from a US English male and female voice that sound much better. It could still be helpful to set f0_floor and f0_ceil per voice, though. Is there an easy way to do this when extracting the features?

ex_wav.zip

mmorise commented 7 years ago

The speech contains a low-frequency component below 50 Hz. Such a component may cause the deterioration because windowing spreads it over a wide frequency band. One simple approach is high-pass filtering; see the sketch below. If the deterioration is caused by the low-frequency component, the quality of the re-synthesized speech will improve after high-pass filtering. (However, I have not checked this yet, so I don't know whether the hypothesis is correct.)
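A rough high-pass filtering sketch using scipy (the 4th-order Butterworth filter and the 60 Hz cutoff are just one possible choice, not something verified on this recording):

```python
# Pre-processing sketch: high-pass filter the waveform before WORLD analysis
# to remove energy below the speaker's F0 range.
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

x, fs = sf.read("f1_dutch_ex.wav")

cutoff_hz = 60.0                      # illustrative cutoff, below the lowest expected F0
sos = butter(4, cutoff_hz, btype="highpass", fs=fs, output="sos")
x_hp = sosfiltfilt(sos, x)            # zero-phase filtering avoids phase distortion

sf.write("f1_dutch_ex_hp.wav", x_hp, fs)
```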

dreamk73 commented 7 years ago

That is so strange, I had not noticed this before. My original sound files are 22050 Hz and I downsampled them to 16 kHz using Festival's ch_wave tool. When I look at the spectrograms of both versions and listen to both files, the original sounds much clearer than the downsampled one. I am going to look at other algorithms for downsampling to 16 kHz.
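One possible alternative to ch_wave is polyphase resampling in Python; scipy.signal.resample_poly applies an anti-aliasing filter, and 320/441 is the exact 22050 to 16000 ratio (filenames below are placeholders):

```python
# Downsampling sketch: 22050 Hz -> 16000 Hz with built-in anti-aliasing.
import soundfile as sf
from scipy.signal import resample_poly

x, fs = sf.read("original_22050.wav")   # placeholder filename
assert fs == 22050

# 16000 / 22050 reduces to 320 / 441, so resample by that rational factor.
y = resample_poly(x, up=320, down=441)
sf.write("resampled_16000.wav", y, 16000)
```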