mmorise / World

A high-quality speech analysis, manipulation and synthesis system
http://www.kisc.meiji.ac.jp/~mmorise/world/english

DIO vs harvest #46

dreamk73 closed this issue 6 years ago

dreamk73 commented 7 years ago

I saw you at Interspeech last week and promised I would give you some feedback. I am using the WORLD vocoder in Merlin to train acoustic models. I have been running a comparison between different F0 extraction methods. When I ran Harvest this week, I noticed that the BAP error after training is really low, almost zero. That seemed a little suspect to me. The objective distance measures seem a little bit lower than with DIO, but when I listen to the audio, something seems off. I hear more noise for some reason?

I have used other F0 extraction algorithms as well. All of these use the WORLD vocoder and the other parameters are extracted after F0. We have made sure the generated F0 contours have the same number of frames.

Here are the results for our 16 kHz UK English female voice: ex.zip

dio: MCD: 4.815 dB; BAP: 0.158 dB; F0:- RMSE: 22.176 Hz; CORR: 0.726; VUV: 10.589%

harvest: MCD: 4.829 dB; BAP: 0.005 dB; F0:- RMSE: 31.489 Hz; CORR: 0.626; VUV: 7.133%

decompy: MCD: 4.808 dB; BAP: 0.138 dB; F0:- RMSE: 32.638 Hz; CORR: 0.550; VUV: 4.846%

reaper: MCD: 4.941 dB; BAP: 0.143 dB; F0:- RMSE: 29.626 Hz; CORR: 0.636; VUV: 6.551%

swipe: MCD: 4.807 dB; BAP: 0.144 dB; F0:- RMSE: 19.420 Hz; CORR: 0.777; VUV: 6.238%
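For readers unfamiliar with the F0 columns above, here is a minimal sketch of how figures of this kind are commonly computed. It is not taken from Merlin and may differ from its exact evaluation code. It assumes that a frame is unvoiced when its F0 is 0, that RMSE and correlation are taken only over frames voiced in both contours, and that the V/UV error is the percentage of frames whose voicing decisions disagree. The function name `f0_metrics` is hypothetical.

```python
import math


def f0_metrics(ref_f0, gen_f0):
    """Return (rmse_hz, corr, vuv_error_percent) for two F0 contours.

    Assumption: F0 == 0 marks an unvoiced frame, and both contours
    have the same number of frames.
    """
    if len(ref_f0) != len(gen_f0):
        raise ValueError("contours must have the same number of frames")
    if not ref_f0:
        raise ValueError("contours must not be empty")

    # V/UV error: share of frames where the voicing decisions disagree.
    mismatches = sum((r > 0) != (g > 0) for r, g in zip(ref_f0, gen_f0))
    vuv_error = 100.0 * mismatches / len(ref_f0)

    # RMSE and correlation use only frames voiced in both contours.
    pairs = [(r, g) for r, g in zip(ref_f0, gen_f0) if r > 0 and g > 0]
    n = len(pairs)
    if n < 2:
        return float("nan"), float("nan"), vuv_error

    rmse = math.sqrt(sum((r - g) ** 2 for r, g in pairs) / n)

    mean_r = sum(r for r, _ in pairs) / n
    mean_g = sum(g for _, g in pairs) / n
    cov = sum((r - mean_r) * (g - mean_g) for r, g in pairs)
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    var_g = sum((g - mean_g) ** 2 for _, g in pairs)
    denom = math.sqrt(var_r * var_g)
    corr = cov / denom if denom > 0 else float("nan")

    return rmse, corr, vuv_error
```

Under these assumptions, an extractor that marks more frames as voiced changes which frames enter the RMSE and correlation, so the F0 columns are not strictly comparable across extractors with very different V/UV behaviour.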

mmorise commented 7 years ago

Thank you for your interesting report.

The sound synthesized with DIO contains noise, which is probably caused by V/UV detection errors. The sound synthesized with Harvest also contains noise, but of a different kind: it is probably caused by the aperiodicity around 3 kHz. When the aperiodicity is lower than the appropriate value, it produces this kind of strange timbre. In particular, since the aperiodicity at the boundary between voiced and unvoiced segments is not stable, this deterioration is often observed.

However, a brief analysis did not reveal a clear error. Such low aperiodicity values may also be present in the training data.
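One way to follow up on the V/UV boundary remark is to list the frames that sit next to a voicing transition and inspect the aperiodicity there. The sketch below is only an illustration: the helper name `vuv_boundary_frames` and the `margin` parameter are hypothetical, and it assumes that a frame with F0 equal to 0 is unvoiced.

```python
def vuv_boundary_frames(f0, margin=1):
    """Return sorted indices of frames within `margin` frames of a V/UV switch.

    Assumption: F0 == 0 marks an unvoiced frame.
    """
    if margin < 1:
        raise ValueError("margin must be at least 1")

    voiced = [value > 0 for value in f0]
    frames = set()
    for i in range(1, len(voiced)):
        # A transition lies between frame i - 1 and frame i.
        if voiced[i] != voiced[i - 1]:
            start = max(0, i - margin)
            stop = min(len(voiced), i + margin)
            frames.update(range(start, stop))
    return sorted(frames)


# Example: transitions occur between frames 1/2 and frames 4/5.
f0_contour = [0.0, 0.0, 100.0, 100.0, 100.0, 0.0, 0.0]
boundary = vuv_boundary_frames(f0_contour, margin=1)
```

The returned indices could then be used to compare the band aperiodicity at boundary frames against the rest of the utterance for the DIO and Harvest contours.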