using WORLD vocoder with 8000 Hz speech

mmorise / World

A high-quality speech analysis, manipulation and synthesis system

http://www.kisc.meiji.ac.jp/~mmorise/world/english


using WORLD vocoder with 8000 Hz speech #49

Closed tuanad121 closed 6 years ago

tuanad121 commented 6 years ago

I downsampled my speech to 8000 Hz for intelligibility research, but the WORLD vocoder failed to analyze it because number_of_aperiodicity = 0. Recall that number_of_aperiodicity = floor(min(upper_limit, fs / 2 - frequency_interval) / frequency_interval), with frequency_interval = 3000 and upper_limit = 15000. When fs = 8000, this gives floor(1000 / 3000) = 0.

Is there any way I can use the WORLD vocoder on 8000 Hz speech? Thanks for spending your time on my issue.
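As a quick check of the arithmetic above, the formula can be evaluated for a few sampling rates. This is only a minimal sketch: the function name and default arguments are hypothetical and simply mirror the constants quoted in this comment (frequency_interval = 3000, upper_limit = 15000); it is not WORLD's actual API.

```python
import math


def number_of_aperiodicity(fs, frequency_interval=3000.0, upper_limit=15000.0):
    """Evaluate the band-count formula quoted in this issue.

    Hypothetical helper: the defaults are the values mentioned in the
    comment above, not read from the WORLD source code.
    """
    usable = min(upper_limit, fs / 2.0 - frequency_interval)
    return int(math.floor(usable / frequency_interval))


if __name__ == "__main__":
    for fs in (8000, 16000, 44100, 48000):
        print(fs, number_of_aperiodicity(fs))
    # Same 8000 Hz input, but with a narrower interval of 2000 Hz.
    print(8000, number_of_aperiodicity(8000, frequency_interval=2000.0))
```

With the default 3000 Hz interval, the 8000 Hz case evaluates to zero bands, while a 2000 Hz interval yields exactly one.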

mmorise commented 6 years ago

I have never tested WORLD for this purpose. But, I think that it can work in cases where the frequency_interval is set to under 2,000.

I'm afraid I'm too busy to check this problem right now. If this modification does not work, please contact me again. I will check the source code over the weekend.

Here is some background. D4C requires a frequency band to calculate the aperiodicity at a given frequency. When the center frequency is f_c, the band from f_c - 3000 Hz to f_c + 3000 Hz is used. Since D4C therefore requires a frequency range of at least 6,000 Hz, it cannot support 8000 Hz speech, whose Nyquist frequency is only 4,000 Hz. This band width was chosen to cover speech with higher F0. I have never tested other widths, but I think it should also work with ±2000 Hz.
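The band requirement described above can be illustrated with a short sketch that lists the bands fitting entirely below the Nyquist frequency. It assumes that center frequencies sit at integer multiples of the band half-width and are capped at the 15000 Hz upper limit mentioned earlier in the thread; the function name and this layout are assumptions for illustration, not taken from d4c.cpp.

```python
def usable_bands(fs, half_width, upper_limit=15000.0):
    """List (low, center, high) bands that fit below the Nyquist frequency.

    Assumption for illustration: centers are at multiples of half_width,
    and each band spans center - half_width to center + half_width.
    """
    nyquist = fs / 2.0
    bands = []
    center = float(half_width)
    # A band is usable only if its upper edge does not exceed Nyquist.
    while center <= upper_limit and center + half_width <= nyquist:
        bands.append((center - half_width, center, center + half_width))
        center += half_width
    return bands


if __name__ == "__main__":
    print(usable_bands(8000, 3000))   # 6000 Hz wide band, 4000 Hz Nyquist
    print(usable_bands(8000, 2000))   # 4000 Hz wide band, 4000 Hz Nyquist
    print(usable_bands(16000, 3000))
```

Under these assumptions, no ±3000 Hz band fits at fs = 8000, while a single ±2000 Hz band centered at 2000 Hz spans exactly 0 to 4000 Hz.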

tuanad121 commented 6 years ago

Thanks for your comments. I tried a frequency_interval of 2000 Hz for analysis/synthesis, and the quality is not good.

mmorise commented 6 years ago

I briefly checked the source code and confirmed that, as you mentioned, the result is not good. There are several possible causes of the degradation, and it is difficult to solve this problem in a short period.

The value on line 366 of d4c.cpp should be optimized. The current value seems too high, and a lower value may improve the sound quality. Since the appropriate value seems to depend on the input speech, parameter optimization would be required.

One temporary workaround is to change the value on line 366 depending on the speech. (However, other factors also affect the sound quality, so this may not solve your problem.) Ideally, the value should be determined dynamically from the input.