mmorise / World

A high-quality speech analysis, manipulation and synthesis system
http://www.kisc.meiji.ac.jp/~mmorise/world/english

fftlen=512 will introduce noise (burrs) #98

Closed hdmjdp closed 4 years ago

hdmjdp commented 4 years ago

[image] [image]

hdmjdp commented 4 years ago

@mmorise Hi, do you know what causes this?

mmorise commented 4 years ago

Since there are too many possibilities, I cannot answer the question from the provided figures alone. Please give me more detailed information and an explanation, including the waveform and source code.

hdmjdp commented 4 years ago

@mmorise I extract the acoustic features (sp, bap, f0) using the analysis code with fft_len=512, and then I use these features directly to synthesize the wav below.

512.wav.zip

mmorise commented 4 years ago

Thank you for the information.

The cause is insufficient signal length in low-F0 frames. For example, D4C requires a frame length of 4*T0, where T0 = 1/F0 is the fundamental period. When the F0 at a frame is 100 Hz, D4C uses a frame length of 40 ms. The sampling frequency of 512.wav is 24,000 Hz, so a 40 ms frame is 960 samples, and an fft_size of 512 cannot cover it. With an fft_size of 512, the lowest F0 you can use is about 188 Hz (4 × 24,000 / 512 = 187.5).
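The arithmetic above can be reproduced with a short sketch. It only encodes the 4*T0 rule stated in this comment; the function names and the `periods` parameter are hypothetical and are not part of the WORLD API, and WORLD's own code may compute its limits differently.

```python
def frame_length_samples(f0, fs, periods=4):
    """Frame length in samples for a window spanning `periods` pitch periods.

    Assumption: the required length is periods * T0, with T0 = 1 / f0.
    """
    return periods * fs / f0


def min_f0_for_fft_size(fft_size, fs, periods=4):
    """Lowest F0 (Hz) whose periods * T0 frame still fits in fft_size samples."""
    return periods * fs / fft_size


fs = 24000  # sampling frequency of 512.wav, as stated in the thread

print(frame_length_samples(100.0, fs))  # 960.0 samples (40 ms at 24 kHz)
print(min_f0_for_fft_size(512, fs))     # 187.5 Hz, i.e. about 188 Hz
```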

hdmjdp commented 4 years ago

@mmorise Thank you, I will try.

hdmjdp commented 4 years ago

@mmorise But if I use f0_floor=188, the synthesized wav comes out broken, as below. How can I solve this?

188.wav.zip

mmorise commented 4 years ago

Sorry, I should have been clearer: 188 Hz is the lowest F0 that can be analyzed/synthesized with an fft_size of 512 samples. In other words, you can only process speech whose F0 is above about 188 Hz in every frame. Unfortunately, your sample contains frames with an F0 below 188 Hz, so you cannot analyze/synthesize it with an fft_size of 512 samples.
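As a rough way to check a recording against this limit, the sketch below lists the voiced frames whose F0 falls under the limit and finds the smallest power-of-two fft_size that would cover the lowest F0. It assumes the same 4*T0 rule, treats F0 values of 0 as unvoiced frames, and uses made-up F0 values; the helper names are hypothetical and not part of WORLD.

```python
def frames_below_limit(f0, fft_size, fs, periods=4):
    """Indices of voiced frames whose F0 is too low for the given fft_size.

    Assumption: unvoiced frames are marked with F0 == 0 and are skipped.
    """
    limit = periods * fs / fft_size
    return [i for i, value in enumerate(f0) if 0 < value < limit]


def smallest_pow2_fft_size(lowest_f0, fs, periods=4):
    """Smallest power of two that covers periods * T0 at the lowest F0."""
    needed = periods * fs / lowest_f0
    size = 1
    while size < needed:
        size *= 2
    return size


fs = 24000
f0 = [0.0, 220.0, 150.0, 100.0, 200.0]  # illustrative contour, not from 512.wav

print(frames_below_limit(f0, 512, fs))    # [2, 3]
print(smallest_pow2_fft_size(100.0, fs))  # 1024
```

If any index is returned, an fft_size of 512 is too short for that file under this rule, and a larger fft_size would be needed.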