Thank you for your report. I briefly checked the sound quality and speech parameters.
Perhaps the cause is the aperiodicity in the lower frequency band. WORLD skips aperiodicity estimation at 0 Hz because most speech has low aperiodicity at 0 Hz. In your speech, WORLD gives a lower aperiodicity than the original.
If you can use post-processing, you can manipulate the aperiodicity in the lower frequency band. You could remove the buzzy noise by increasing it, but the tuning may be difficult.
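For concreteness, here is a minimal sketch of that kind of post-processing, assuming the pyworld Python bindings; the 2 kHz band edge and the 0.5 aperiodicity floor are illustrative values that would need tuning per voice:

```python
# A minimal post-processing sketch, assuming the pyworld Python bindings.
# The 2 kHz band edge and the 0.5 aperiodicity floor are illustrative
# values, not recommendations.
import numpy as np
import pyworld as pw
import soundfile as sf

x, fs = sf.read('input.wav')           # float64 mono signal
f0, sp, ap = pw.wav2world(x, fs)       # f0, envelope, band aperiodicity

freqs = np.linspace(0, fs / 2, ap.shape[1])
low_band = freqs < 2000.0              # illustrative cutoff

# Raise the aperiodicity floor in the low band: 1.0 means fully
# aperiodic, so this adds noise that can mask the buzzy pulse component.
ap[:, low_band] = np.maximum(ap[:, low_band], 0.5)

y = pw.synthesize(f0, sp, ap, fs)
sf.write('output.wav', y, fs)
```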
I will try to improve this processing later, but I think it is a little risky because there is a possibility that the implementation would introduce other noise. If I can achieve it without degrading the sound quality, I will release it.
FWIW, I've run into similar issues quite frequently. It would be great if this aspect of WORLD could be improved!
In this example, besides the overall reduction of breathiness and the overly harmonic sound, what I find especially noticeable are the low-energy "almost unvoiced" regions, e.g. around 2.44 s and 2.82–2.92 s. These areas have few harmonics besides the fundamental in the input, but lots of harmonics in the resynthesis.
Setting those frames to unvoiced can sometimes sound a little better, but, as you say, it would be nicer if this could be handled through aperiodicity. The current low-resolution band-wise aperiodicity will probably have a hard time reproducing these cases, e.g. just a fundamental plus aperiodic higher frequencies. I have sometimes wondered if something like a voicing frequency could offer a relatively simple workaround for these kinds of cases.
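A rough sketch of the "set those frames to unvoiced" workaround, again assuming pyworld; the relative energy threshold is an illustrative guess:

```python
# In WORLD, f0 = 0 marks a frame as unvoiced, so the synthesizer uses
# noise excitation there instead of a pulse train. The 0.01 threshold
# is an illustrative value.
import numpy as np
import pyworld as pw
import soundfile as sf

def force_quiet_frames_unvoiced(f0, sp, rel_threshold=0.01):
    """Return a copy of f0 with low-energy frames marked unvoiced."""
    energy = sp.sum(axis=1)               # crude per-frame energy proxy
    f0 = f0.copy()
    f0[energy < rel_threshold * energy.max()] = 0.0
    return f0

x, fs = sf.read('input.wav')              # float64 mono signal
f0, sp, ap = pw.wav2world(x, fs)
y = pw.synthesize(force_quiet_frames_unvoiced(f0, sp), sp, ap, fs)
```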
I would be curious about how to best resolve this issue as well.
I have been testing several ideas, but I can't solve this problem yet. The main problem is how to estimate the aperiodicity at 0 Hz or f_o Hz. The current version gives an aperiodicity of 0 at 0 Hz. For much speech this approach works well, but as you know it is not a perfect solution.
I think D4C is inappropriate for solving this problem, so I will attempt to solve it with another idea. If you have an idea, please let me know.
@mmorise thanks for your answer. I appreciate it. You're right, it's a tricky problem. Unfortunately, I don't have any ideas yet. It's out of scope, I know, but: in the source-filter model, excitation pulses are the input to a filter (the vocal tract) whose frequency response is the spectral envelope, and the output of the process is the speech signal. I'm not sure how aperiodicity (which is the ratio between "aperiodic" power and total power) takes part in the process of synthesizing the speech signal. From the scripts, I guess the aperiodicity does something with the excitation pulses and the spectral envelope, but I'm not sure intuitively how it takes part.
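For intuition, here is a conceptual sketch (not WORLD's actual synthesis code) of how an aperiodicity ratio can enter a source-filter synthesizer: each frame's envelope power is split into a pulse-excited part and a noise-excited part. Treating aperiodicity as an amplitude ratio gives the `ap**2` / `1 - ap**2` weighting below; WORLD's exact weighting may differ.

```python
# Conceptual illustration only; WORLD's internal weighting may differ.
import numpy as np

def split_excitation_power(sp_frame, ap_frame):
    """Split one frame's power envelope into periodic/aperiodic parts.

    sp_frame: spectral envelope (power) for one frame
    ap_frame: aperiodicity in [0, 1] per frequency bin
    """
    aperiodic = sp_frame * ap_frame ** 2          # excited by shaped noise
    periodic = sp_frame * (1.0 - ap_frame ** 2)   # excited by pulses at f0
    return periodic, aperiodic
```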
You should check out the vocoder code at https://github.com/gillesdegottex/pulsemodel
@gillesdegottex is doing some excellent work on aperiodicity modelling, and he just added WORLD spectral envelope support to his repository.
@ljuvela: great news, thanks for your reply ^^ I will check it.
Thank you for your interesting information. I'll also check it.
After INTERSPEECH, I tested some ideas but got no good results. In a comparison between STRAIGHT and D4C, D4C achieved relatively better results in many cases. Morphing them could not improve the sound quality. Another model, like the pulse model, may be useful for solving this problem.
Thank you for sharing! I am excited to try a new method.
I use WORLD for analysis/synthesis of a French database. I noticed that re-synthesized released sounds in utterances usually have buzzy noise, and I'm not sure why. I attached the original speech, the synthesized speech, and my script. In this case, the buzzy noise in the re-synthesis can be perceived from 2.451 s to 2.983 s (I attached a snapshot of the segment in segment.png).
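For reference, a minimal pyworld analysis/resynthesis round trip of the kind described; this is not the attached script, and the filenames are placeholders:

```python
# Minimal WORLD round trip via the pyworld bindings; not the attached
# script, just the typical pipeline the report describes.
import pyworld as pw
import soundfile as sf

x, fs = sf.read('original.wav')     # float64 mono signal
f0, sp, ap = pw.wav2world(x, fs)    # f0, spectral envelope, aperiodicity
y = pw.synthesize(f0, sp, ap, fs)
sf.write('resynthesized.wav', y, fs)
```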
Thanks for spending your time on my case. I appreciate it. problem.zip