mmorise / World

A high-quality speech analysis, manipulation and synthesis system
http://www.kisc.meiji.ac.jp/~mmorise/world/english

NaN sometimes introduced in coarse aperiodicity estimation #116

Closed 3628800 closed 3 years ago

3628800 commented 3 years ago

Hi, I apologize for the long delay in creating a test file, but I finally have one! This is the output I get:

================ Audio data ================
Bounds: [-1.000000, 1.000000]
================ Estimated F0 ================
Bounds: [0.000000, 439.614216]
================ Estimated Spectral Envelope ================
Bounds: [0.000000, 63.480847]
================ Estimated Aperiodicity ================
Encountered NaN! count: 6953
Bounds: [0.001000, 1.000000]
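The attached test program is not reproduced in this thread, so the following is only a minimal sketch of how a report like the one above could be produced for a frame-by-bin array such as the aperiodicity output. `ArrayStats` and `ScanFrames` are hypothetical names, not part of the World API, and the sketch assumes NaNs are counted separately and left out of the bounds.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper types, not part of World.
struct ArrayStats {
  long nan_count;    // number of NaN entries found
  double min_value;  // smallest non-NaN value (+inf if none)
  double max_value;  // largest non-NaN value (-inf if none)
};

// Scans a frames x bins array (e.g. double **aperiodicity) and reports
// how many entries are NaN plus the bounds of the remaining entries.
ArrayStats ScanFrames(const double *const *frames, int number_of_frames,
                      int bins_per_frame) {
  assert(frames != nullptr && number_of_frames >= 0 && bins_per_frame >= 0);
  ArrayStats stats = {0, std::numeric_limits<double>::infinity(),
                      -std::numeric_limits<double>::infinity()};
  for (int i = 0; i < number_of_frames; ++i) {
    for (int j = 0; j < bins_per_frame; ++j) {
      const double v = frames[i][j];
      if (std::isnan(v)) {
        ++stats.nan_count;  // NaN compares false with everything, so skip it
        continue;
      }
      if (v < stats.min_value) stats.min_value = v;
      if (v > stats.max_value) stats.max_value = v;
    }
  }
  return stats;
}
```

Keeping NaNs out of the min/max comparison would explain why the bounds above still look sane even though thousands of entries are NaN.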

[ Original "issue" (I guess it was actually a PR...): https://github.com/mmorise/World/pull/92 ]

If you have time, maybe you can take a look? I included a patch file that nominally "solves" the issue, but perhaps there is a better approach.

Please let me know if you have trouble building, I only tested on Mac but I have access to Windows as well.

P.S. I know that a 440Hz sine wave is a bit of a contrived example, but it would be nice if it failed more gracefully... (:
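For anyone who wants to reproduce this without the attachment, a full-scale 440 Hz sine can be generated roughly as follows. This is a sketch only: `MakeSine` is a hypothetical helper, and the sampling rate and length used by the actual test file are not stated in this thread, so they are left as parameters.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: generates a unit-amplitude sine wave.
// fs (sampling rate in Hz) and number_of_samples are assumptions that the
// caller must choose; the thread does not say what the test file used.
std::vector<double> MakeSine(double frequency_hz, int fs,
                             int number_of_samples) {
  assert(frequency_hz > 0.0 && fs > 0 && number_of_samples >= 0);
  const double kPi = 3.14159265358979323846;
  std::vector<double> x(number_of_samples);
  for (int n = 0; n < number_of_samples; ++n) {
    // Phase advances by 2*pi*f/fs radians per sample.
    x[n] = std::sin(2.0 * kPi * frequency_hz * n / fs);
  }
  return x;
}
```

The resulting buffer can then be passed to the analysis functions in place of recorded audio.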

3628800 commented 3 years ago

test.zip

(it would be helpful to actually attach it...)

mmorise commented 3 years ago

Thank you for your comment.

I confirmed this error using your example. I will try to identify the cause, so please give me several days.

mmorise commented 3 years ago

I identified the cause of this problem. LinearSmoothing() in common.cpp outputs NaNs in several frames. Unfortunately, I do not yet have a reasonable way to solve it, so please give me more time. Changing LinearSmoothing() affects other functions (e.g., CheapTrick), so I must modify the program carefully.

3628800 commented 3 years ago

Sure, thanks

mmorise commented 3 years ago

I have finished identifying the problem: a numerical error in floating-point arithmetic causes the NaNs. However, I wonder whether we should fix this bug at all, because only artificial signals trigger it. The signal must satisfy the following conditions to reproduce the error:

  1. The dynamic range of the power spectrum exceeds about 140 dB.
  2. The power stays below the lower limit (maximum value minus 140 dB) over a certain frequency range (around F0 Hz).
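The two conditions above can be checked on a given power spectrum with something like the sketch below. It is an illustration only: `LongestRunBelowFloor` is a hypothetical function, it assumes linear power values (so dB means 10 log10), and it does not model the "around F0 Hz" location, only whether a contiguous run of bins sits below the floor.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical check, not part of World.
// Returns the length (in bins) of the longest contiguous run whose power is
// below (maximum power - range_db). A result of 0 means the dynamic range
// does not exceed range_db anywhere.
int LongestRunBelowFloor(const std::vector<double> &power_spectrum,
                         double range_db) {
  assert(range_db > 0.0);
  if (power_spectrum.empty()) return 0;
  const double max_power =
      *std::max_element(power_spectrum.begin(), power_spectrum.end());
  // Convert the dB range to a linear power ratio: 140 dB -> 1e-14.
  const double floor_power = max_power * std::pow(10.0, -range_db / 10.0);
  int longest = 0;
  int current = 0;
  for (double p : power_spectrum) {
    if (p < floor_power) {
      ++current;
      longest = std::max(longest, current);
    } else {
      current = 0;  // run is broken by a bin at or above the floor
    }
  }
  return longest;
}
```

Calling it with `range_db = 140.0` on one frame's power spectrum gives a rough indication of whether that frame is at risk; how long a run has to be before NaNs appear is not specified here.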

I think these conditions are not satisfied when the input is speech recorded in a real environment. The problem can be solved by adding a safeguard, but that adds processing cost. Alternatively, adding a tiny amount of noise to the signal as pre-processing is effective.
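A minimal sketch of the noise pre-processing idea, applied by the caller before analysis. The noise type and level are not specified in this thread, so the uniform distribution, the `amplitude` parameter, and the function name `AddTinyNoise` are all assumptions to be tuned for the signal at hand.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Hypothetical pre-processing step, not part of World.
// Adds uniform noise in [-amplitude, amplitude) to every sample so that no
// region of the spectrum is exactly silent. The seed makes runs repeatable.
void AddTinyNoise(std::vector<double> *x, double amplitude, unsigned seed) {
  assert(x != nullptr && amplitude > 0.0);
  std::mt19937 generator(seed);
  std::uniform_real_distribution<double> noise(-amplitude, amplitude);
  for (double &sample : *x) {
    sample += noise(generator);
  }
}
```

For example, `AddTinyNoise(&x, 1e-6, 0u)` perturbs a full-scale signal by roughly one millionth of its amplitude. Whether that particular level is enough to avoid the NaNs would need to be confirmed against the test file.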

3628800 commented 3 years ago

That sounds reasonable; I don't think I have encountered this issue using synthetic (computer-generated) speech either--just from the non-speech test signals. If the extra processing overhead is negligible, it might be beneficial to add, but either way, thanks for your time and detailed debugging!