mmorise / World

A high-quality speech analysis, manipulation and synthesis system
http://www.kisc.meiji.ac.jp/~mmorise/world/english

The same passage recorded simultaneously with different microphones yields very different F0? #77

Closed leokwu closed 5 years ago

leokwu commented 5 years ago

[Figure: issue] The top half of the graph shows the correct F0 curve; the bottom half shows the incorrect one.

diff_f0.zip

Problem description:

  1. The two WAV files in the archive contain the same utterance, recorded simultaneously with a good microphone and a poor microphone.
  2. The F0 curve extracted from the good-microphone recording looks clean, while the F0 extracted from the poor-microphone recording is set to 0 in regions where there is speech.

    Is there a problem with the voiced/unvoiced detection in the F0 estimator? Any help or explanation would be appreciated.

    Thanks.

leokwu commented 5 years ago

Supplementary information:

Spectrogram of wwm_ste_good.wav: [Figure: good]

Spectrogram of wwm_ste_bad.wav: [Figure: bad]

leokwu commented 5 years ago

Supplementary information: the attachment "diff_f0.zip" contains the following files.

wwm_ste_good.wav: audio recorded with the good microphone.

wwm_ste_bad.wav: audio recorded with the poor microphone.

wwm_ste_good.f0: generated with the command "f0analysis wwm_ste_good.wav -o wwm_ste_good.f0 -t ".

wwm_ste_bad.f0: generated with the command "f0analysis wwm_ste_bad.wav -o wwm_ste_bad.f0 -t ".

mmorise commented 5 years ago

The poor microphone does not seem to capture the low-frequency band. Since WORLD's F0 estimator relies on the low-frequency component, you cannot obtain an accurate F0 from such a waveform. Humans can still perceive the pitch of this waveform through the missing-fundamental effect, but it is generally difficult to estimate the F0 of such speech.
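As a rough illustration of the missing-fundamental effect mentioned above, the following sketch builds a tone that contains only the 2nd to 5th harmonics of 150 Hz, roughly what a microphone without low-frequency response might capture. It is plain Python and is not WORLD's estimator; all function names and parameter values are assumptions chosen for illustration.

```python
import math

def harmonic_tone(f0, harmonics, fs=8000, n=800):
    """Sum of equal-amplitude sinusoids at the given harmonic numbers of f0."""
    return [sum(math.sin(2 * math.pi * h * f0 * t / fs) for h in harmonics)
            for t in range(n)]

def dft_magnitude(x, freq, fs):
    """Normalised magnitude of the DFT evaluated at one frequency."""
    re = sum(v * math.cos(2 * math.pi * freq * t / fs) for t, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * freq * t / fs) for t, v in enumerate(x))
    return math.hypot(re, im) / len(x)

def autocorr_f0(x, fs, f_min=70.0, f_max=400.0):
    """Pick the lag with the largest autocorrelation inside [f_min, f_max]."""
    lag_lo, lag_hi = int(fs / f_max), int(fs / f_min)

    def ac(lag):
        return sum(x[t] * x[t + lag] for t in range(len(x) - lag))

    best_lag = max(range(lag_lo, lag_hi + 1), key=ac)
    return fs / best_lag

fs = 8000
# Harmonics 2..5 of 150 Hz only: there is no component at 150 Hz itself.
tone = harmonic_tone(150.0, [2, 3, 4, 5], fs=fs, n=800)

energy_at_f0 = dft_magnitude(tone, 150.0, fs)   # essentially zero
energy_at_2f0 = dft_magnitude(tone, 300.0, fs)  # clearly present
estimate = autocorr_f0(tone, fs)                # close to 150 Hz
```

In this toy case the waveform still repeats every 1/150 s, so a periodicity measure lands near 150 Hz even though there is no energy at 150 Hz for a fundamental-based estimator to find.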

leokwu commented 5 years ago

Thank you very much for your reply.

Is there no way to work around this?

Another question: WORLD sometimes seems to identify a formant as the fundamental frequency. Is this related to the fundamental being weak, or can the first formant be approximately equal to the fundamental frequency?

mmorise commented 5 years ago

I think it is difficult to recover the missing component after recording, so I recommend re-recording the speech.

WORLD does not provide a function for estimating formant frequencies. However, as you pointed out, the first formant (F1) may be wrongly output as the F0. (The F1 of the vowel /i/ is observed around 300 Hz. For female speakers with a higher pitch, the F1 occasionally takes a value similar to the F0.)

leokwu commented 5 years ago

Thank you very much for your reply.

(The F1 of the vowel /i/ is observed around 300 Hz. For female speakers with a higher pitch, the F1 occasionally takes a value similar to the F0.)

Is there a way to fix or work around this situation?

mmorise commented 5 years ago

If the F0 is the same as the F1, the estimated result is accurate, so I don't think you need to fix it.

leokwu commented 5 years ago

Thank you very much for your reply.

[Figure: error] As shown in the figure above, the part circled in red is far above the fundamental frequency, yet it is also extracted as the fundamental frequency. Is there any way to fix this?

leokwu commented 5 years ago

Supplement: Is there any good algorithm to evaluate the quality of audio files?

jiangzhengliang commented 5 years ago

Dear mmorise, I also use pyworld to analyze and synthesize speech, and I have run into the same problem as leokwu. I have two questions:

  1. When F0 is not detected and I synthesize speech from the decomposed f0, ap, and sp, the synthesized voice is noisy.
  2. When F0 is not detected, can I use F1 to synthesize the voice? Thanks.

Regards, Bobby

mmorise commented 5 years ago

To @leokwu: This error is called a "double pitch error" and is often observed in F0 estimation. It has several causes, but it tends to occur when the power at the F0 is weak. If you know the F0 range of the input speech, you may obtain an accurate F0 by setting the parameters f0_floor and f0_ceil appropriately.
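If the speaker's rough F0 range is known, the suggestion above amounts to narrowing the search range passed to the estimator. A minimal sketch, assuming the pyworld wrapper mentioned earlier in this thread, whose `harvest` function accepts `f0_floor` and `f0_ceil` keyword arguments. The helper names, the 1.8 ratio, and the sample contour are hypothetical and are not part of WORLD.

```python
from statistics import median

def flag_double_pitch(f0, ratio=1.8):
    """Return indices of voiced frames whose F0 is at least `ratio` times
    the median voiced F0 (a crude, hypothetical octave-jump check)."""
    voiced = [v for v in f0 if v > 0]
    if not voiced:
        return []
    ref = median(voiced)
    return [i for i, v in enumerate(f0) if v > 0 and v >= ratio * ref]

def harvest_in_range(x, fs, f0_floor, f0_ceil):
    """Re-run Harvest with a narrower search range.

    x is expected to be a float64 NumPy array. pyworld is imported lazily
    so that the helper above stays usable without it.
    """
    import pyworld
    f0, t = pyworld.harvest(x, fs, f0_floor=f0_floor, f0_ceil=f0_ceil)
    return f0, t

# Made-up contour in Hz (0 = unvoiced) with two frames near twice the pitch.
contour = [0.0, 118.0, 120.0, 241.0, 122.0, 0.0, 119.0, 238.0, 121.0]
suspects = flag_double_pitch(contour)
```

If frames are flagged, a ceiling somewhat below twice the median would exclude them on the next run. Whether that ceiling is appropriate depends on the speaker's actual range.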

Is there any good algorithm to evaluate the quality of audio files?

For speech sampled at 16 kHz or below, PESQ would be a candidate for this purpose.

To @jiangzl1977

  1. When F0 is not detected and I synthesize speech from the decomposed f0, ap, and sp, the synthesized voice is noisy.

Yes. F0 estimation is the most important part of the vocoding process. Noise is used for synthesis in unvoiced periods, so when a voiced period is wrongly identified as unvoiced, the sound quality is strongly degraded.

  2. When F0 is not detected, can I use F1 to synthesize the voice?

No. F0 is used to generate the vocal cord vibration. It is different from F1, which is a spectral feature of the vocal tract.

leokwu commented 5 years ago

Thank you very much for your reply, @mmorise. [Figure: adjust_f0] As shown in the figure above, adjusting the thresholds f0_floor and f0_ceil corrects the frames in which some of the F0 candidates fall within the threshold range. If none of the candidates fall within the range, the F0 is set to 0. Is it abnormal for it to be set to 0?

About PESQ: PESQ is a full-reference algorithm that analyzes the speech signal sample by sample after temporally aligning corresponding excerpts of the reference and test signals. I would like to do no-reference audio quality evaluation. Is there a good algorithm you can recommend?

mmorise commented 5 years ago

In the WORLD vocoder, the value 0 means that the frame is unvoiced. This value affects the spectral envelope and aperiodicity estimation. In the synthesis stage, noise is used for such frames.
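Because 0 marks an unvoiced frame, the voiced/unvoiced decision can be read directly from the F0 contour. The sketch below is plain Python; the function name and the gap limit are assumptions, and the 5 ms frame period is assumed to match WORLD's default. It lists short unvoiced runs enclosed by voiced frames, which may be worth inspecting as possible detection failures in recordings like the one discussed here.

```python
def unvoiced_gaps(f0, frame_period_ms=5.0, max_gap_ms=50.0):
    """Return (start_index, length) for runs of f0 == 0 that are enclosed
    by voiced frames and last no longer than max_gap_ms."""
    gaps = []
    start = None
    for i, v in enumerate(f0):
        if v == 0:
            if start is None:
                start = i          # a new unvoiced run begins here
        elif start is not None:
            length = i - start
            # A run starting at index 0 is leading silence, not an enclosed gap.
            if start > 0 and length * frame_period_ms <= max_gap_ms:
                gaps.append((start, length))
            start = None
    # A run that reaches the end of the contour is trailing silence; skip it.
    return gaps

# Made-up contour in Hz: leading silence, a 3-frame gap, trailing silence.
f0 = [0, 0, 110, 112, 0, 0, 0, 115, 117, 0, 0]
gaps = unvoiced_gaps(f0)
```

A gap found this way is only a hint: short unvoiced stretches also occur naturally at unvoiced consonants.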

Audio quality evaluation without a reference is generally difficult. I think AutoMOS (https://ai.google/research/pubs/pub45744) suits this purpose, although it targets text-to-speech research. I am not familiar with this algorithm, so please see the paper if you are interested.

leokwu commented 5 years ago

Thank you very much for your reply. A small suggestion about the WORLD source and the Harvest algorithm code: the range of values (upper, lower) can be relaxed. Doing so yields more F0 candidates. For poor-quality audio, this can improve the accuracy of F0 extraction and produce a cleaner F0 curve.

[Figure: adjust_Wc_param_compare] The upper part of the figure shows the result with the default parameters, and the lower part shows the result after adjusting them.

The parameter in question is shown in this screenshot: [Figure: Paper_screenshots]

I have tried many sets of parameters. The result is best with 0.8 and 1.2, and this setting does not affect F0 extraction from audio files with good voice quality.

I also tried adjusting other parameters of the Harvest algorithm (referring to the paper). In my experiments, only adjusting the rule "Harvest removes any estimated candidate that is not included in the range of ωc ± 10%" had a clear positive effect on the results.
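The rule quoted above can be pictured as a simple range check on each candidate. The following is only an illustration of that check in Python, not the Harvest source; the function name, argument names, and sample values are hypothetical. It contrasts the ±10% range with the relaxed 0.8 to 1.2 range reported in this comment.

```python
def keep_candidates(center_hz, estimates_hz, lower=0.9, upper=1.1):
    """Keep estimates lying within [lower * center, upper * center]."""
    return [f for f in estimates_hz
            if lower * center_hz <= f <= upper * center_hz]

# Stands in for the centre frequency (the ωc of the quoted rule), here in Hz.
center = 200.0
# Made-up F0 estimates for this centre frequency.
estimates = [165.0, 185.0, 204.0, 235.0, 250.0]

default_kept = keep_candidates(center, estimates)            # range 0.9 to 1.1
relaxed_kept = keep_candidates(center, estimates, 0.8, 1.2)  # range 0.8 to 1.2
```

The wider range keeps more candidates, which matches the observation in this comment. It may also admit more spurious ones, a trade-off this thread does not evaluate.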

adjust_ωc_experiment.zip

The attached file contains:

wwm_16.wav: the original audio.

wwm_16_syn_org.wav: resynthesized after extracting F0 with the default parameters.

wwm_16_adjust_ωc.wav: resynthesized after extracting F0 with the adjusted ωc range.

I hope this helps improve the Harvest algorithm.

leokwu commented 5 years ago

I have another question. Does background noise in the speech influence F0 extraction? If so, in what ways?

Thanks.

mmorise commented 5 years ago

The parameters in Harvest were optimized using only two speech databases (Japanese and English). The optimal values may depend on the language and the recording conditions, so I think adjusting the parameters for each recording can give good results.

Does background noise in the speech influence F0 extraction? If so, in what ways?

Yes. Harvest has a certain robustness against noise, but correlation-based algorithms (e.g. YIN) are generally more robust. In particular, noise in the lower frequencies degrades the performance of F0 estimation in Harvest.
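For comparison, the core of a correlation-style estimator such as YIN fits in a few lines. This is a simplified sketch in plain Python, assuming a single analysis frame and omitting YIN's parabolic interpolation and best-local-estimate steps. The function name and default values are illustrative, and the code is not part of WORLD.

```python
import math

def yin_f0(x, fs, f_min=70.0, f_max=400.0, threshold=0.1):
    """Simplified YIN: difference function, cumulative mean normalisation,
    and absolute threshold. Returns 0.0 when no period is found."""
    tau_max = int(fs / f_min)
    tau_min = int(fs / f_max)
    w = len(x) - tau_max               # integration window length
    if w <= 0:
        raise ValueError("signal too short for the requested f_min")
    # Step 1: squared-difference function d(tau).
    d = [0.0] * (tau_max + 1)
    for tau in range(1, tau_max + 1):
        d[tau] = sum((x[j] - x[j + tau]) ** 2 for j in range(w))
    # Step 2: cumulative mean normalised difference d'(tau).
    cmnd = [1.0] * (tau_max + 1)
    running = 0.0
    for tau in range(1, tau_max + 1):
        running += d[tau]
        cmnd[tau] = d[tau] * tau / running if running > 0 else 1.0
    # Step 3: first dip below the threshold, followed down to its local minimum.
    tau = tau_min
    while tau <= tau_max:
        if cmnd[tau] < threshold:
            while tau + 1 <= tau_max and cmnd[tau + 1] < cmnd[tau]:
                tau += 1
            return fs / tau
        tau += 1
    return 0.0                         # treated as unvoiced

fs = 8000
# 200 Hz sine: the period is exactly 40 samples at this sampling rate.
sine = [math.sin(2 * math.pi * 200.0 * t / fs) for t in range(400)]
estimate = yin_f0(sine, fs)
```

The estimate depends on the periodicity of the waveform rather than on energy at the fundamental alone. How much this helps on a particular noisy recording would need to be checked on that recording.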