mmorise / World

A high-quality speech analysis, manipulation and synthesis system
http://www.kisc.meiji.ac.jp/~mmorise/world/english

How To Improve Sound Quality To Tune Singing Voice? #108

Closed by wyp19930313 3 years ago

wyp19930313 commented 3 years ago

When I use this algorithm to tune a singing voice, the synthesized sound quality is somewhat poor. How can I improve the quality of the generated sound to match software such as Autotune or ChangBa (唱吧, an intelligent music-tuning app)?

mmorise commented 3 years ago

I think your question is outside the scope of a channel vocoder such as WORLD. For example, a target singing style (e.g., the F0 contour of a professional singer) is generally useful for tuning a singing voice. WORLD is effective for extracting that target singing style, but a separate signal processing technique is required to tune the singing itself. There are several such techniques, and they are independent of the WORLD vocoder.
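As one illustration of a separate pitch-correction step applied to an F0 contour, here is a minimal sketch that snaps each voiced frame to the nearest equal-tempered semitone. The function name `snap_f0_to_semitones` and the A4 = 440 Hz reference are assumptions made for this example. This is not how Autotune or ChangBa work internally, and a practical tuner would probably follow a target melody and smooth note transitions.

```python
import numpy as np

def snap_f0_to_semitones(f0, a4_hz=440.0):
    """Snap voiced frames to the nearest equal-tempered semitone.

    Frames with f0 == 0 (the convention for unvoiced frames) are left at 0.
    Hypothetical helper for illustration; not part of WORLD.
    """
    f0 = np.asarray(f0, dtype=np.float64)
    snapped = np.zeros_like(f0)
    voiced = f0 > 0
    # Convert Hz to fractional MIDI note numbers (69 corresponds to A4)
    midi = 69.0 + 12.0 * np.log2(f0[voiced] / a4_hz)
    # Round to the nearest semitone and convert back to Hz
    snapped[voiced] = a4_hz * 2.0 ** ((np.round(midi) - 69.0) / 12.0)
    return snapped

f0_contour = np.array([0.0, 435.0, 450.0, 262.0])
print(snap_f0_to_semitones(f0_contour))
```

The snapped contour could then stand in for the original F0 array at the synthesis stage, leaving the spectral envelope and aperiodicity untouched.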

wyp19930313 commented 3 years ago

I understand what you mean. Thank you very much for your answer.

When I use the WORLD vocoder, the sampling rate of the voice recording is 16 kHz. After the sound is decomposed and resynthesized, some sound quality is lost. I guess this is caused by inaccurate F0 detection. How can I improve the sound quality after decomposition and synthesis?

mmorise commented 3 years ago

F0 estimation errors can have several causes. If the input speech was recorded in a noisy environment, the problem is difficult to solve. Since the speech waveform is distorted, noise reduction would not be effective. When the speech is clean, tuning the floor and ceiling frequency parameters may improve the estimation performance. If you have been using Dio() for F0 estimation, please use Harvest() instead.
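To make the floor and ceiling tuning concrete, here is a small sketch that derives a search range from the pitch range a singer is expected to cover. The helper `f0_search_range`, the two-semitone margin, and the example note range are all assumptions for illustration. The resulting values would be passed as the `f0_floor` and `f0_ceil` options of the estimator (in pyworld, the keyword arguments of the same names). Narrowing the range only helps if the expected range is actually right.

```python
def f0_search_range(lowest_midi, highest_midi, margin_semitones=2.0, a4_hz=440.0):
    """Return (f0_floor, f0_ceil) in Hz for a MIDI note range plus a safety margin.

    Hypothetical helper for illustration; not part of WORLD or pyworld.
    """
    if lowest_midi > highest_midi:
        raise ValueError("lowest_midi must not exceed highest_midi")

    def midi_to_hz(note):
        # Equal temperament: MIDI note 69 is A4
        return a4_hz * 2.0 ** ((note - 69.0) / 12.0)

    return (midi_to_hz(lowest_midi - margin_semitones),
            midi_to_hz(highest_midi + margin_semitones))

# Example: a song assumed to span C3 (MIDI 48) to C5 (MIDI 72)
f0_floor, f0_ceil = f0_search_range(48, 72)
print(f"f0_floor={f0_floor:.1f} Hz, f0_ceil={f0_ceil:.1f} Hz")
```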

wyp19930313 commented 3 years ago

Indeed, Harvest() works better than Dio(). I record on my phone with headphones, and the noise is so low that I can hardly hear it. But the sound after decomposition and synthesis still contains some noise.

mmorise commented 3 years ago

If the microphone has a function that cuts low-frequency noise, you may not be able to estimate F0 accurately, because both F0 estimators use the fundamental component. You can examine the cause with a power spectral analysis of the whole waveform. If it is difficult for you to carry out the analysis, please send me the audio file and I'll check the cause of the error.
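A rough sketch of such a whole-waveform power spectral check is given below. It compares the energy in a low band, where the fundamental would normally sit, with the energy in the band above it. The helper name `band_energy_db`, the band edges, and the synthetic test signal are assumptions for illustration; with a real recording, `x` would be the loaded waveform. A large gap may suggest that the fundamental has been attenuated, although harmonics can also be stronger than the fundamental in natural voices.

```python
import numpy as np

def band_energy_db(x, sr, f_lo, f_hi):
    """Energy in dB between f_lo and f_hi Hz, from the power spectrum of the whole signal."""
    x = np.asarray(x, dtype=np.float64)
    # A Hann window limits spectral leakage between bands
    power = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    band = (freqs >= f_lo) & (freqs < f_hi)
    # The small constant avoids log(0) for an empty or silent band
    return 10.0 * np.log10(power[band].sum() + 1e-20)

# Synthetic stand-in for a recording whose 150 Hz fundamental is strongly attenuated
sr = 16000
t = np.arange(sr) / sr
x = (0.01 * np.sin(2 * np.pi * 150 * t)
     + 0.5 * np.sin(2 * np.pi * 300 * t)
     + 0.3 * np.sin(2 * np.pi * 450 * t))

low_db = band_energy_db(x, sr, 70, 200)
mid_db = band_energy_db(x, sr, 200, 600)
print(f"70-200 Hz: {low_db:.1f} dB, 200-600 Hz: {mid_db:.1f} dB")
```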

wyp19930313 commented 3 years ago

Hello, thank you very much for your attention.

I changed the F0 of two audio files with the same code. The quality of 1.wav after synthesis is good and acceptable, but the quality of 2.wav after synthesis is poor and unacceptable.

My questions:

  1. What causes this?
  2. How can I improve the quality of 2.wav after synthesis?

Thank you for your help.

Appendix

  1. Audio files: demo.zip
  2. My Python 3 code:

import librosa
import numpy as np
import pyworld as pw
import soundfile as sf  # librosa.output.write_wav was removed in librosa 0.8

if __name__ == '__main__':
    infile = '1.wav'
    out_file = f'tune_{infile}'
    # Load as mono, resampled to 16 kHz, in float64 as pyworld expects
    data, sr = librosa.load(infile, sr=16000, dtype=np.float64)
    # Analysis: F0 contour, spectral envelope, aperiodicity
    f0, t = pw.harvest(data, sr, f0_floor=71, f0_ceil=800, frame_period=5)
    sp = pw.cheaptrick(data, f0, t, sr)
    ap = pw.d4c(data, f0, t, sr)
    # Resynthesize with F0 raised by a factor of 1.5
    pyworld_data = pw.synthesize(f0 * 1.5, sp, ap, sr, frame_period=5)
    sf.write(out_file, pyworld_data, sr)

mmorise commented 3 years ago

I checked the audio file. As the attached figure shows, the waveform is difficult to analyze because the time interval between vocal cord vibrations is unstable. The vocoder cannot extract F0, and the analysis algorithm classifies the segment as unvoiced. For a vocoder-based algorithm, this is an unsolvable problem. Also, since this repository is the C++ version of WORLD, please see the pyworld repository if you need Python-specific information.

[Figure: waveform]

wyp19930313 commented 3 years ago

So the sound quality is reduced after decomposition and synthesis. Is that due to errors in the F0 estimation?

mmorise commented 3 years ago

Yes. F0 estimation errors have several causes. For example, when a voiced section is wrongly estimated as unvoiced, the sound quality of the synthesized speech is severely degraded.
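One way to look for this kind of error is to scan the estimated F0 contour for short unvoiced runs sandwiched between voiced frames. The sketch below does this, assuming unvoiced frames are marked with an F0 of 0, as WORLD's estimators do. The helper name `short_unvoiced_gaps` and the 50 ms threshold are assumptions for illustration. A short gap is only a candidate for a misclassified voiced section, since it may also be a genuine unvoiced consonant.

```python
import numpy as np

def short_unvoiced_gaps(f0, frame_period_ms=5.0, max_gap_ms=50.0):
    """Return (start, end_exclusive) frame ranges of short unvoiced runs.

    A run qualifies if f0 == 0 throughout, it has voiced frames on both
    sides, and it lasts at most max_gap_ms.
    Hypothetical helper for illustration; not part of WORLD.
    """
    voiced = np.asarray(f0) > 0
    gaps = []
    start = None
    for i, is_voiced in enumerate(voiced):
        if not is_voiced and start is None:
            start = i  # an unvoiced run begins here
        elif is_voiced and start is not None:
            # The run [start, i) has just ended at a voiced frame;
            # start > 0 guarantees a voiced frame precedes it
            if start > 0 and (i - start) * frame_period_ms <= max_gap_ms:
                gaps.append((start, i))
            start = None
    return gaps

# Toy contour: a 2-frame gap (10 ms) and a 20-frame gap (100 ms)
f0_contour = np.concatenate([[0, 0, 200, 201, 0, 0, 203, 204],
                             np.zeros(20), [210, 0]])
print(short_unvoiced_gaps(f0_contour))
```

Listening to the reported frame ranges in the original recording is a simple way to judge whether they are real pauses or estimation errors.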