I think your question is outside the scope of a channel vocoder such as WORLD. For example, a target singing style (e.g., the F0 contour of a professional singer) is generally useful for tuning a singing voice. WORLD is effective for obtaining the target singing style, but another signal processing technique is required to apply it and tune the singing voice. There are several signal processing techniques for this purpose, and they are independent of the WORLD vocoder.
I can understand what you mean, thank you very much for your answers.
When I use the WORLD vocoder, the sampling rate of the human voice is 16 kHz. After the sound is decomposed and resynthesized, the sound quality is degraded. I guess this is caused by inaccurate F0 detection. How can I improve the sound quality after decomposition and synthesis?
F0 estimation errors have several possible causes. If the input speech is recorded in a noisy environment, the problem is difficult to solve: since the speech waveform itself is distorted, noise reduction would not be effective. When the speech is clean, tuning the floor and ceil frequency parameters may improve the estimation performance. If you have used Dio() for the F0 estimation, please use Harvest() instead.
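For reference, a minimal sketch of that suggestion (assuming the pyworld Python bindings and an input loaded as float64 at 16 kHz; the file name and the floor/ceil values are placeholders to adapt to the speaker's range):

import numpy as np
import librosa
import pyworld as pw

# Load speech as float64, which pyworld expects.
x, fs = librosa.load('input.wav', sr=16000, dtype=np.float64)

# Harvest is slower than Dio but usually more robust against F0 halving/doubling.
# f0_floor / f0_ceil should bracket the speaker's actual range
# (the values below are placeholders, e.g. roughly 70-400 Hz for typical speech).
f0, t = pw.harvest(x, fs, f0_floor=70.0, f0_ceil=400.0, frame_period=5.0)

# Optional refinement of the coarse F0 contour.
f0 = pw.stonemask(x, f0, t, fs)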
Indeed, the result of using Harvest() is better than Dio(). I used headphones to record on my phone, and the noise is so small that I can hardly hear it. But the sound after decomposition and synthesis still contains some noise.
When the microphone has a function for cutting low-frequency noise, you may not be able to estimate the F0 accurately, because both F0 estimators rely on the fundamental component. You can examine the cause by power spectral analysis of the whole waveform. If it is difficult for you to carry out the analysis, please send the audio file to me and I'll check the cause of the error.
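If it helps, a rough sketch of that check (assuming scipy is available; the file name and the 300 Hz band edge are only placeholder choices): compare the power below the expected fundamental range with the total power of the recording.

import numpy as np
import librosa
from scipy.signal import welch

# File name is a placeholder; use the problematic recording.
x, fs = librosa.load('recording.wav', sr=16000, dtype=np.float64)

# Power spectral density over the whole waveform.
freqs, psd = welch(x, fs=fs, nperseg=4096)

# Fraction of power below ~300 Hz, where the fundamental of most voices lies.
low_band = freqs < 300.0
print(f'power below 300 Hz: {psd[low_band].sum() / psd.sum():.1%}')
# A very small fraction suggests the recording chain (e.g. a low-cut filter)
# removed the fundamental component, which makes F0 estimation unreliable.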
Hello, thank you very much for your attention.
I changed the F0 of both audio files with the same code. The quality of 1.wav is acceptable after synthesis, but the quality of 2.wav is poor and unacceptable.
My question:
Thank you for your help.
Appendix:
1. Audio files: demo.zip
2. My Python 3 code:
import librosa
import numpy as np
import pyworld as pw

if __name__ == '__main__':
    infile = '1.wav'
    out_file = f'tune_{infile}'

    # WORLD expects a float64 waveform; load at 16 kHz.
    data, sr = librosa.load(infile, sr=16000, dtype=np.float64)

    # Analysis: F0 (Harvest), spectral envelope (CheapTrick), aperiodicity (D4C).
    f0, t = pw.harvest(data, sr, f0_floor=71, f0_ceil=800, frame_period=5)
    sp = pw.cheaptrick(data, f0, t, sr)
    ap = pw.d4c(data, f0, t, sr)

    # Resynthesize with the F0 contour scaled by 1.5 (raises the pitch).
    pyworld_data = pw.synthesize(f0 * 1.5, sp, ap, sr, frame_period=5)

    # Note: librosa.output.write_wav was removed in librosa >= 0.8;
    # soundfile.write(out_file, pyworld_data, sr) is an alternative.
    librosa.output.write_wav(out_file, pyworld_data, sr)
I checked the audio file. As shown in the attached figure, it is difficult to analyze the waveform because the temporal interval between vocal fold vibrations is unstable. The vocoder cannot extract the F0, and the analysis algorithm estimates the section as unvoiced. For a vocoder-based algorithm, this is an unsolvable problem. Also, since this repository is the C++ version of WORLD, please see the pyworld repository if you need Python-specific information.
So the sound quality degradation after decomposition and synthesis is due to errors in the F0 estimation?
Yes. There are several causes of F0 estimation error. For example, when a voiced section is wrongly estimated as unvoiced, the sound quality of the synthesized speech is fatally degraded.
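One quick way to see whether this failure mode is occurring (a sketch, assuming the same pyworld pipeline as in the code above; the file name is a placeholder) is to count how many frames the estimator marks as unvoiced, i.e. frames whose returned F0 is zero:

import numpy as np
import librosa
import pyworld as pw

# Use the file whose synthesis sounds degraded (file name is a placeholder).
x, fs = librosa.load('2.wav', sr=16000, dtype=np.float64)
f0, t = pw.harvest(x, fs, f0_floor=71, f0_ceil=800, frame_period=5)

# pyworld marks unvoiced frames with F0 == 0.
unvoiced = f0 == 0
print(f'unvoiced frames: {unvoiced.mean():.1%} of {f0.size}')
# If a clearly voiced passage shows up as mostly unvoiced here,
# the resynthesized speech will sound noisy or whispery in that region.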
When I use this algorithm to tune a singing voice, the synthesized sound quality is a bit poor. How can I improve the quality of the generated sound to achieve the effect of software such as Autotune or ChangBa (唱吧, an intelligent pitch-correction karaoke app)?
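As noted earlier in the thread, the tuning step itself is outside the WORLD vocoder. Purely as an illustration (not Autotune's or ChangBa's actual algorithm), here is a naive sketch that snaps the WORLD F0 contour to the nearest equal-tempered semitone before resynthesis; the file names are placeholders, and a real tuner would also need key detection and smooth pitch transitions:

import numpy as np
import librosa
import pyworld as pw
import soundfile as sf

x, fs = librosa.load('vocal.wav', sr=16000, dtype=np.float64)

# WORLD analysis.
f0, t = pw.harvest(x, fs, f0_floor=71, f0_ceil=800, frame_period=5)
sp = pw.cheaptrick(x, f0, t, fs)
ap = pw.d4c(x, f0, t, fs)

# Snap voiced frames to the nearest equal-tempered semitone (A4 = 440 Hz).
voiced = f0 > 0
midi = 69 + 12 * np.log2(f0[voiced] / 440.0)
f0_tuned = f0.copy()
f0_tuned[voiced] = 440.0 * 2.0 ** ((np.round(midi) - 69) / 12)

# Resynthesize with the quantized F0 contour.
y = pw.synthesize(f0_tuned, sp, ap, fs, frame_period=5)
sf.write('tuned_vocal.wav', y, fs)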