[Closed] chazo1994 closed this issue 4 years ago
I want to clarify what the problem is. When you synthesize audio with natural features, how is the quality? If the quality is still bad, we need to tune the hyperparameters of the MB-MelGAN training.
> When you synthesize the audio with natural features, how is the quality?
The quality of the audio generated from natural features is very good.
Could you share a sample from MB-MelGAN with natural features? If the audio sounds good, I think there is something mismatched between the models. Please describe the feature extraction settings.
@kan-bayashi OK, I will report it tomorrow. I hope you can help me.
@kan-bayashi Here is the sample from MB-MelGAN with natural features (the audio sounds very good): sample.zip. Mel-spectrogram of the sample: I also compared the audio and mel-spectrogram output of nvidia-tacotron2 in the training phase with the input of MB-MelGAN in the training phase; both the audio and the mel-spectrogram are the same between nvidia-tacotron2 and MB-MelGAN when I replace the MB-MelGAN preprocessing stage.
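To make the comparison above concrete, a small helper can check whether two mel-spectrogram arrays agree regardless of the (frames, bins) layout each pipeline uses. This is only a sketch; how the arrays are loaded from each repo is up to the reader.

```python
import numpy as np

def mels_match(mel_a, mel_b, tol=1e-3):
    """Return True if two mel-spectrograms agree within a tolerance.

    Transposes mel_b if the two arrays use opposite (frames, bins)
    layouts, since the two repos store features differently.
    """
    if mel_a.shape != mel_b.shape and mel_a.shape == mel_b.T.shape:
        mel_b = mel_b.T
    if mel_a.shape != mel_b.shape:
        return False
    return float(np.max(np.abs(mel_a - mel_b))) < tol
```

If this returns False for a pair of features that should be identical, the preprocessing pipelines (log base, normalization, or filterbank) differ.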
The feature extraction settings of MB-MelGAN:
```yaml
###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
sampling_rate: 22050     # Sampling rate.
fft_size: 1024           # FFT size.
hop_size: 256            # Hop size.
win_length: 1024         # Window length.
                         # If set to null, it will be the same as fft_size.
window: "hann"           # Window function.
num_mels: 80             # Number of mel basis.
fmin: 80                 # Minimum frequency in mel basis calculation.
fmax: 7600               # Maximum frequency in mel basis calculation.
global_gain_scale: 1.0   # Will be multiplied to all of waveform.
trim_silence: true       # Whether to trim the start and end of silence.
trim_threshold_in_db: 60 # Need to tune carefully if the recording is not good.
trim_frame_size: 2048    # Frame size in trimming.
trim_hop_size: 512       # Hop size in trimming.
format: "hdf5"           # Feature file format. "npy" or "hdf5" is supported.
```
> I also compared the audio and mel-spectrogram output of nvidia-tacotron2 in the training phase with the input of MB-MelGAN in the training phase; both the audio and the mel-spectrogram are the same between nvidia-tacotron2 and MB-MelGAN when I replace the MB-MelGAN preprocessing stage.
OK. How did you perform normalization? Did you use normalized features for both the text2mel and vocoder models?
> OK. How did you perform normalization?
In the case of using this comment, I kept the normalized features in the training phase of MB-MelGAN and used the original preprocessing of nvidia-tacotron2. In the inference phase, I generate a mel-spectrogram from Tacotron2, then convert it with that code to be compatible with MelGAN, and finally generate audio from the converted mel-spectrogram.
In the case of replacing the MB-MelGAN preprocessing with the nvidia-tacotron2 preprocessing, I removed the normalization procedure of MB-MelGAN in both the training and inference stages.
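Mismatched normalization statistics are a common cause of exactly this kind of degradation: the statistics computed over the training features must be applied identically when normalizing the Tacotron2 output at inference. A minimal numpy sketch of mean-variance normalization (the class and its interface are hypothetical, not this repo's API):

```python
import numpy as np

class MelNormalizer:
    """Mean-variance normalization for mel features.

    The same statistics (computed once over the training set) must be
    used both when training the vocoder and when normalizing the
    text2mel output at inference time.
    """

    def __init__(self, mels):
        # mels: iterable of (frames, num_mels) arrays from the training set.
        stacked = np.concatenate(list(mels), axis=0)
        self.mean = stacked.mean(axis=0)
        self.scale = stacked.std(axis=0)

    def normalize(self, mel):
        return (mel - self.mean) / self.scale

    def denormalize(self, mel):
        return mel * self.scale + self.mean
```

If the vocoder is trained on normalized features but fed unnormalized Tacotron2 output (or features normalized with different statistics), the result is typically noisy, discontinuous audio.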
In addition, I generated audio from one mel-spectrogram output of nvidia-tacotron2 with both WaveGlow and MB-MelGAN, and I see that the pulse of the MB-MelGAN output audio is not continuous:
> In the case of replacing the MB-MelGAN preprocessing with the nvidia-tacotron2 preprocessing, I removed the normalization procedure of MB-MelGAN in both the training and inference stages.
OK. Then, did you use the same files to train the vocoder and the text2mel model? If you just replaced the function, please try to generate audio using the mel-spectrogram file which was exactly used for the training of Tacotron2.
> In addition, I generated audio from one mel-spectrogram output of nvidia-tacotron2 with both WaveGlow and MB-MelGAN, and I see that the pulse of the MB-MelGAN output audio is not continuous.
What is the difference compared to the sample you shared? When I listened to your sample, the audio quality was clearly different between the GT and the generated features. So I suspect there is a bug in your code. But if the quality degradation is reasonable, that may be a problem of MB-MelGAN itself.
> If you just replaced the function, please try to generate audio using the mel-spectrogram file which was exactly used for the training of Tacotron2.
I did it this way and got bad audio quality (the same as the Tacotron2 + MB-MelGAN result), so I will debug this point and report the results here. Please wait for my response.
> When I listened to your sample, the audio quality was clearly different between the GT and the generated features.
The quality of the GT audio and of the audio generated by MB-MelGAN from natural features is the same.
> I did it this way and got bad audio quality (the same as the Tacotron2 + MB-MelGAN result), so I will debug this point and report the results here.
Then there is a bug in your code. Please carefully check the differences (e.g., log_e vs. log_10).
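The log-base mismatch mentioned above is worth spelling out: if one pipeline stores log10 mel values and the other stores natural-log values, the conversion is just a constant factor. A sketch of the conversion (assuming both pipelines clamp at the same floor before the log):

```python
import numpy as np

def loge_to_log10(mel_loge):
    # log10(x) = ln(x) / ln(10), so the conversion is a constant divisor.
    return mel_loge / np.log(10.0)

def log10_to_loge(mel_log10):
    # The inverse direction multiplies by the same constant.
    return mel_log10 * np.log(10.0)
```

A vocoder trained on one scale and fed the other sees all its inputs scaled by about 2.3, which is more than enough to break synthesis.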
The problem is resolved. I realized that the cause was that I kept the same fmax and fmin as the default configuration of this repo, while the fmax and fmin of nvidia-tacotron2 are different.
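The resolution can be illustrated: mel filterbanks built with different fmin/fmax place every filter at different center frequencies, so features extracted with this repo's defaults (fmin=80, fmax=7600) are not interchangeable with features extracted using other limits such as fmin=0, fmax=8000. A pure-numpy sketch (the HTK mel formula is used here for brevity; the repos themselves may use the Slaney variant):

```python
import numpy as np

def hz_to_mel(hz):
    # HTK mel scale.
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    # Inverse of the HTK mel scale.
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_band_edges(num_mels, fmin, fmax):
    # Band edges are spaced uniformly on the mel scale between fmin and fmax.
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), num_mels + 2)
    return mel_to_hz(mels)

# Two configurations with different frequency limits yield entirely
# different filter placements, so a vocoder trained on one cannot
# consume features extracted with the other.
edges_repo = mel_band_edges(80, fmin=80.0, fmax=7600.0)
edges_other = mel_band_edges(80, fmin=0.0, fmax=8000.0)
```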
I trained a Multi-band MelGAN model and integrated it with the Nvidia Tacotron2 model; I also used this comment to make it work. But the resulting voice is bad, with discontinuous pitch. The mel-spectrograms below show the difference between the output wave files of tacotron2+waveglow and tacotron2+MB-MelGAN (tacotron2+waveglow has great audio output). I tried to replace the preprocessing of this repo with that of the nvidia-tacotron2 repo, but the result is the same.
Tacotron2+waveglow:
Tacotron2+Mb_Melgan:
Tacotron2+Mb_Melgan (Replaced preprocess):
I have also attached the resulting audio: results.zip
@kan-bayashi Do you have any idea how to fix this problem?