Open Tian14267 opened 3 years ago
If you want to use your model with the pretrained HiFi-GAN as the vocoder, you need to mimic its short-time Fourier transform window and hop length.
The window length for the pretrained HiFi-GAN is 1024/22050 = 0.046439909 s. The hop length is 256/22050 = 0.011609977 s.
Now convert that to 16 kHz: your window length: 16000 × 0.046439909 = 743.038544 ≈ 743; your hop length: 16000 × 0.011609977 = 185.759632 ≈ 186. fmin and fmax stay the same.
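The conversion above can be sketched in a few lines of Python (the constants are taken from the thread; rounding to the nearest integer is an assumption, but it matches the values quoted):

```python
# Converting the pretrained HiFi-GAN STFT parameters (defined at 22050 Hz)
# to their closest equivalents at a 16 kHz sampling rate.

SRC_SR = 22050   # sampling rate the pretrained HiFi-GAN expects
DST_SR = 16000   # sampling rate of the new dataset

src_win = 1024   # HiFi-GAN window / filter length, in samples at 22050 Hz
src_hop = 256    # HiFi-GAN hop length, in samples at 22050 Hz

win_seconds = src_win / SRC_SR   # ~0.046440 s per analysis window
hop_seconds = src_hop / SRC_SR   # ~0.011610 s per hop

dst_win = round(DST_SR * win_seconds)   # -> 743
dst_hop = round(DST_SR * hop_seconds)   # -> 186

print(dst_win, dst_hop)  # 743 186
```

The idea is simply that each mel frame should cover the same *duration* of audio at both sampling rates, so the vocoder sees frames with the time resolution it was trained on.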
So, for example, a 16 kHz preprocess.yaml would look like:
```yaml
preprocessing:
  val_size: 512
  text:
    text_cleaners: ["english_cleaners"]
    language: "en"
  audio:
    sampling_rate: 16000
    max_wav_value: 32767.0
  stft:
    filter_length: 743
    hop_length: 186
    win_length: 743
  mel:
    n_mel_channels: 80
    mel_fmin: 0
    mel_fmax: 8000 # please set to 8000 for HiFi-GAN vocoder, set to null for MelGAN vocoder
  pitch:
    feature: "phoneme_level" # support 'phoneme_level' or 'frame_level'
    normalization: True
  energy:
    feature: "phoneme_level" # support 'phoneme_level' or 'frame_level'
    normalization: True
```
Also set max_wav_value to 32767; there is a bug in the author's implementation here, which will cause artifacts in the generated audio.
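For context, max_wav_value is the divisor used to scale raw int16 PCM samples into floats before spectrograms are computed. The sketch below is a hypothetical stand-in for that step (the real code lives in the repo's preprocessor), illustrating why 32767 is the natural choice: int16 samples range from -32768 to 32767, so dividing by 32767 lets a full-scale positive sample reach exactly 1.0:

```python
# Hypothetical sketch of the normalization step that max_wav_value controls.
# Assumption: samples are raw int16 PCM values in the range -32768..32767.

def normalize_wav(samples, max_wav_value=32767.0):
    """Scale raw int16 samples to floats in roughly [-1, 1]."""
    return [s / max_wav_value for s in samples]

# A full-scale positive sample maps to exactly 1.0 with 32767.0;
# with a divisor of 32768.0 it would peak slightly below 1.0 instead.
print(normalize_wav([32767, 0, -32768]))
```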
Thank you very much! So do I also need to train a new HiFi-GAN model with sample_rate = 16k? Another question: if I fine-tune my own data on the author's model to get my voice, how can I do it the right way? The result of my fine-tuning is bad; maybe I am not doing it the right way?
No, you don't need to train a new HiFi-GAN. Its output will be at 22050 Hz even though you trained FastSpeech on 16 kHz mel spectrograms.
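A quick sanity check on why this works (assumption: the pretrained HiFi-GAN upsamples each mel frame by a fixed factor of 256 samples): because the hop length was rescaled to preserve frame duration, the vocoder emits 22050 Hz audio of almost the same length as the 16 kHz source clip:

```python
# Duration check: a 5-second 16 kHz clip preprocessed with hop_length=186
# should come out of the 22050 Hz HiFi-GAN with nearly the same duration.
# Assumption: HiFi-GAN produces 256 output samples per mel frame.

clip_seconds = 5.0
n_frames = int(clip_seconds * 16000 / 186)   # mel frames from the 16 kHz clip
out_samples = n_frames * 256                 # vocoder output samples
out_seconds = out_samples / 22050            # played back at 22050 Hz

print(n_frames, round(out_seconds, 3))       # ~5 s, small rounding error
```

The small residual mismatch comes from rounding 185.76 up to 186; it is well under 1% and inaudible in practice.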
Your guide is simple to understand, but I saw something very strange with my custom data.
First, I only changed the sampling rate to 16000 (in preprocessor.py: librosa.load(wav_path, sr=16000)). You can hear the difference between the two wavs (37.wav is the raw data at 16000 Hz): the speech is the same, but the voice sounds as if it were read at a low frequency. 37.zip 37_reconstruced.zip
Second, I changed all the parameters to match your config (sr, filter_length, hop_length, win_length). As a result, the voice is also different and the speed is slower. 37_reconstructed_2.zip
Can you explain this and give me some advice?
Did you solve it in the end?
Hello, I have a question here: my data's sample rate is 16k, and I want to use this 16k data to train the model. Which parameters do I need to modify? And for the HiFi-GAN model, how can I get one with sample_rate = 16k, and which parameters should I change?