ming024 / FastSpeech2

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"
MIT License

How can I train this model with data at sample_rate=16k? #125

Open Tian14267 opened 2 years ago

Tian14267 commented 2 years ago

Hello guys, I have a question: my data's sample rate is 16 kHz, and I want to use this 16 kHz data to train the model. Which parameters do I need to modify? And for the HiFi-GAN model, how can I get one with sample_rate=16k, and which parameters should I change?

dunky11 commented 2 years ago

If you want to use your model with the pretrained HiFi-GAN as vocoder, you need to mimic its short-time Fourier transform window and hop length.

The window length of the pretrained HiFi-GAN is 1024/22050 ≈ 0.046439909 s, and the hop length is 256/22050 ≈ 0.011609977 s.

Transforming that to 16 kHz:

window length: 16000 × 0.046439909 = 743.038544 ≈ 743
hop length: 16000 × 0.011609977 = 185.759632 ≈ 186

fmin and fmax stay the same.
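The same conversion as a quick Python sketch (just the arithmetic above, nothing repo-specific):

# Scale the pretrained HiFi-GAN STFT parameters from 22050 Hz to a new rate
# by keeping the window and hop durations (in seconds) fixed and rounding
# back to whole samples at the target rate.
PRETRAINED_SR = 22050
PRETRAINED_WIN = 1024  # window / filter length in samples
PRETRAINED_HOP = 256   # hop length in samples

def scale_stft_params(target_sr):
    win = round(PRETRAINED_WIN / PRETRAINED_SR * target_sr)
    hop = round(PRETRAINED_HOP / PRETRAINED_SR * target_sr)
    return win, hop

print(scale_stft_params(16000))  # -> (743, 186)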

So for example 16khz preprocess.yaml would look like:

preprocessing:
  val_size: 512
  text:
    text_cleaners: ["english_cleaners"]
    language: "en"
  audio:
    sampling_rate: 16000
    max_wav_value: 32767.0
  stft:
    filter_length: 743
    hop_length: 186
    win_length: 743
  mel:
    n_mel_channels: 80
    mel_fmin: 0
    mel_fmax: 8000 # please set to 8000 for HiFi-GAN vocoder, set to null for MelGAN vocoder
  pitch:
    feature: "phoneme_level" # support 'phoneme_level' or 'frame_level'
    normalization: True
  energy:
    feature: "phoneme_level" # support 'phoneme_level' or 'frame_level'
    normalization: True

Also set max_wav_value to 32767; the value in the author's implementation is a bug, which will cause artifacts in the generated audio.
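To see why, here is a minimal illustration of the overflow (assuming the usual write path where normalized floats are scaled and cast to int16; that this is the exact code path in this repo is my assumption):

import numpy as np

# int16 audio spans [-32768, 32767]. A full-scale sample of 1.0 scaled by
# 32768 lands one step outside that range, and the int16 cast wraps it to
# -32768 on typical platforms: a full-amplitude click in the output.
wav = np.array([0.5, 1.0, -1.0])

print((wav * 32768.0).astype(np.int16))  # [ 16384 -32768 -32768]  <- 1.0 wrapped
print((wav * 32767.0).astype(np.int16))  # [ 16383  32767 -32767]  <- all in range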

Tian14267 commented 2 years ago

Thank you very much! So do I also need to train a new HiFi-GAN model with sample_rate=16k? Another question: if I fine-tune the author's model on my own data to get my voice, what is the right way to do it? My fine-tuning result is bad; maybe I am not doing it the right way?

dunky11 commented 2 years ago

No, you don't need to train a new HiFi-GAN; the output of HiFi-GAN will be 22050 Hz even though you trained FastSpeech on 16 kHz mel spectrograms.
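A back-of-the-envelope check on why the timing still works out (my arithmetic, not repo code): with the scaled hop length, one mel frame covers nearly the same duration on both sides, so the 22050 Hz waveform plays back at essentially the original speed.

# Duration represented by one mel frame under each setup:
fs2_frame = 186 / 16000    # FastSpeech 2 side: ~0.011625 s per frame
hifi_frame = 256 / 22050   # HiFi-GAN side:     ~0.011610 s per frame

n_frames = 1000                  # a hypothetical utterance
print(n_frames * fs2_frame)      # ~11.63 s of speech represented by the mels
print(n_frames * hifi_frame)     # ~11.61 s of audio synthesized (~0.13% shorter)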

dohuuphu commented 2 years ago

Your guide is very easy to understand, but I saw something strange with my custom data:

  1. I only changed the sampling rate to 16000 (preprocessor.py: librosa.load(wav_path, sr=16000)). You can hear the difference between the two wavs (37.wav is the raw data at 16000 Hz): the speech is the same, but the voice sounds as if it were spoken at a lower frequency. 37.zip 37_reconstruced.zip

  2. I changed all the parameters to your config (sr, filter_length, hop_length, win_length). As a result, the voice is also different, and the speed is slower. 37_reconstructed_2.zip

Can you explain and give me some advice?

leslie2046 commented 2 years ago

mark

leslie2046 commented 2 years ago

If you want to use your model with the pretrained HiFi-GAN as vocoder, you need to mimic its short-time Fourier transform window and hop length.

Was this ever solved?