Retraining the model with 16k sampling rate data

ming024 / FastSpeech2

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

MIT License


Retraining the model with 16k sampling rate data #69

Open Adliyan opened 3 years ago

Adliyan commented 3 years ago

I changed the sampling rate parameter in preprocess.yaml to 16k and retrained FastSpeech2 and HiFi-GAN on data with a 16k sampling rate, but the synthesized speech sounds very strange. Played back at 16k, it is very slow. Played back at 22k, it sounds much more normal, as if the synthesized voice were still at a sampling rate of 22k. So I want to ask: are there other parameters in the model that affect the sampling rate of the synthesized speech?
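The symptom described here is what you would expect if the waveform samples were produced at about 22050 Hz while the output file is labelled 16000 Hz. The sketch below only works through that arithmetic; the 3-second utterance length is a made-up example value and the function name is hypothetical, not part of the repository.

```python
def playback_duration(num_samples: int, playback_rate_hz: int) -> float:
    """Seconds needed to play num_samples at the given playback rate."""
    return num_samples / playback_rate_hz


# Assumption for illustration: the samples actually represent 22050 Hz audio.
content_rate_hz = 22050
num_samples = 3 * content_rate_hz  # a hypothetical 3-second utterance

normal = playback_duration(num_samples, 22050)  # played at the rate it was made for
slow = playback_duration(num_samples, 16000)    # played as if it were 16 kHz audio

# Ratio of the two durations: how much longer the 16 kHz playback takes.
slowdown = slow / normal
print(f"normal: {normal:.3f} s, at 16 kHz: {slow:.3f} s, slowdown: {slowdown:.3f}x")
```

A slowdown close to 22050 / 16000 in the real output would suggest that some step in the pipeline is still working at 22050 Hz.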

yileld commented 3 years ago

I encountered the same problem, but I didn't retrain HiFi-GAN, so I thought that was the reason. For now I just resample the wav files to 22050 and retrain. Did you change every step that needs the sampling rate? For example, preprocess.py line 172, which reads the wav file, has no sampling rate parameter in the original code.

dan-wells commented 3 years ago

Changing that single line in preprocessor/preprocessor.py fixed this issue for me, training with 16 kHz audio. Thanks for the pointer!

azman-i commented 3 years ago

@dan-wells where did you change the sampling rate? Can you please share the code? And can we use wav files with different sampling rates in the dataset for this model?
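On the mixed sampling rate question, one way to find out what a dataset actually contains is to count the rates stored in the wav headers. This is a minimal sketch using only the Python standard library; the function name is hypothetical, and the `wave` module only reads uncompressed PCM wav files, so other encodings would need a different reader.

```python
import wave
from collections import Counter
from pathlib import Path


def count_sample_rates(wav_dir: str) -> Counter:
    """Count how many .wav files under wav_dir use each sampling rate."""
    counts = Counter()
    for path in sorted(Path(wav_dir).rglob("*.wav")):
        # Only the header is needed to read the sampling rate.
        with wave.open(str(path), "rb") as wav_file:
            counts[wav_file.getframerate()] += 1
    return counts
```

Calling `count_sample_rates("path/to/wavs")` returns a mapping from sampling rate to file count. If more than one rate shows up, the files need to be brought to a single rate at load time or beforehand; `librosa.load` resamples to whatever `sr` it is given.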

massimo1980 commented 3 years ago

@azman63

I suppose, from this: wav, _ = librosa.load(wav_path) to this: wav, _ = librosa.load(wav_path, sr=16000)
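A variation on the one-line change above is to take the rate from the preprocessing config instead of hard-coding 16000, so that the loader and the rest of the pipeline cannot drift apart. This is only a sketch: the nested key path `preprocessing -> audio -> sampling_rate` is an assumption about how preprocess.yaml is laid out, and both helper names are hypothetical.

```python
def get_sampling_rate(config: dict) -> int:
    """Read the target sampling rate from a parsed preprocessing config.

    The key path used here is an assumption about the config layout.
    """
    return int(config["preprocessing"]["audio"]["sampling_rate"])


def load_wav(wav_path: str, config: dict):
    """Load a wav file resampled to the configured sampling rate."""
    # Imported lazily so get_sampling_rate works without librosa installed.
    import librosa

    # Without sr=..., librosa.load resamples to its default of 22050 Hz.
    wav, _ = librosa.load(wav_path, sr=get_sampling_rate(config))
    return wav


# Hypothetical config fragment standing in for the parsed YAML file.
example_config = {"preprocessing": {"audio": {"sampling_rate": 16000}}}
print(get_sampling_rate(example_config))
```

Passing `sr=None` to `librosa.load` would instead keep each file's native rate, which is only safe when every file already matches the configured value.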

aidosRepoint commented 2 years ago

Quoting dan-wells: "Changing that single line in preprocessor/preprocessor.py fixed this issue for me, training with 16 kHz audio. Thanks for the pointer!"

Thank you so much! I completely forgot that librosa loads audio at 22050 Hz by default.