microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License

Sample Rates are different between speech pre-training dataset and tts dataset #14

Closed by Maggione 2 years ago

Maggione commented 2 years ago

Hi, the SpeechT5 paper uses LibriSpeech 960h (16 kHz sample rate) as the speech pre-training dataset, but LibriTTS (24 kHz sample rate) as the TTS dataset. How do you deal with this mismatch? Thank you! :)

mechanicalsea commented 2 years ago

Hi Maggione, since SpeechT5 takes 16 kHz waveforms as input, we resampled the LibriTTS waveforms from 24 kHz to 16 kHz. The down-sampling details are as follows.

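import librosa
import soundfile as sf

# file = ...      (path to the original 24 kHz LibriTTS waveform)
# new_file = ...  (path for the resampled 16 kHz output)
audio, fs = sf.read(file)  # fs is the file's native sample rate (24000 for LibriTTS)
# Keyword arguments are required by librosa >= 0.10 and also accepted by older releases.
x = librosa.resample(audio, orig_sr=fs, target_sr=16000)
sf.write(str(new_file), x, 16000)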
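As a quick sanity check on the conversion above, 24 kHz to 16 kHz is a rational 2/3 ratio, so the resampled waveform should have roughly two thirds as many samples as the original. The sketch below is not part of the SpeechT5 code; the helper names `resample_ratio` and `expected_length` are hypothetical, and rounding up is an assumption about how the resampler sizes its output.

```python
from math import gcd
from typing import Tuple


def resample_ratio(orig_sr: int, target_sr: int) -> Tuple[int, int]:
    """Return (up, down) in lowest terms, so target_sr / orig_sr == up / down."""
    g = gcd(orig_sr, target_sr)
    return target_sr // g, orig_sr // g


def expected_length(n_samples: int, orig_sr: int, target_sr: int) -> int:
    """Approximate output length after resampling (assumes rounding up)."""
    up, down = resample_ratio(orig_sr, target_sr)
    # Integer ceiling division: negate, floor-divide, negate again.
    return -(-n_samples * up // down)


up, down = resample_ratio(24000, 16000)
one_second = expected_length(24000, 24000, 16000)
print(up, down, one_second)
```

With the variables from the snippet above, comparing `len(x)` against `expected_length(len(audio), fs, 16000)` gives a cheap way to spot files that were not actually recorded at 24 kHz; a difference of a sample or so may just reflect a different rounding convention.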