I'm trying to use my own data for training. I've tested with the LJSpeech dataset, which produces speech-like audio after even a few thousand steps. Yet, training on my dataset (16000 Hz), the output is plain noise even after 40,000 steps. I'm assuming this is because of the audio hparams settings, where I changed the sample rate from 20000 to 16000, but I'm not sure what the other values should be. For 20000 Hz audio the frame lengths are much shorter than the default setting, and I'm not sure what the frame shift is used for either. Is this something you tune by hand, or is there a way to calculate these values? Thanks.
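For context, here's how I currently understand the millisecond-based hparams mapping to STFT parameters. This is just a sketch of my assumption; the names (`sample_rate`, `frame_shift_ms`, `frame_length_ms`) and default values are guesses based on typical Tacotron-style hparams, and may not match this repo exactly:

```python
# Sketch of how I assume the ms-based hparams become sample counts.
# Names and defaults below are my assumptions, not necessarily this repo's.

sample_rate = 16000      # my data (the repo default is 20000)
frame_shift_ms = 12.5    # assumed default: hop between consecutive frames
frame_length_ms = 50.0   # assumed default: analysis window size

# Convert milliseconds to samples for the STFT.
hop_length = int(frame_shift_ms / 1000 * sample_rate)   # 200 samples at 16 kHz
win_length = int(frame_length_ms / 1000 * sample_rate)  # 800 samples at 16 kHz

print(hop_length, win_length)
```

If that's roughly how the code computes things, then keeping the ms values fixed and only changing `sample_rate` should still give consistent frames, which is why I'm confused about what else needs adjusting.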