Hello! I am working with jsut. The model generates synthesized audio at every 10,000-step checkpoint. But when I try to synthesize my own text with that same checkpoint, I see a huge difference in quality. In fact, the custom synthesized audio is nowhere near the audio generated during training.
Why does this happen? And what exactly is the audio that is generated during training?