Open egorsmkv opened 6 months ago
Consider your data distribution. If your dataset contains many short utterances, masking the loss over a 3 s prompt would be devastating, because little of each utterance is left to train on. A shorter prompt, on the other hand, leads to poor speaker similarity. I tried a 1 s prompt replication method like Hierspeech++, but it didn't work well.
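A quick way to check this is to look at how much of each clip a fixed-length prompt would cover. The snippet below is only a rough sketch, not code from pflowtts: the function name `prompt_coverage_report`, the 3 s default, and the sample durations are all made up for illustration, and it assumes you already have a list of clip durations in seconds.

```python
from statistics import median


def prompt_coverage_report(durations, prompt_seconds=3.0):
    """Summarise how much of each clip a fixed-length prompt would cover.

    `durations` is a list of clip lengths in seconds (hypothetical input;
    compute it however your data pipeline allows).
    """
    if not durations:
        raise ValueError("durations is empty")
    if any(d <= 0 for d in durations):
        raise ValueError("all durations must be positive")

    n = len(durations)
    # Clips that are no longer than the prompt leave nothing unmasked.
    not_longer = sum(1 for d in durations if d <= prompt_seconds)
    # Fraction of each clip that would still contribute to the loss
    # if a prompt_seconds-long region were masked out.
    unmasked = [max(d - prompt_seconds, 0.0) / d for d in durations]

    return {
        "clips": n,
        "median_seconds": median(durations),
        "frac_not_longer_than_prompt": not_longer / n,
        "mean_unmasked_fraction": sum(unmasked) / n,
    }


# Illustrative durations only, not real data.
report = prompt_coverage_report([1.5, 2.0, 4.0, 6.0, 12.0])
for key, value in report.items():
    print(f"{key}: {value}")
```

If `frac_not_longer_than_prompt` is large or `mean_unmasked_fraction` is small, it may be worth filtering out very short clips or trying a shorter prompt, keeping in mind the speaker-similarity trade-off mentioned above.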
@syj901220 thanks, I'll look into the data.
I am training pflowtts for Ukrainian with phonemes and seeing the following:
Loss is going down:
What could be causing this, and how can I debug this distortion further?