p0p4k / pflowtts_pytorch

Unofficial implementation of NVIDIA P-Flow TTS paper
https://neurips.cc/virtual/2023/poster/69899
MIT License

Distortion of audio #20

Open egorsmkv opened 6 months ago

egorsmkv commented 6 months ago

I am training pflowtts for Ukrainian with phonemes and seeing the following:

[Image: telegram-cloud-photo-size-2-5233706955334340576-y]

Loss is going down:

[Image: telegram-cloud-photo-size-2-5233706955334340577-y]

What could be causing this, and how can I debug the distortion further?

syj901220 commented 6 months ago

Check your data distribution. If your dataset contains many short utterances, the 3 s masking loss would be devastating, because the masked prompt segment would cover most or all of each clip. Short prompts also lead to poor speaker similarity. I tried a 1 s prompt replication method like the one in HierSpeech++, but it did not work well.
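A quick way to check this is to look at clip durations relative to the prompt length. The snippet below is only a rough sketch, not code from this repo: `summarize_durations` and `PROMPT_SECONDS` are hypothetical names, the 3-second value is taken from the comment above, and the durations are assumed to be precomputed in seconds by whatever tooling you already use.

```python
from statistics import median

# Assumption: length in seconds of the prompt segment masked out of the loss.
PROMPT_SECONDS = 3.0


def summarize_durations(durations, prompt_seconds=PROMPT_SECONDS):
    """Summarize how much of each clip remains outside the masked prompt."""
    if not durations:
        raise ValueError("durations must not be empty")
    if any(d <= 0 for d in durations):
        raise ValueError("durations must be positive")

    # Clips that are no longer than the prompt leave nothing unmasked.
    too_short = sum(1 for d in durations if d <= prompt_seconds)

    # Fraction of each clip still contributing to the loss after masking.
    unmasked = [max(d - prompt_seconds, 0.0) / d for d in durations]

    return {
        "count": len(durations),
        "median_seconds": median(durations),
        "fraction_not_longer_than_prompt": too_short / len(durations),
        "mean_unmasked_fraction": sum(unmasked) / len(unmasked),
    }


if __name__ == "__main__":
    # Hypothetical durations in seconds; replace with values from your dataset.
    print(summarize_durations([1.5, 2.0, 3.0, 6.0, 12.0]))
```

If `fraction_not_longer_than_prompt` is large or `mean_unmasked_fraction` is small, the dataset is probably dominated by clips that are too short for a 3 s prompt, and filtering or concatenating short clips may be worth trying.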

egorsmkv commented 6 months ago

@syj901220 thanks, I'll look into the data.