p0p4k / pflowtts_pytorch

Unofficial implementation of NVIDIA P-Flow TTS paper
https://neurips.cc/virtual/2023/poster/69899
MIT License

pflow/data/text_mel_datamodule.py __getitem__ #5

Closed zidsi closed 8 months ago

zidsi commented 8 months ago
if len(datapoint["wav"]) < 66150:

datapoint["wav"] is nested ([[...]]), so its len is always 1. datapoint["wav"][0] might be what you are looking for.

And for multispeaker datasets, picking a random sample from the same speaker might be a better alternative.
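
A minimal sketch of both suggestions, not the repository's actual code. It assumes each datapoint is a dict whose "wav" entry is laid out as [channel][sample] and that a speaker id is stored under a hypothetical "spk" key; the helper names `num_samples` and `pick_long_enough` are made up for illustration. The 66150 threshold comes from the line quoted above (it would correspond to 3 seconds if the sample rate is 22050 Hz).

```python
import random

MIN_SAMPLES = 66150  # threshold taken from the quoted check


def num_samples(datapoint):
    # "wav" is assumed to be [channel][sample], so len(datapoint["wav"])
    # counts channels (1 here); index the first channel to count samples.
    return len(datapoint["wav"][0])


def pick_long_enough(dataset, index, min_samples=MIN_SAMPLES, rng=random):
    """Return dataset[index] if it is long enough; otherwise return a random
    long-enough item from the same speaker ("spk" is a hypothetical key)."""
    datapoint = dataset[index]
    if num_samples(datapoint) >= min_samples:
        return datapoint
    candidates = [
        d for d in dataset
        if d["spk"] == datapoint["spk"] and num_samples(d) >= min_samples
    ]
    # If the speaker has no long-enough clip, fall back to the original item.
    return rng.choice(candidates) if candidates else datapoint
```

With a tensor of shape (1, T) the same point applies: `len()` reports the first dimension, so the sample count has to be read from the second one.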

p0p4k commented 8 months ago

Yes, absolutely right. Did you train this model on any data?

zidsi commented 8 months ago

Still struggling with configs (hydra) to start training.

p0p4k commented 8 months ago

Let me know where you have any bugs/errors so I can help you get the model running.

zidsi commented 8 months ago

Thanks. Getting there. The nohup-redirected log didn't show any progress, but TensorBoard is looking good. I'll have to try the E2E branch or train HiFiGAN on the custom dataset, though, before any comparison to VITS2.

p0p4k commented 8 months ago

E2E won't work very well yet since there is no KL divergence loss; CFM target density calculation is not supported in my code yet. I will take the DiffWave route and see what I can do. The first goal is to get mel generation working at an acceptable quality. Hope you get good results on your training; please share your experience so we can fix any bugs in the architecture.

zidsi commented 7 months ago

Single speaker training is progressing. So far I have used HiFiGAN V1. The VITS version performs better on an unseen speaker than the LJSpeech one. However, at the moment BigVGAN is better than both HiFiGANs. Still a long way to go to be even close to VITS2 (1m).


Regarding DiffWave... it is possible to eliminate the mel spectrogram "proxy" completely, as in https://arxiv.org/pdf/2311.00945.pdf

p0p4k commented 7 months ago

Can you add me on discord? (id: p0p4k)