Yes, absolutely right. Did you train this model on any data?
Still struggling with the Hydra configs to get training started.
Let me know if you hit any bugs/errors so I can help you get the model running.
Thanks. Getting there. The nohup-redirected log didn't show any progress, but TensorBoard is looking good. I'll have to try the E2E branch or train HiFiGAN on the custom dataset though before any comparison to VITS2.
E2E won't work very well yet since there is no KL-divergence loss, and the CFM target density calculation is not supported in my code yet. I will take the DiffWave route and see what I can do. The first goal is to get mel generation working at acceptable quality. Hope you get good results from your training; please share your experience so we can fix any bugs in the architecture.
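For reference, here is a minimal sketch of a standard OT-CFM training pair (the conditional flow-matching construction from Lipman et al., also used in Matcha-TTS). This is not code from this repo; the function name and shapes are my assumptions:

```python
import torch

def ot_cfm_pair(x1, sigma_min=1e-4):
    """Build (x_t, u_t, t) for a batch of clean targets x1 (e.g. mel frames).

    Hypothetical helper, not this repo's API; follows the standard OT-CFM recipe.
    """
    x0 = torch.randn_like(x1)                        # noise sample from the prior
    t = torch.rand(x1.size(0), *([1] * (x1.dim() - 1)), device=x1.device)
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1    # straight-line interpolant
    u_t = x1 - (1 - sigma_min) * x0                  # regression target for v(x_t, t)
    return x_t, u_t, t
```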
Single-speaker training is progressing. So far I used HiFiGAN V1; the VITS version performs better for the unseen speaker than the LJSpeech one. However, better than both HiFiGANs at the moment is BigVGAN. Still a way to go to get even close to VITS2 (1m).
Regarding DiffWave... it is possible to eliminate the mel-spectrogram "proxy" completely, as in https://arxiv.org/pdf/2311.00945.pdf
Can you add me on discord? (id: p0p4k)
datapoint['wav'] is [[...]] and its len is always 1, so datapoint["wav"][0] might be what you are looking for.
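In other words, a toy illustration of the shape (values are made up):

```python
# Toy example of the structure described above:
datapoint = {"wav": [[0.01, -0.02, 0.03]]}  # audio wrapped in an extra list
assert len(datapoint["wav"]) == 1           # the outer list always has length 1
wav = datapoint["wav"][0]                   # unwrap to get the actual samples
```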
And for multi-speaker datasets, a random utterance from the same speaker might be a better alternative.
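Something like this sketch, assuming a hypothetical speaker-to-utterances index (not this repo's data layout):

```python
import random

def sample_prompt_same_speaker(utts_by_speaker, speaker_id, exclude=None):
    """Pick a random prompt utterance from the same speaker.

    utts_by_speaker: hypothetical dict mapping speaker id -> list of utterance ids.
    exclude: optionally skip the target utterance itself.
    """
    candidates = [u for u in utts_by_speaker[speaker_id] if u != exclude]
    return random.choice(candidates)
```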