idiap / coqui-ai-TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
https://coqui-tts.readthedocs.io
Mozilla Public License 2.0

A prototype for Vits 2 / Yourtts 2 #137

Open Marioando opened 2 weeks ago

Marioando commented 2 weeks ago

Hi, here is my prototype for VITS2. The text encoder is not conditioned on the speaker.

pivolan commented 2 weeks ago

Hi, do you have an example or a comparison between VITS and VITS2?

eginhard commented 2 weeks ago

Cool, thank you! I'll give some comments next week. Could you add an example training recipe, e.g. based on https://github.com/idiap/coqui-ai-TTS/blob/dev/recipes/ljspeech/vits_tts/train_vits.py ? And do you have some samples to share?
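For reference, a rough sketch of what such a recipe could look like, modelled on the existing `train_vits.py`. The `Vits2Config` and `Vits2` names are hypothetical placeholders for whatever the prototype ends up exposing; the rest uses the library's existing helpers.

```python
import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

# Hypothetical imports: the prototype's config/model classes,
# assumed to mirror VitsConfig / Vits.
from TTS.tts.configs.vits2_config import Vits2Config
from TTS.tts.models.vits2 import Vits2

output_path = os.path.dirname(os.path.abspath(__file__))

dataset_config = BaseDatasetConfig(
    formatter="ljspeech",
    meta_file_train="metadata.csv",
    path=os.path.join(output_path, "LJSpeech-1.1/"),
)

# Assumed to accept the same common fields as VitsConfig.
config = Vits2Config(
    batch_size=32,
    eval_batch_size=16,
    num_loader_workers=4,
    run_eval=True,
    epochs=1000,
    text_cleaner="english_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    print_step=25,
    output_path=output_path,
    datasets=[dataset_config],
)

ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

model = Vits2(config, ap, tokenizer, speaker_manager=None)
trainer = Trainer(
    TrainerArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```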

Marioando commented 2 weeks ago

The model is still training, but here are some samples: vits2_audio_samples.zip.tar.gz. I trained using d-vectors. Also, the duration discriminator is conditioned on the speaker, which I think should be an improvement over the original VITS2.
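For context, conditioning a duration discriminator on the speaker is usually done by projecting the speaker embedding with a 1x1 convolution and adding it to the hidden text representation before scoring the durations, the same global-conditioning pattern VITS uses elsewhere. A minimal sketch of that pattern (module and names are illustrative, not the prototype's actual code):

```python
import torch
from torch import nn


class SpeakerConditionedDurationDiscriminator(nn.Module):
    """Illustrative sketch: scores (text hidden states, durations) pairs as
    real/fake, with the speaker embedding ``g`` added as a global condition."""

    def __init__(self, in_channels: int, hidden_channels: int, gin_channels: int):
        super().__init__()
        self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
        self.dur_proj = nn.Conv1d(1, hidden_channels, 1)         # durations -> hidden
        self.cond = nn.Conv1d(gin_channels, hidden_channels, 1)  # speaker embedding -> hidden
        self.net = nn.Sequential(
            nn.Conv1d(2 * hidden_channels, hidden_channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_channels, 1, 1),
        )

    def forward(self, x, durations, g=None):
        # x: [B, in_channels, T] text-encoder hidden states
        # durations: [B, 1, T] predicted or ground-truth durations
        # g: [B, gin_channels, 1] speaker embedding (e.g. a d-vector), optional
        h = self.pre(x)
        if g is not None:
            h = h + self.cond(g)  # broadcast speaker conditioning over time
        d = self.dur_proj(durations)
        return self.net(torch.cat([h, d], dim=1))  # [B, 1, T] real/fake logits
```

The actual discriminator in the PR may differ in depth and normalization; the relevant part here is only the added `g` path.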

pivolan commented 2 weeks ago

> The model is still training, but here are some samples: vits2_audio_samples.zip.tar.gz. I trained using d-vectors. Also, the duration discriminator is conditioned on the speaker, which I think should be an improvement over the original VITS2.

This is my example with VITS v1, German, single language: 580_hier-ist-eine-typisc.mp3.zip

eginhard commented 2 weeks ago

Overall it looks good already, thanks. Where possible, could you reuse existing functions and classes? E.g. discriminator.py looks unchanged from the original Vits implementation, so you can just import that. I'll also check that at the end, but you might already know well which parts are the same and which are different.
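For example, the unchanged discriminator could be imported instead of copied (module path as in the current dev branch; adjust if it has moved):

```python
# Reuse the existing VITS discriminator rather than duplicating discriminator.py.
from TTS.tts.layers.vits.discriminator import VitsDiscriminator

disc = VitsDiscriminator()
```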

Otherwise, I'll at least need a training recipe for LJSpeech and some basic tests - some were added here: https://github.com/coqui-ai/TTS/pull/3355/files

Marioando commented 2 weeks ago

Hi, I will add a recipe once I get good results from the model. For now, this prototype has an issue that has really slowed me down for some days. For VITS1 training, accelerate divides training time by 4. Unfortunately, I can't get it to work with this VITS2 implementation. Here is the error message I get:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1833, in fit
    self._fit()
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1785, in _fit
    self.train_epoch()
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1504, in train_epoch
    outputs, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1383, in train_step
    outputs, loss_dict_new, step_time = self.optimize(
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1251, in optimize
    grad_norm = self._compute_grad_norm(optimizer)
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1175, in _compute_grad_norm
    return torch.norm(torch.cat([param.grad.view(-1) for param in self.master_params(optimizer)], dim=0), p=2)
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1175, in <listcomp>
    return torch.norm(torch.cat([param.grad.view(-1) for param in self.master_params(optimizer)], dim=0), p=2)
AttributeError: 'NoneType' object has no attribute 'view'

What I suppose is that the gradients for some parameters are None when using accelerate. Training with trainer.distribute works fine but is 2 times slower than accelerate, with half the batch size accelerate allows. Any help would be greatly appreciated. Thank you!
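One way to test that hypothesis is to make the gradient-norm computation skip parameters whose gradient is None. A debugging sketch, monkeypatching the trainer method seen in the traceback above (this only confirms the diagnosis, it is not a fix for the unused-parameter issue itself):

```python
import torch
from trainer import Trainer


def _compute_grad_norm_skip_none(self, optimizer):
    # Same computation as the trainer's _compute_grad_norm (see traceback above),
    # but ignoring parameters whose .grad is None, e.g. submodules that did not
    # contribute to this optimizer's loss in the current step.
    grads = [p.grad.view(-1) for p in self.master_params(optimizer) if p.grad is not None]
    if not grads:
        return torch.tensor(0.0)
    return torch.norm(torch.cat(grads, dim=0), p=2)


# Debugging only: patch before calling trainer.fit() to see whether training
# proceeds under accelerate.
Trainer._compute_grad_norm = _compute_grad_norm_skip_none
```

If training then runs, the underlying cause is likely some VITS2 parameters never receiving gradients under a given optimizer (the same situation that makes distributed backends complain about unused parameters), and the proper fix would be in how the losses/optimizers are wired in the prototype.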