Marioando opened 2 weeks ago
Hi, do you have an example or comparison between VITS and VITS2?
Cool, thank you! I'll give some comments next week. Could you add an example training recipe, e.g. based on https://github.com/idiap/coqui-ai-TTS/blob/dev/recipes/ljspeech/vits_tts/train_vits.py? And do you have some samples to share?
The model is still under training, but here are some samples: vits2_audio_samples.zip.tar.gz. I trained using d-vectors. Also, the duration discriminator is conditioned on the speaker, which I think should be an improvement over the original VITS2.
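Roughly, the conditioning looks like this; a minimal sketch with illustrative names and shapes, not the exact code in this PR:

```python
import torch
import torch.nn as nn

class DurationDiscriminatorSketch(nn.Module):
    """Illustrative duration discriminator that also sees a speaker embedding."""

    def __init__(self, in_channels, hidden_channels, spk_emb_dim=0):
        super().__init__()
        self.pre = nn.Conv1d(in_channels, hidden_channels, 3, padding=1)
        # The speaker conditioning: project the d-vector and add it to the
        # hidden features, broadcast over time.
        self.spk_proj = nn.Linear(spk_emb_dim, hidden_channels) if spk_emb_dim else None
        self.dur_proj = nn.Conv1d(1, hidden_channels, 1)
        self.post = nn.Conv1d(2 * hidden_channels, 1, 1)

    def forward(self, x, durations, spk_emb=None):
        # x: [B, C, T] text-encoder features, durations: [B, 1, T], spk_emb: [B, D]
        h = self.pre(x)
        if self.spk_proj is not None and spk_emb is not None:
            h = h + self.spk_proj(spk_emb).unsqueeze(-1)
        d = self.dur_proj(durations)
        return self.post(torch.cat([h, d], dim=1))  # [B, 1, T] real/fake logits
```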
This is my example from VITS v1, German, single language: 580_hier-ist-eine-typisc.mp3.zip
Overall it looks good already, thanks. Where possible, could you reuse existing functions and classes? E.g. discriminator.py looks unchanged from the original VITS implementation, so you can just import that. I'll also check that at the end, but you probably already know which parts are the same and which are different.
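If it is indeed unchanged, reusing it should be a one-line import; the module path below is from the current dev branch layout, so adjust it if the layers ever move:

```python
# Reuse the unchanged VITS discriminator instead of duplicating the file.
from TTS.tts.layers.vits.discriminator import VitsDiscriminator
```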
Otherwise I'll at least need a training recipe for LJSpeech and some basic tests; some were added here: https://github.com/coqui-ai/TTS/pull/3355/files
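For reference, the VITS2 recipe could mirror the linked train_vits.py almost line for line. Here is a rough sketch; Vits2Config and Vits2 are placeholder names for whatever config and model classes this PR ends up adding:

```python
import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

# Placeholder imports: replace with the actual classes added by this PR.
from TTS.tts.configs.vits2_config import Vits2Config
from TTS.tts.models.vits2 import Vits2

output_path = os.path.dirname(os.path.abspath(__file__))

dataset_config = BaseDatasetConfig(
    formatter="ljspeech",
    meta_file_train="metadata.csv",
    path=os.path.join(output_path, "../LJSpeech-1.1/"),
)

config = Vits2Config(
    batch_size=32,
    eval_batch_size=16,
    run_name="vits2_ljspeech",
    text_cleaner="english_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    output_path=output_path,
    datasets=[dataset_config],
)

# Audio processor and tokenizer are initialized from the config, as in train_vits.py.
ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)

train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

model = Vits2(config, ap, tokenizer, speaker_manager=None)

trainer = Trainer(
    TrainerArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```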
Hi, I will add the recipe once I get good results from the model. For now this prototype has an issue that has been slowing me down for some days: for VITS1 training, accelerate divides training time by 4, but unfortunately I can't get it to work with this VITS2 implementation. Here is the error message I get:
```
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1833, in fit
    self._fit()
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1785, in _fit
    self.train_epoch()
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1504, in train_epoch
    outputs, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1383, in train_step
    outputs, loss_dict_new, step_time = self.optimize(
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1251, in optimize
    grad_norm = self._compute_grad_norm(optimizer)
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1175, in _compute_grad_norm
    return torch.norm(torch.cat([param.grad.view(-1) for param in self.master_params(optimizer)], dim=0), p=2)
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1175, in <listcomp>
```
What I suppose is that the gradients for some parameters are None when using accelerate. Training with trainer.distribute works fine, but it is two times slower than accelerate with half of accelerate's batch size. Any kind of help would be greatly appreciated. Thank you!
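If it helps confirm that, a quick check right after loss.backward() (before the optimizer step) can list which parameters never receive a gradient, and the failing norm computation can be guarded in the meantime; model below stands for whichever module the failing optimizer covers:

```python
import torch

# 1) Diagnostic: list parameters whose .grad is still None after backward().
def report_missing_grads(model):
    for name, param in model.named_parameters():
        if param.requires_grad and param.grad is None:
            print(f"no gradient for: {name}")

# 2) Stopgap mirroring trainer.py:1175 from the traceback: skip None grads
#    instead of crashing. This hides the symptom, not the cause.
def safe_grad_norm(params):
    grads = [p.grad.view(-1) for p in params if p.grad is not None]
    return torch.norm(torch.cat(grads, dim=0), p=2) if grads else torch.tensor(0.0)
```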
Hi, here is my prototype for VITS2; the text encoder is not conditioned on the speaker.
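For context, the VITS2 paper does condition the text encoder on the speaker in the multi-speaker setting; if that gets added later, it could look roughly like this (a sketch with illustrative names, not code from this PR):

```python
import torch
import torch.nn as nn

class TextEncoderSketch(nn.Module):
    """Illustrative text encoder with optional global speaker conditioning."""

    def __init__(self, n_vocab, hidden_channels, spk_emb_dim=0):
        super().__init__()
        self.emb = nn.Embedding(n_vocab, hidden_channels)
        # Project the speaker embedding and add it to every token embedding,
        # roughly how VITS2 injects speaker identity into the encoder.
        self.spk_proj = nn.Linear(spk_emb_dim, hidden_channels) if spk_emb_dim else None
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_channels, nhead=2, batch_first=True),
            num_layers=6,
        )

    def forward(self, tokens, spk_emb=None):
        # tokens: [B, T] token ids, spk_emb: [B, spk_emb_dim]
        x = self.emb(tokens)
        if self.spk_proj is not None and spk_emb is not None:
            x = x + self.spk_proj(spk_emb).unsqueeze(1)  # broadcast over time
        return self.encoder(x)
```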