jasonppy / VoiceCraft

Zero-Shot Speech Editing and Text-to-Speech in the Wild

more training details of the TTS enhanced models #111

Open zjlww opened 5 months ago

zjlww commented 5 months ago

Hi, thank you for open-sourcing your excellent work. ❤️

I would like to compare with VoiceCraft as a baseline for my research. I have observed that you have released three TTS enhanced models. I am curious about the training datasets used for all these models. Can I utilize them to evaluate zero-shot TTS models?

jasonppy commented 5 months ago

Thanks! 830M TTS enhanced and 330M TTS enhanced (to be uploaded) are trained on GigaSpeech + LibriLight. I recommend using 830M TTS enhanced for evaluation.

rlenain commented 5 months ago

Hi @jasonppy -- I'm curious, if you can spare the details, how exactly did you train the TTS enhanced model compared to the base model? Is it a separate training script? Separate loss? Or simply separate data?

Thanks a lot.

jasonppy commented 5 months ago

The TTS enhanced models are trained without the first rearrangement step introduced in the paper (i.e. no masking).
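
To make the difference concrete, here is a minimal sketch of what "with" and "without" the first rearrangement step could look like on a toy sequence. It is an illustration only, not the training code from the repo: the function names, the string tokens, and the `<M0>`-style mask placeholders are all assumptions, and the real model operates on multi-codebook codec tokens rather than a flat list.

```python
from typing import List, Sequence, Tuple


def rearrange_with_causal_masking(
    tokens: Sequence[str], spans: Sequence[Tuple[int, int]]
) -> List[str]:
    """Toy version of the masking rearrangement (hypothetical helper).

    Each non-overlapping [start, end) span is replaced in place by a mask
    placeholder, and the span's tokens are appended at the end of the
    sequence, preceded by the same placeholder.
    """
    kept: List[str] = []   # unmasked context with placeholders
    moved: List[str] = []  # masked spans, relocated to the end
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        mask = f"<M{i}>"  # assumed placeholder naming
        kept.extend(tokens[cursor:start])
        kept.append(mask)
        moved.append(mask)
        moved.extend(tokens[start:end])
        cursor = end
    kept.extend(tokens[cursor:])
    return kept + moved


def rearrange_without_masking(tokens: Sequence[str]) -> List[str]:
    """The "no masking" case: the sequence is left in its original order."""
    return list(tokens)


toy = ["a", "b", "c", "d", "e"]
masked = rearrange_with_causal_masking(toy, [(1, 3)])
unmasked = rearrange_without_masking(toy)
```

In this toy setting the masked variant has to predict the relocated span after seeing context on both sides of it, while the unmasked variant is plain left-to-right continuation, which is closer to how zero-shot TTS is used at inference time.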

rlenain commented 5 months ago

Thanks!

rlenain commented 5 months ago

Sorry, actually there is something that I don't understand: is the TTS enhanced model trained from scratch as such, or simply finetuned with that specific objective (i.e. no masking) from the base 830m model? Is there a specific script / recipe that exists in the repo to train/finetune like you trained the TTS enhanced model?

Thanks a lot!

jasonppy commented 5 months ago

They are finetuned from the giga830M/giga330M models, which were trained with causal masking. The scripts have not been uploaded to the repo yet.

Approximetal commented 4 months ago

I tested the TTS enhanced models, both 330M and 830M. Sometimes the output keeps repeating for too long, or the model fails to pronounce short words. Maybe we could add some rules for deciding when to stop predicting, or add ASR post-processing to check whether the pronunciation is correct. Attachment: test_sample.zip
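
As a rough sketch of the two ideas above (a stopping rule and an ASR check), something like the following could be tried. Nothing here comes from the VoiceCraft repo: `tail_is_looping` and `word_error_rate` are hypothetical helpers, the thresholds are arbitrary, and the ASR transcript is assumed to come from whatever recognizer is available.

```python
from typing import List, Sequence


def tail_is_looping(tokens: Sequence[int], n: int = 4, max_repeats: int = 3) -> bool:
    """Return True if the last n * max_repeats tokens are one n-gram repeated.

    Intended as a cheap stop condition during generation (assumed heuristic).
    """
    need = n * max_repeats
    if len(tokens) < need:
        return False
    tail = list(tokens[-need:])
    first = tail[:n]
    return all(tail[i:i + n] == first for i in range(0, need, n))


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)


# Example: a dropped short word shows up as a non-zero WER.
target_text = "go to the park"
asr_text = "go the park"  # assumed output of an external ASR system
wer = word_error_rate(target_text, asr_text)
looping = tail_is_looping([1, 2, 3, 4, 5, 4, 5, 4, 5], n=2, max_repeats=3)
```

A sample could then be regenerated when `looping` is true or when `wer` exceeds a chosen threshold. Both cutoffs would need tuning on real outputs.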