Add a trainer & demo notebook for the speech-only pretraining task

lucidrains / spear-tts-pytorch

Implementation of Spear-TTS - multi-speaker text-to-speech attention network, in Pytorch

MIT License

254 stars 19 forks

Add a trainer & demo notebook for the speech-only pretraining task #3

Closed: lucasnewman closed this 1 year ago

lucasnewman commented 1 year ago

I cribbed a bunch of this from SemanticTransformerTrainer in audiolm-pytorch and added a notebook to demonstrate that it works. On a subset of LibriTTS, the loss converges at the 60% mask rate used in the paper. I'm happy to make changes, just let me know!
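
To illustrate what a 60% mask rate means for a token sequence, here is a minimal sketch. It is not the code from this PR: the function `mask_tokens`, the `mask_id` value, and the choice to replace tokens with a mask id are all assumptions made for the example, and the actual trainer may corrupt the sequence differently (for instance by dropping tokens instead).

```python
import random


def mask_tokens(tokens, mask_rate=0.6, mask_id=-1, seed=None):
    """Independently replace each token with mask_id with probability mask_rate.

    Hypothetical helper for illustration only; not part of spear-tts-pytorch.
    Returns the corrupted sequence and a per-position boolean mask.
    """
    rng = random.Random(seed)
    # Draw one independent decision per position.
    is_masked = [rng.random() < mask_rate for _ in tokens]
    corrupted = [mask_id if m else t for t, m in zip(tokens, is_masked)]
    return corrupted, is_masked


# Stand-in for a sequence of semantic token ids (assumed, not real data).
tokens = list(range(1000))
corrupted, is_masked = mask_tokens(tokens, mask_rate=0.6, mask_id=-1, seed=0)

# The observed fraction of masked positions should be close to 0.6.
print(sum(is_masked) / len(tokens))
```

In a denoising pretraining setup of this kind, the model would typically receive `corrupted` as input and be trained to reconstruct `tokens`.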

lucidrains commented 1 year ago

Great job Lucas! I'll take a look later this week; I'm about to dive back into the TTS field in August and finish a bunch of repos.

lucidrains commented 1 year ago

@lucasnewman are you doing this for work? for a company in SF perhaps?

lucasnewman commented 1 year ago

> @lucasnewman are you doing this for work? for a company in SF perhaps?

Yep, it's part of an exploration I'm doing for work and also just advancing my understanding of the SOTA along the way.

lucidrains commented 1 year ago

@lucasnewman cool, maybe my dog and I will run into you :laughing: we live in the mission

lucidrains commented 1 year ago

@lucasnewman which company do you work for? just curious if it is yet another TTS company (been contacted by like 3 so far)

lucasnewman commented 1 year ago

> @lucasnewman which company do you work for? just curious if it is yet another TTS company (been contacted by like 3 so far)

Ha, not at all, I work for Future (you will probably be confused 😅). I'm over in Noe so not too far away!

lucidrains commented 1 year ago

@lucasnewman haha yea i am confused :laughing: you automating the personal trainer with some deep fake? nice! Vaswani lives in Noe Valley haha (great neighborhood)

lucasnewman commented 1 year ago

> @lucasnewman haha yea i am confused 😆 you automating the personal trainer with some deep fake? nice! Vaswani lives in Noe Valley haha (great neighborhood)

It's not really to replace the humans, but more around personalizing the other audio aspects of what we do — I find the vast majority of deep fakes still fully in the uncanny valley. I see Vaswani at La Lucha on Sanchez all the time, although I don't know him. Small world for sure!

lucasnewman commented 1 year ago

I'm going to close this one in favor of https://github.com/lucidrains/spear-tts-pytorch/pull/4, since that has all the changes here and more support for backtranslation. 🙏