lucidrains / spear-tts-pytorch

Implementation of Spear-TTS - multi-speaker text-to-speech attention network, in PyTorch
MIT License

Add a trainer for the text-to-semantic model #5

lucasnewman closed this 1 year ago

lucasnewman commented 1 year ago

This adds a trainer for the "Stage 1" text-to-semantic model mentioned in the paper (which we'll also need downstream for SoundStorm). I didn't freeze the text embedding, since the paper mentioned:

> More specifically, when finetuning on ground-truth parallel data (as an ablation), we freeze both the upper layers of the encoder and the entire decoder, while updating the weights of the encoder embeddings and the lower layers.

And given that the text embedding is maintained separately, I think it's effectively just the initialized/untrained weights from the pretrained model at the start, so training the embedding makes sense to me intuitively as well (I could be thinking about this wrong, though!).
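
For reference, a minimal sketch of that freezing scheme in PyTorch (the `token_emb` and `encoder.layers` attribute names are assumptions for illustration, not the actual module layout):

```python
from torch import nn

def freeze_for_finetune(model: nn.Module, num_lower_layers: int = 2):
    # freeze everything (upper encoder layers + the entire decoder stay frozen)
    for p in model.parameters():
        p.requires_grad = False

    # unfreeze the encoder token embeddings
    for p in model.token_emb.parameters():
        p.requires_grad = True

    # unfreeze the lower encoder layers, per the ablation quoted above
    for layer in model.encoder.layers[:num_lower_layers]:
        for p in layer.parameters():
            p.requires_grad = True
```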

lucidrains commented 1 year ago

oh man, you brought it home

💯 🚀

code is perfect!

lucidrains commented 1 year ago

aside from some logic for managing the generated pseudo-labelled dataset, spear-tts is almost done with this PR 🙏

lucasnewman commented 1 year ago

> aside from some logic for managing the generated pseudo-labelled dataset, spear-tts is almost done with this PR 🙏

If you have any thoughts on the storage / data management for the back translation, I'm open to it. Locally I've just been running the preprocessing steps and writing torch tensors out to disk in the cached dataset storage, but if that seems messy we could potentially write to another user-specified path?
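
As a concrete illustration of that local approach, a minimal sketch (the cache path, dummy tensors, and `src`/`tgt` keys are all illustrative):

```python
import torch
from pathlib import Path

# hypothetical cache location and dummy pseudo-labelled pairs, for illustration only
cache_dir = Path('./cache/back_translation')
cache_dir.mkdir(parents = True, exist_ok = True)

pairs = [(torch.randint(0, 256, (128,)), torch.randint(0, 512, (1024,)))]

# one .pt file per utterance, holding the (text ids, semantic ids) pair
for idx, (text_ids, semantic_ids) in enumerate(pairs):
    torch.save(dict(src = text_ids, tgt = semantic_ids), cache_dir / f'{idx}.pt')
```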

lucidrains commented 1 year ago

> > aside from some logic for managing the generated pseudo-labelled dataset, spear-tts is almost done with this PR 🙏
>
> If you have any thoughts on the storage / data management for the back translation, I'm open to it. Locally I've just been running the preprocessing steps and writing torch tensors out to disk in the cached dataset storage, but if that seems messy we could potentially write to another user-specified path?

oops, missed this! yea, i was thinking of the researcher just specifying a path to some folder, and it starts writing the files out memmapped

then one can also have a Dataset class that just accepts the folder path and automatically returns the src and tgt sequences (something like the sketch below)
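
A rough sketch of that folder-of-memmaps idea (the file names, fixed max lengths, and class name here are assumptions, not the final API):

```python
import numpy as np
import torch
from pathlib import Path
from torch.utils.data import Dataset

MAX_SRC_LEN = 256     # assumed fixed (padded) lengths, so the memmaps stay rectangular
MAX_TGT_LEN = 1024

# the writer side would preallocate with mode = 'w+' and fill row by row, e.g.
# np.memmap(folder / 'src.memmap', dtype = np.int64, mode = 'w+', shape = (num_rows, MAX_SRC_LEN))

class PseudoLabelledDataset(Dataset):
    def __init__(self, folder):
        folder = Path(folder)
        # total size is inferred from the file, then reshaped into fixed-length rows
        self.src = np.memmap(folder / 'src.memmap', dtype = np.int64, mode = 'r').reshape(-1, MAX_SRC_LEN)
        self.tgt = np.memmap(folder / 'tgt.memmap', dtype = np.int64, mode = 'r').reshape(-1, MAX_TGT_LEN)

    def __len__(self):
        return self.src.shape[0]

    def __getitem__(self, idx):
        # .copy() detaches the row from the memmap before handing it to torch
        return torch.from_numpy(self.src[idx].copy()), torch.from_numpy(self.tgt[idx].copy())
```

Padded fixed-length rows keep the memmaps rectangular; variable-length sequences would need an extra lengths file or offset index alongside.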

i think i can take care of this! you've done enough and i should buy you lunch or dinner some time