lucidrains / e2-tts-pytorch

Implementation of E2-TTS, "Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS", in Pytorch

Project the input to the dimension of the transformer #11

Closed · lucasnewman closed 1 month ago

lucasnewman commented 1 month ago

This decouples the transformer's model dimension from the number of bins used in the spectrogram. I believe this is consistent with the paper, based on these lines in Section 3.2:

> The architecture incorporated U-Net style skip connections, 24 layers, 16 attention heads, an embedding dimension of 1024, a linear layer dimension of 4096, and a dropout rate of 0.1. [...] We modeled the 100-dimensional log mel-filterbank features, [...]

and it makes it easier to scale the transformer's capacity independently of the feature dimension.
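
For reference, a minimal sketch of the idea (not the actual diff in this PR; the module and argument names here are made up): a single linear layer maps the mel features up to the transformer's embedding dimension, so the two sizes can be chosen independently.

```python
import torch
from torch import nn

class MelToTransformerProjection(nn.Module):
    """Hypothetical sketch: project n_mels-dim log mel-filterbank frames
    (e.g. 100, per the paper) to the transformer's embedding dimension
    (e.g. 1024) before the first transformer layer."""

    def __init__(self, n_mels = 100, dim = 1024):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)

    def forward(self, mel):
        # mel: (batch, seq_len, n_mels) -> (batch, seq_len, dim)
        return self.proj(mel)

# usage
mel = torch.randn(2, 200, 100)            # batch of log mel frames
x = MelToTransformerProjection()(mel)     # (2, 200, 1024), ready for the transformer
```

With this projection in place, changing the spectrogram configuration (say, 80 vs. 100 mel bins) no longer forces a change to the transformer's width, and vice versa.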