This decouples the size of the transformer from the number of bins used in the spectrogram. I believe this is consistent with the paper, based on these lines in Section 3.2:
> The architecture incorporated U-Net style skip connections, 24 layers, 16 attention heads, an embedding dimension of 1024, a linear layer dimension of 4096, and a dropout rate of 0.1. [...] We modeled the 100-dimensional log mel-filterbank features, [...]
It also makes it easier to scale the transformer's compute independently of the feature dimension.
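To illustrate the decoupling, here is a minimal PyTorch sketch of the idea: project the mel-filterbank frames up to the transformer's embedding dimension on the way in, and back down on the way out. The class and parameter names (`AudioProjection`, `dim_mel`, `dim_model`) are illustrative, not taken from the repo or the paper; the dimensions match the 100-bin / 1024-dim setup quoted above.

```python
import torch
import torch.nn as nn

class AudioProjection(nn.Module):
    """Maps mel-spectrogram frames to and from the transformer's embedding
    space, so the number of mel bins and the model width can be chosen
    independently. (Hypothetical sketch, not the repo's actual module.)"""

    def __init__(self, dim_mel: int = 100, dim_model: int = 1024):
        super().__init__()
        self.proj_in = nn.Linear(dim_mel, dim_model)   # mel bins -> model width
        self.proj_out = nn.Linear(dim_model, dim_mel)  # model width -> mel bins

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, dim_mel)
        x = self.proj_in(mel)       # (batch, time, dim_model)
        # ... transformer layers would operate on x here ...
        return self.proj_out(x)     # back to (batch, time, dim_mel)

# Scaling the transformer now only means changing dim_model;
# the spectrogram stays at 100 bins either way.
mel = torch.randn(2, 200, 100)      # (batch, frames, mel bins)
module = AudioProjection(dim_mel=100, dim_model=1024)
print(module(mel).shape)            # torch.Size([2, 200, 100])
```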