This decouples the size of the transformer from the number of bins used in the spectrogram. I believe this is consistent with the paper, based on these lines in Section 3.2:
> The architecture incorporated U-Net style skip connections, 24 layers, 16 attention heads, an embedding dimension of 1024, a linear layer dimension of 4096, and a dropout rate of 0.1. [...] We modeled the 100-dimensional log mel-filterbank features, [...]
It also makes it easier to scale the transformer's compute independently of the feature dimension.
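To illustrate the decoupling, here is a minimal PyTorch sketch of the idea: project the mel-filterbank frames up to the transformer's embedding dimension on the way in, and back down on the way out. The class and parameter names (`AudioProjection`, `dim_mel`, `dim_model`) are illustrative, not taken from the repo or the paper; the dimensions match the 100-bin / 1024-dim setup quoted above.

```python
import torch
import torch.nn as nn

class AudioProjection(nn.Module):
    """Maps mel-spectrogram frames to and from the transformer's embedding
    space, so the number of mel bins and the model width can be chosen
    independently. (Hypothetical sketch, not the repo's actual module.)"""

    def __init__(self, dim_mel: int = 100, dim_model: int = 1024):
        super().__init__()
        self.proj_in = nn.Linear(dim_mel, dim_model)   # mel bins -> model width
        self.proj_out = nn.Linear(dim_model, dim_mel)  # model width -> mel bins

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, dim_mel)
        x = self.proj_in(mel)       # (batch, time, dim_model)
        # ... transformer layers would operate on x here ...
        return self.proj_out(x)     # back to (batch, time, dim_mel)

# Scaling the transformer now only means changing dim_model;
# the spectrogram stays at 100 bins either way.
mel = torch.randn(2, 200, 100)      # (batch, frames, mel bins)
module = AudioProjection(dim_mel=100, dim_model=1024)
print(module(mel).shape)            # torch.Size([2, 200, 100])
```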