Open sverdoot opened 6 months ago
In the original paper the authors suggest adding positional encodings to the speech and text representations before the transformer block. I noticed that in your code the positional encodings are commented out. Have you tried training the model with positional encodings and, if so, is there any difference in performance?

I changed the implementation slightly there. The authors use an encoder-only transformer, so they needed to add different positional embeddings for the text and the speech, while I use a full encoder-decoder model (which internally uses a separate set of positional embeddings). Doing it anyway is fine; the results depend on your training data and other factors. It's a matter of preference, so I left it as a comment for anyone who wants to use it.
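For anyone who wants to try the commented-out path, here is a minimal numpy sketch of the idea from the paper: a separate learned positional table per modality, added to the representations before they are concatenated for an encoder-only transformer block. All names (`text_pos`, `speech_pos`, `add_positional`) are hypothetical and not from this repo.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
max_text_len, max_speech_len = 16, 32

# Hypothetical learned positional tables, one per modality,
# standing in for trainable nn.Embedding weights.
text_pos = rng.normal(size=(max_text_len, d_model))
speech_pos = rng.normal(size=(max_speech_len, d_model))

def add_positional(x, table):
    """Add per-position embeddings to a (seq_len, d_model) array."""
    seq_len = x.shape[0]
    return x + table[:seq_len]

# Dummy text and speech representations.
text_repr = rng.normal(size=(10, d_model))
speech_repr = rng.normal(size=(20, d_model))

text_in = add_positional(text_repr, text_pos)
speech_in = add_positional(speech_repr, speech_pos)

# Concatenate along the sequence axis for a single
# encoder-only transformer to attend over both modalities.
joint_in = np.concatenate([text_in, speech_in], axis=0)
```

In an encoder-decoder model the two sequences are fed to separate stacks that already carry their own positional information, which is why the extra per-modality tables become optional there.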