p0p4k / pflowtts_pytorch

Unofficial implementation of NVIDIA P-Flow TTS paper
https://neurips.cc/virtual/2023/poster/69899
MIT License

Positional encoding #16

Open sverdoot opened 6 months ago

sverdoot commented 6 months ago

In the original paper, the authors suggest adding positional encodings to the speech and text representations before the transformer block. I noticed that in your code the positional encodings are commented out. Have you tried training the model with positional encodings and, if so, is there any difference in performance?
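For reference, the paper's suggestion amounts to something like the sketch below: a standard sinusoidal encoding added to both representations before the shared transformer block. The module name and dimensions here are illustrative assumptions, not the repo's actual commented-out code.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    """Fixed sinusoidal positional encoding (Vaswani et al., 2017)."""

    def __init__(self, d_model: int, max_len: int = 4000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, : x.size(1)]

# Illustrative shapes only.
pe = SinusoidalPE(d_model=192)
text_emb = pe(torch.randn(2, 120, 192))    # text representation
speech_emb = pe(torch.randn(2, 300, 192))  # speech-prompt representation
# Both would then be fed into the transformer block.
```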

p0p4k commented 6 months ago

I changed the implementation slightly there. The authors use an encoder-only transformer, so they needed to add separate positional embeddings for the text and the speech prompt, whereas I use a full encoder-decoder model (which internally uses a different set of positional embeddings for each side). Adding them anyway is fine; the results depend on your training data and other factors. It's just a matter of preference, so I left the code commented out for anyone who wants to use it.
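As a rough illustration of that distinction, here is a minimal sketch in which each side of a seq2seq transformer carries its own learned positional table, so the speech and text streams are positioned independently. The module names, dimensions, and the use of `nn.Transformer` are assumptions for illustration, not the actual pflowtts_pytorch code.

```python
import torch
import torch.nn as nn

class LearnedPE(nn.Module):
    """Learned absolute positional embeddings, one table per stream."""

    def __init__(self, d_model: int, max_len: int = 2000):
        super().__init__()
        self.table = nn.Embedding(max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.table(positions)

d_model = 192
enc_pe = LearnedPE(d_model)  # positions for the speech-prompt (encoder) side
dec_pe = LearnedPE(d_model)  # positions for the text (decoder) side

transformer = nn.Transformer(d_model=d_model, nhead=4, batch_first=True)
speech_emb = enc_pe(torch.randn(2, 300, d_model))
text_emb = dec_pe(torch.randn(2, 120, d_model))
out = transformer(src=speech_emb, tgt=text_emb)  # (2, 120, 192)
```

Because the two streams never share one index space here, the model does not need a single concatenated sequence with stream-specific offsets, which is the main reason the extra positional encodings became optional in this variant.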