A non-autoregressive end-to-end text-to-speech model (text-to-wav) supporting a family of SOTA unsupervised duration modeling methods. This project grows with the research community, aiming to achieve the ultimate E2E-TTS.
Question about Differentiable Duration Modeling #4
Hello, I'm trying to implement the Differentiable Duration Modeling (DDM) module introduced in "Differentiable Duration Modeling for End-to-End Text-to-Speech".
I opened this issue to get advice on implementing DDM.
My implementation of the Differentiable Alignment Encoder produces an attention-like matrix from noise input, but DDM training is far too slow (about 10 s/iter); it appears to stall in the backward pass.
Can anyone give me advice on speeding up the recursive tensor operations? Should I use cuda.jit, as Soft-DTW does, or is there something wrong with the approach itself?
The module's output from noise input and the code are shown below.
Thank you.
[Image: module output — L, Q, S, soft attention, and duration panels]
Code
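As one possible direction for the speed question: a Python-level recursion over frames usually forces autograd to build (and backprop through) one graph node per step, which is a common cause of multi-second iterations. A vectorized construction avoids that. The sketch below is NOT the paper's exact DDM formulation — it is a hedged NumPy illustration of one common differentiable duration-to-alignment construction (cumulative durations plus sigmoid edges), where every step is a cumsum or broadcast that an autograd framework can differentiate without per-frame loops. The function name `soft_alignment`, the frame-center offset, and the `temp` parameter are all my own assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_alignment(durations, n_frames, temp=1.0):
    """Vectorized soft alignment from per-token durations (illustrative sketch).

    A[i, t] is close to 1 when frame t falls inside token i's duration span.
    Only cumsum and broadcast ops are used, so an autograd framework could
    backprop through this without a Python-level recursion over frames.
    """
    ends = np.cumsum(durations)        # cumulative end position of each token
    starts = ends - durations          # start position of each token
    t = np.arange(n_frames) + 0.5      # frame centers
    # Broadcast (tokens, 1) against (frames,) -> (tokens, frames)
    left = sigmoid((t[None, :] - starts[:, None]) / temp)
    right = sigmoid((t[None, :] - ends[:, None]) / temp)
    # The difference of the two sigmoid edges is a soft "inside the span" gate;
    # summing over tokens telescopes, so each frame column sums to roughly 1.
    return left - right

A = soft_alignment(np.array([2.0, 3.0, 1.0]), n_frames=6, temp=0.1)
```

If the recursion is genuinely irreducible (as in Soft-DTW's dynamic program), then a fused custom kernel with an explicit hand-written backward pass, e.g. via Numba's cuda.jit or a custom autograd Function, is the usual remedy; but it is worth first checking whether the alignment can be expressed in closed, vectorized form as above.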