scutcsq / Neural-Transducers-for-Two-Stage-Text-to-Speech-via-Semantic-Token-Prediction

Unofficial pytorch reproduction for the paper "Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction" (arXiv:2401.01498)

Any samples from current training? #1

Open rishikksh20 opened 4 months ago

rishikksh20 commented 4 months ago

Hi @scutcsq, I saw your repo and have been tracking your training. Have you been able to generate good-quality speech? Please share some samples or a pretrained model if possible. Thanks for the code.

scutcsq commented 4 months ago

Hi @rishikksh20, I am still facing the problem and working to fix it. Once it is fixed, I will upload samples.

rishikksh20 commented 4 months ago

Hi, what kind of problem are you facing? I might be able to help.

scutcsq commented 4 months ago

Thank you very much! I found it difficult to generate comprehensible speech from the transducer model. I have verified that the T2S model works normally, so I suspect the transducer model did not converge well. Whether I use k2.rnnt_loss_simple or k2.rnnt_loss_smoothed, the model converges slowly, and even after 3 or 4 days of training the average loss per sample is still high (around 100). I'm not sure whether this is normal. I haven't trained a transducer before, so any advice on training transducer models would be greatly appreciated!
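One thing worth checking when judging a "loss of around 100 per sample": RNN-T losses are typically summed over each utterance's lattice, so the raw per-utterance number scales with sequence length. A minimal sketch, assuming the reported value is a summed per-utterance loss (the frame counts below are illustrative, not taken from the repo):

```python
# Hedged sketch: interpreting a summed RNN-T loss per utterance.
# Assumption (not from the repo): the loss is summed over the utterance,
# so it should be normalized by length before comparing across runs.

def per_frame_loss(summed_loss: float, num_frames: int) -> float:
    """Average per-frame loss from a per-utterance summed loss."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return summed_loss / num_frames

# A summed loss of ~100 over a 400-frame utterance is 0.25 nats/frame,
# while the same 100 over 10 frames would signal poor convergence.
print(per_frame_loss(100.0, 400))  # 0.25
print(per_frame_loss(100.0, 10))   # 10.0
```

In other words, whether ~100 per sample is alarming depends on how long the semantic-token sequences are; comparing the length-normalized value over training is usually more informative than the raw sum.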