I trained a Chinese model, and when synthesizing long speech, the effect may deteriorate, even with pronunciation errors. Why is this？

lifeiteng / vall-e

PyTorch implementation of VALL-E(Zero-Shot Text-To-Speech), Reproduced Demo https://lifeiteng.github.io/valle/index.html

https://lifeiteng.github.io/valle/index.html

Apache License 2.0

1.99k stars 320 forks source link

Closed zhiCharon closed 1 year ago

lifeiteng commented 1 year ago

inference shoud be consist with training.