lifeiteng / vall-e

PyTorch implementation of VALL-E(Zero-Shot Text-To-Speech), Reproduced Demo https://lifeiteng.github.io/valle/index.html
Apache License 2.0
1.99k stars, 320 forks

Failed during inference [SyntaxError: well trained model shouldn't reach here.] #170

Open kin0303 opened 10 months ago

kin0303 commented 10 months ago

I get an error like this:

2023-10-19 10:10:09,510 INFO [infer.py:224] synthesize text: Selamat pagi
2023-10-19 10:10:09,513 WARNING [words_mismatch.py:88] words count mismatch on 500.0% of the lines (5/1)
2023-10-19 10:10:09,516 WARNING [words_mismatch.py:88] words count mismatch on 400.0% of the lines (4/1)
Traceback (most recent call last):
  File "bin/infer.py", line 282, in <module>
    main()
  File "/media/de3fd1ee-a8c4-4153-9cf5-d642327ff6d0/TTS/valle/valle_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "bin/infer.py", line 251, in main
    encoded_frames = model.inference(
  File "/media/de3fd1ee-a8c4-4153-9cf5-d642327ff6d0/TTS/valle/vall-e/valle/models/valle.py", line 1050, in inference
    raise SyntaxError(
SyntaxError: well trained model shouldn't reach here.

How can I solve this? I have completed both AR and NAR training, following the instructions here: https://github.com/lifeiteng/vall-e#:~:text=LibriTTS%20demo%20Trained%20on%20one%20GPU%20with%2024G%20memory

zero-or-one commented 10 months ago

It means that the AR model could not predict the EOS token, which implies that it was not trained well. Do you know if this happens with other examples? Also, does the loss curve of the AR training look OK?
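To illustrate the failure mode described above, here is a minimal sketch of an autoregressive decoding loop with an EOS check and a length budget. It is not the code from `valle/models/valle.py`; the names `decode_until_eos`, `stops_after_three`, `never_stops`, and the `EOS_ID` value are all hypothetical, and the toy "models" are plain Python functions. The point is only that a guard of this kind fires when the model never emits EOS within the allowed number of steps, which is presumably what the error in the traceback reports.

```python
from typing import Callable, List

EOS_ID = 1024  # hypothetical EOS token id, chosen for illustration only


def decode_until_eos(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int,
    eos_id: int = EOS_ID,
) -> List[int]:
    """Generate tokens one at a time until EOS appears or the budget runs out."""
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(prompt + generated)
        if token == eos_id:
            # Normal termination: the model decided the utterance is complete.
            return generated
        generated.append(token)
    # The budget was exhausted without EOS. A model that never learned
    # when to stop ends up here.
    raise RuntimeError(f"no EOS after {max_new_tokens} tokens")


def stops_after_three(context: List[int]) -> int:
    # Toy stand-in for a well-trained model: emits EOS once the context
    # (prompt plus generated tokens) reaches five tokens.
    return EOS_ID if len(context) >= 5 else 7


def never_stops(context: List[int]) -> int:
    # Toy stand-in for an under-trained model: never emits EOS.
    return 7
```

Calling `decode_until_eos(stops_after_three, [0, 0], max_new_tokens=10)` returns the three generated tokens, while the same call with `never_stops` raises. If the real model behaves like the second case on every input, the AR training run is the first thing to inspect.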

thelinhbkhn2014 commented 8 months ago

I have the same problem. When I tested with a short audio prompt (3 s or 4 s), the output was still good. In other cases, however, the model either failed or produced bad results. Could you help me fix it?