lifeiteng / vall-e

PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech). Reproduced demo: https://lifeiteng.github.io/valle/index.html
Apache License 2.0

Training result #158

Open yiwei0730 opened 1 year ago

yiwei0730 commented 1 year ago

I'd like to ask about my training results. I combined AISHELL3, aidata, and a Chinese dataset, about 600 hours of training audio in total. The three datasets are not sampled at 24000 Hz, so I set cut_set = cut_set.resample(24000) at line 184 of bin/tokenizer.py; the audio should therefore have been converted to 24000 Hz. I followed the instructions in the documentation and trained with --prefix-mode 1.
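As a quick sanity check on the resampling step, the arithmetic can be sketched as follows. Resampling should leave each clip's duration unchanged and scale its sample count by 24000 / original rate. The helper names and the 16 kHz example clip are hypothetical, chosen only for illustration; this does not call lhotse or the repository's code.

```python
# Minimal sketch: how sample counts should change when resampling to 24 kHz.
# Function names and the example clip are illustrative assumptions.

TARGET_SR = 24000  # target rate passed to cut_set.resample(24000)


def resampled_num_samples(num_samples: int, orig_sr: int, target_sr: int = TARGET_SR) -> int:
    """Approximate number of samples after resampling (duration is preserved)."""
    return round(num_samples * target_sr / orig_sr)


def duration_seconds(num_samples: int, sr: int) -> float:
    """Duration of a clip with the given sample count and sampling rate."""
    return num_samples / sr


# Hypothetical 10-second clip recorded at 16 kHz.
orig_sr = 16000
orig_samples = 10 * orig_sr

new_samples = resampled_num_samples(orig_samples, orig_sr)
new_duration = duration_seconds(new_samples, TARGET_SR)

print(f"{orig_samples} samples @ {orig_sr} Hz -> {new_samples} samples @ {TARGET_SR} Hz")
print(f"duration after resampling: {new_duration} s")
```

If the loaded audio does not have roughly this many samples for its duration, the resampling probably did not take effect.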

python3 bin/trainer.py --world-size 2 --max-duration 80 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 1 \
    --num-buckets 6 --dtype "bfloat16" --save-every-n 10000 --valid-interval 20000 \
    --model-name valle --share-embedding true --norm-first true --add-prenet false \
    --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
    --base-lr 0.05 --warmup-steps 200 --average-period 0 \
    --num-epochs 20 --start-epoch 1 --start-batch 0 --accumulate-grad-steps 4 \
    --exp-dir ${exp_dir}
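For reference, the flags above imply an upper bound on how much audio contributes to each optimizer update. The sketch below assumes that --max-duration caps the total seconds of audio in one batch on each GPU and that --accumulate-grad-steps accumulates that many batches before an update; both are assumptions about the trainer, and the function name is hypothetical. The stage-2 command later in this post uses --max-duration 40 with the same --world-size and --accumulate-grad-steps.

```python
# Minimal sketch: upper bound on seconds of audio per optimizer update.
# Assumes max_duration is a per-GPU, per-batch cap in seconds (an assumption).


def audio_seconds_per_update(world_size: int, max_duration: float, accumulate_grad_steps: int) -> float:
    """GPUs x per-GPU batch cap (seconds) x accumulated batches."""
    return world_size * max_duration * accumulate_grad_steps


# Stage 1 (AR) flags from the command above.
ar_seconds = audio_seconds_per_update(world_size=2, max_duration=80, accumulate_grad_steps=4)

# Stage 2 (NAR) flags from the second command in this post.
nar_seconds = audio_seconds_per_update(world_size=2, max_duration=40, accumulate_grad_steps=4)

print(f"AR stage:  at most {ar_seconds} s of audio per update")
print(f"NAR stage: at most {nar_seconds} s of audio per update")
```

Real batches are usually somewhat smaller than this bound, because bucketed batches rarely fill the duration cap exactly.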

Train NAR model

cp ${exp_dir}/best-valid-loss.pt ${exp_dir}/epoch-2.pt  # --start-epoch 3=2+1

python3 bin/trainer.py --world-size 2 --max-duration 40 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 2 \
    --num-buckets 6 --dtype "float32" --save-every-n 10000 --valid-interval 20000 \
    --model-name valle --share-embedding true --norm-first true --add-prenet false \
    --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
    --base-lr 0.05 --warmup-steps 200 --average-period 0 \
    --num-epochs 40 --start-epoch 3 --start-batch 0 --accumulate-grad-steps 4 \
    --exp-dir ${exp_dir}

However, when I synthesize speech with the trained model on unseen data, the following problems occur:

  1. Often the latter part of the prompt appears at the beginning of the synthesized speech.
  2. Synthesizing long sentences leads to repeated or skipped segments in the latter part of the synthesis.

Is there any way to improve these situations?

lifeiteng commented 12 months ago
  1. Try --prefix-mode 2 or 4; --prefix-mode 1 is known to have this problem.
  2. This is probably a common weakness of AR models in general.
decajcd commented 2 months ago

How should other datasets be preprocessed?