lifeiteng / vall-e

PyTorch implementation of VALL-E(Zero-Shot Text-To-Speech), Reproduced Demo https://lifeiteng.github.io/valle/index.html
Apache License 2.0

After 100 epochs training, the model can synthesize natural speech on LibriTTS #58

Open dohe0342 opened 1 year ago

dohe0342 commented 1 year ago

I trained VALL-E on LibriTTS for about 100 epochs (it took almost 4 days on 8 A100 GPUs) and obtained plausible synthesized audio.

Here is a demo. [1] prompt : prompt_link synthesized audio : synt_link

[2] prompt : prompt_link ground truth : gt_link synthesized audio : synt_link

[3] prompt : prompt_link synthesized audio : synt_link

[4] prompt : prompt_link ground truth : gt_link synthesized audio : synt_link

The model I trained has worse quality than the original VALL-E because of the smaller dataset. However, it shows promising quality on clean audio. I'm not sure whether I can share my pre-trained LibriTTS model; if I can, I'd like to.

hdmjdp commented 1 year ago

@dohe0342 Did you use prefix mode 0? And can you share your config?

dohe0342 commented 1 year ago

@hdmjdp

I don't understand "prefix". What does it mean?

Here is the shell script I ran. I just changed "num-epochs", "max-duration" and "world-size":

./run.sh --stage 4 --stop-stage 4 --max-duration 50 --filter-max-duration 14 --num-decoder-layers 12 --world-size 8 --num-epochs 100

hdmjdp commented 1 year ago

@dohe0342

It's the mode controlling how the VALL-E NAR decoder is prefixed: 0 = no prefix, 1 = frame 0 to a random frame, 2 = random frame to random frame.
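For illustration, the three prefix modes above can be sketched as follows. This is a minimal, hypothetical sketch of how a prefix span over n audio-token frames might be chosen, not the repo's actual implementation:

```python
import random

def prefix_span(n, prefix_mode):
    """Return a (start, end) frame span used as the NAR prefix.

    Hypothetical sketch of the three --prefix-mode options:
      0: no prefix at all
      1: from frame 0 to a random frame
      2: from a random frame to a later random frame
    """
    if prefix_mode == 0:
        return 0, 0
    if prefix_mode == 1:
        return 0, random.randint(1, n)
    if prefix_mode == 2:
        start = random.randint(0, n - 1)
        return start, random.randint(start + 1, n)
    raise ValueError(f"unknown prefix mode: {prefix_mode}")
```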

hdmjdp commented 1 year ago

@dohe0342 can you share your tensorboard image?

dohe0342 commented 1 year ago

@hdmjdp

I ran last week's version of vall-e, which has no prefix option. As far as I can tell, prefix mode 0 is the same as that version.

Here is my tensorboard image. I actually ran 177 epochs, but the 100-epoch checkpoint was used to generate the audio.

image

I'll upload the full tensorboard log soon. Please wait.

dohe0342 commented 1 year ago

@hdmjdp

Here is my tensorboard log. tensorboard

thangnvkcn commented 1 year ago

@dohe0342 Can you share the pre-trained LibriTTS model for me, if possible please send it to me at thangmta30@gmail.com

hdmjdp commented 1 year ago

@hdmjdp Can you share the pre-trained LibriTTS model for me, if possible please send it to me at thangmta30@gmail.com

Not me

hdmjdp commented 1 year ago

@hdmjdp

Here is my tensorboard log. tensorboard

Thanks. Are the prompt speakers of your demo wavs in your training data?

dohe0342 commented 1 year ago

@hdmjdp

The prompt speakers are from test-clean, not the training data.

lifeiteng commented 1 year ago

@dohe0342 Thank you for sharing this.

shanhaidexiamo commented 1 year ago

Is this based on the latest commit? Thanks

dohe0342 commented 1 year ago

It's based on last week's commit, not the latest one. Thank you.

liuxun666 commented 1 year ago

mark

jieen1 commented 1 year ago

@dohe0342 can you share this model for me? wangjiashejieen@gmail.com here is my email. Thanks.

LorenzoBrugioni commented 1 year ago

Hey @dohe0342 , great work! Would you think it could be possible to share the pre-trained model? 🙏🏻🙏🏻🙏🏻
Just in case, here's my email : lori.brugio@gmail.com

UncleSens commented 1 year ago

Thank you for your contribution @dohe0342! In case it's possible to share the model, would you please send it to me? Here is my email: senqiu37@gmail.com

Zhang-Xiaoyi commented 1 year ago

@dohe0342 Very nice results! Can you share your trained model if it is possible? my email is zhangxiaoyi1127@gmail.com

lqj01 commented 1 year ago

@dohe0342 Very nice results! Can you share your trained model if it is possible? my email is liqianjin2018@gmail.com

WendongGan commented 1 year ago

@dohe0342 I'm interested in your pre-trained model. Can you share it with me? Thank you! My email is: 15982350806@163.com

yiwei0730 commented 1 year ago

I'm very interested in your pre-trained model. The results are amazing; could you share the pre-trained model with me? I would really appreciate it. My email: yiwei110181@gmail.com

hackerxiaobai commented 1 year ago

Very nice results! Can you share your trained model if it is possible? my email is wl_9322@163.com

hardik7 commented 1 year ago

@dohe0342 Interesting results! Could you please try synthesizing audio from a cartoon character's audio prompt, something like this: https://drive.google.com/file/d/11NDZzopniwIFJa8dr4hAKp2md8cxel4w/view?usp=sharing Curious to know how VALL-E's output would sound with non-human voices. Thanks!

dohe0342 commented 1 year ago

@thangnvkcn @jieen1 @LorenzoBrugioni @UncleSens @Zhang-Xiaoyi @lqj01 @UESTCgan @yiwei0730 @hackerxiaobai

Sorry for the late reply. This is the model I trained. Google Drive link: link

Run inference with a command like this: python bin/infer.py --output-dir ./ --model-name valle --norm-first true --add-prenet false --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --text-prompts "KNOT one point one five miles per hour." --audio-prompts ./prompts/8463_294825_000043_000000.wav --text "To get up and running quickly just follow the steps below." --checkpoint exp/epoch-100.pt

@hardik7 I shared my pre-trained model, so you can try synthesizing the cartoon audio. But I trained my model on LibriTTS, which consists of about 550 hours of human audiobook speech, while the original VALL-E was trained on Libri-Light, which has 60k hours of audio.

So my pre-trained model can't really synthesize cartoon audio, given the lack of cartoon training data and the smaller dataset overall.

Zhang-Xiaoyi commented 1 year ago


Thanks for sharing. I have trained a model using the same config as yours. I just checked the checkpoint at 30 epochs and it produces quite good results. I will compare it with your checkpoint at 100 epochs.

hardik7 commented 1 year ago


Thank you @dohe0342. I'll do some experiments with non-human voices and train my own model on a relevant dataset.

OnceJune commented 1 year ago

@dohe0342 Hi, have you evaluated the inference speed? What's the RTF when generating audio? And how is the correctness of the pronunciation?

bprimal22 commented 1 year ago

@dohe0342 Is it possible to train on top of your trained model?

lifeiteng commented 1 year ago


@dohe0342 It should be --prefix-mode 1. Can you test --prefix-mode 1 on the stage branch? #59

zhouyong64 commented 1 year ago

@dohe0342 Could you share the file "unique_text_tokens.k2symbols"? It's needed for inference.

cwjacklin commented 1 year ago

"unique_text_tokens.k2symbols" is in valle\egs\libritts\data\tokenized

dustinjoe commented 1 year ago


Running into the same issue when running inference with the model. Thanks.

dustinjoe commented 1 year ago

Sorry to interrupt. When trying the checkpoint above, I run into the following error:

Traceback (most recent call last):
  File "/balboa/users/xingyu/vall-e/infer.py", line 281, in <module>
    main()
  File "/work/users/c86420b/.conda/envs/faster_whisper_py39/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/balboa/users/xingyu/vall-e/infer.py", line 149, in main
    missing_keys, unexpected_keys = model.load_state_dict(
  File "/work/users/c86420b/.conda/envs/faster_whisper_py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for VALLE:
        Missing key(s) in state_dict: "ar_text_embedding.word_embeddings.weight", "nar_text_embedding.word_embeddings.weight", "ar_audio_embedding.word_embeddings.weight", "nar_audio_embeddings.0.word_embeddings.weight", "nar_audio_embeddings.1.word_embeddings.weight", "nar_audio_embeddings.2.word_embeddings.weight", "nar_audio_embeddings.3.word_embeddings.weight", "nar_audio_embeddings.4.word_embeddings.weight", "nar_audio_embeddings.5.word_embeddings.weight", "nar_audio_embeddings.6.word_embeddings.weight", "nar_audio_embeddings.7.word_embeddings.weight", "ar_text_position.alpha", "ar_audio_position.alpha", "nar_text_position.alpha", "nar_audio_position.alpha", "ar_predict_layer.weight", "nar_predict_layers.0.weight", "nar_predict_layers.1.weight", "nar_predict_layers.2.weight", "nar_predict_layers.3.weight", "nar_predict_layers.4.weight", "nar_predict_layers.5.weight", "nar_predict_layers.6.weight", "nar_stage_embeddings.0.word_embeddings.weight", "nar_stage_embeddings.1.word_embeddings.weight", "nar_stage_embeddings.2.word_embeddings.weight", "nar_stage_embeddings.3.word_embeddings.weight", "nar_stage_embeddings.4.word_embeddings.weight", "nar_stage_embeddings.5.word_embeddings.weight", "nar_stage_embeddings.6.word_embeddings.weight". 
        Unexpected key(s) in state_dict: "text_embedding.word_embeddings.weight", "ar_embedding.word_embeddings.weight", "nar_embeddings.0.word_embeddings.weight", "nar_embeddings.1.word_embeddings.weight", "nar_embeddings.2.word_embeddings.weight", "nar_embeddings.3.word_embeddings.weight", "nar_embeddings.4.word_embeddings.weight", "nar_embeddings.5.word_embeddings.weight", "nar_embeddings.6.word_embeddings.weight", "nar_embeddings.7.word_embeddings.weight", "text_position.alpha", "audio_positions.0.alpha", "audio_positions.1.alpha", "audio_positions.2.alpha", "audio_positions.3.alpha", "audio_positions.4.alpha", "audio_positions.5.alpha", "audio_positions.6.alpha", "audio_positions.7.alpha", "stage_embeddings.0.word_embeddings.weight", "stage_embeddings.1.word_embeddings.weight", "stage_embeddings.2.word_embeddings.weight", "stage_embeddings.3.word_embeddings.weight", "stage_embeddings.4.word_embeddings.weight", "stage_embeddings.5.word_embeddings.weight", "stage_embeddings.6.word_embeddings.weight", "stage_embeddings.7.word_embeddings.weight", "predict_layers.0.weight", "predict_layers.1.weight", "predict_layers.2.weight", "predict_layers.3.weight", "predict_layers.4.weight", "predict_layers.5.weight", "predict_layers.6.weight", "predict_layers.7.weight". 

Any suggestions? Thanks.
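The missing/unexpected key lists suggest the checkpoint was saved with older parameter names than the current VALLE class expects. As a sketch of what a key-remapping workaround could look like — the mapping below is an assumption inferred purely from comparing the two key lists (positional-embedding keys are deliberately left unhandled), and checking out the matching commit is the safer route:

```python
import re

# Hypothetical old-name -> new-name prefixes, guessed from the error message.
# This is NOT a verified mapping for the repo; position-embedding keys
# ("text_position", "audio_positions.*") are not handled here at all.
RENAMES = [
    (r"^ar_embedding\.", "ar_audio_embedding."),
    (r"^nar_embeddings\.", "nar_audio_embeddings."),
    (r"^stage_embeddings\.", "nar_stage_embeddings."),
    (r"^predict_layers\.", "nar_predict_layers."),
]

def remap_keys(state_dict):
    """Return a copy of state_dict with old-style keys renamed."""
    out = {}
    for key, value in state_dict.items():
        if key.startswith("text_embedding."):
            # The shared text embedding appears to have been split into
            # AR and NAR copies, so duplicate it under both new names.
            suffix = key[len("text_embedding."):]
            out["ar_text_embedding." + suffix] = value
            out["nar_text_embedding." + suffix] = value
            continue
        for pattern, repl in RENAMES:
            key = re.sub(pattern, repl, key)
        out[key] = value
    return out
```

One would then call model.load_state_dict(remap_keys(checkpoint["model"]), strict=False) and inspect the remaining missing keys.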

nate-gillman commented 1 year ago

@dohe0342: Howdy! Thanks for sharing your weights. A question: from your tensorboard, it looks like the model started overfitting about a third of the way through training. Did you still see improvements in the TTS synthesis after the validation curve stopped improving?

catalwaysright commented 1 year ago

Did you find a solution? I got the same error.

codehappy-net commented 1 year ago

I received similar errors and had to install earlier versions of Python and torch to resolve them. It's just Python ML dependency hell; breaking API changes occur from version to version for little real reason. The versions of the big dependencies installed in my working VALL-E conda environment are:

python: 3.8.0 torch: 1.13.1+cu116 numpy: 1.22.4

no-Seaweed commented 1 year ago


Do you mean you switched your dependencies to torch 1.13.1+cu116 etc., and that solved the missing-key problem?

catalwaysright commented 1 year ago


I resolved this error by switching the repo version to v0.1.0. Inference now runs successfully, but the output is nothing but noise. Could you @dohe0342 please share the code snapshot you used to train this model? Thanks in advance.

no-Seaweed commented 1 year ago


Finally, I found the version of the code corresponding to the checkpoint and was able to produce correct output. See commit b83653a1a2d756e80d26858a00101e26df656b86.

eschmidbauer commented 1 year ago

I cannot run inference with the pretrained model provided; I get the following error:

RuntimeError: Error(s) in loading state_dict for VALLE:
    Missing key(s) in state_dict: "ar_text_embedding.word_embeddings.weight", "nar_text_embedding.word_embeddings.weight", "ar_audio_embedding.word_embeddings.weight", "ar_text_position.alpha", "ar_audio_position.alpha", "ar_predict_layer.weight", "nar_audio_embeddings.0.word_embeddings.weight", "nar_audio_embeddings.1.word_embeddings.weight", "nar_audio_embeddings.2.word_embeddings.weight", "nar_audio_embeddings.3.word_embeddings.weight", "nar_audio_embeddings.4.word_embeddings.weight", "nar_audio_embeddings.5.word_embeddings.weight", "nar_audio_embeddings.6.word_embeddings.weight", "nar_audio_embeddings.7.word_embeddings.weight", "nar_text_position.alpha", "nar_audio_position.alpha", "nar_predict_layers.0.weight", "nar_predict_layers.1.weight", "nar_predict_layers.2.weight", "nar_predict_layers.3.weight", "nar_predict_layers.4.weight", "nar_predict_layers.5.weight", "nar_predict_layers.6.weight", "nar_stage_embeddings.0.word_embeddings.weight", "nar_stage_embeddings.1.word_embeddings.weight", "nar_stage_embeddings.2.word_embeddings.weight", "nar_stage_embeddings.3.word_embeddings.weight", "nar_stage_embeddings.4.word_embeddings.weight", "nar_stage_embeddings.5.word_embeddings.weight", "nar_stage_embeddings.6.word_embeddings.weight".
    Unexpected key(s) in state_dict: "text_embedding.word_embeddings.weight", "ar_embedding.word_embeddings.weight", "nar_embeddings.0.word_embeddings.weight", "nar_embeddings.1.word_embeddings.weight", "nar_embeddings.2.word_embeddings.weight", "nar_embeddings.3.word_embeddings.weight", "nar_embeddings.4.word_embeddings.weight", "nar_embeddings.5.word_embeddings.weight", "nar_embeddings.6.word_embeddings.weight", "nar_embeddings.7.word_embeddings.weight", "text_position.alpha", "audio_positions.0.alpha", "audio_positions.1.alpha", "audio_positions.2.alpha", "audio_positions.3.alpha", "audio_positions.4.alpha", "audio_positions.5.alpha", "audio_positions.6.alpha", "audio_positions.7.alpha", "stage_embeddings.0.word_embeddings.weight", "stage_embeddings.1.word_embeddings.weight", "stage_embeddings.2.word_embeddings.weight", "stage_embeddings.3.word_embeddings.weight", "stage_embeddings.4.word_embeddings.weight", "stage_embeddings.5.word_embeddings.weight", "stage_embeddings.6.word_embeddings.weight", "stage_embeddings.7.word_embeddings.weight", "predict_layers.0.weight", "predict_layers.1.weight", "predict_layers.2.weight", "predict_layers.3.weight", "predict_layers.4.weight", "predict_layers.5.weight", "predict_layers.6.weight", "predict_layers.7.weight".
RuntimeRacer commented 1 year ago

@eschmidbauer you'll need to checkout the commit referenced in the comment https://github.com/lifeiteng/vall-e/issues/58#issuecomment-1519036360 right above yours.

eschmidbauer commented 1 year ago

Thanks! I'll give that a try. I was actually able to continue training with the model, though.

etwk commented 1 year ago

Thanks for sharing the model.

I tried the checkpoint. It can generate audio in a similar style when the text prompt and output text are both short, but it produces the warning below when either is longer:

WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1).

What's the proper way to generate longer audio?

debasishaimonk commented 1 year ago

@dohe0342 How many distinct speakers did you train on for the model you shared?

cantabile-kwok commented 1 year ago


Not only does this warning persist, but the generated audio also misses, babbles, or repeats a lot of words when the text is relatively long. Is it because 100 epochs on LibriTTS are still not enough for the model to learn well?

chenjiasheng commented 1 year ago

I think the warning comes from the third-party tokenizer and is safe to ignore. @lifeiteng, can you confirm this?

About the poor performance on long text, I guess it's because most of the utterances in the training dataset are short, so the model has never seen long text.

To alleviate it, I suggest fine-tuning the model on longer utterances, such as Libri-Light. Looking forward to your experiment results if you have time to do it.


cantabile-kwok commented 1 year ago

So, did you encounter this long-waveform issue as well, @chenjiasheng? My test sentences actually come from the LibriTTS test set, which should have the same length distribution as the training set. I found that, using the checkpoint released in this thread, the model can hardly generate 100% correct speech longer than roughly 12 seconds. Technically, that length shouldn't usually be considered "very long". Does this mean that most of the data longer than 12 seconds was somehow dropped when training this model?

lifeiteng commented 1 year ago

@cantabile-kwok more info about words count mismatch https://github.com/lifeiteng/vall-e/issues/5

chenjiasheng commented 1 year ago

So sorry that I missed your reply; I hope it's not too late. By default, audio longer than 14 seconds is filtered out of training, because long audio is very RAM-inefficient. You can try changing the argument named something like --filter-max-duration.
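The filtering described above can be sketched as follows. This is a minimal illustration, not the repo's actual code (the real pipeline applies the --filter-max-duration flag, shown in the training command earlier in this thread, to its data cuts):

```python
# Minimal sketch of the duration filter: utterances above the threshold
# never reach training, so the model sees no long examples.
def filter_by_duration(utterances, max_duration=14.0):
    """Keep only utterances whose duration is at most max_duration seconds."""
    return [u for u in utterances if u["duration"] <= max_duration]

# Example: with the default 14 s cap, a 20 s utterance is dropped.
kept = filter_by_duration([{"duration": 5.0}, {"duration": 20.0}])
```

This would explain why the released checkpoint struggles beyond roughly 12 seconds: nothing near or above 14 seconds was ever seen during training.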


temporaryharry commented 1 year ago

Could anyone send me "unique_text_tokens.k2symbols"? Without running training, it is not present in valle\egs\libritts\data\tokenized. My email is grandmaskisses342@gmail.com