lifeiteng / vall-e

PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech), reproduced demo: https://lifeiteng.github.io/valle/index.html
Apache License 2.0

After 100 epochs training, the model can synthesize natural speech on LibriTTS #58

Open dohe0342 opened 1 year ago

dohe0342 commented 1 year ago

I trained vall-e on LibriTTS for about 100 epochs (it took almost 4 days on 8 A100 GPUs) and obtained plausible synthesized audio.

Here is a demo.

[1] prompt: prompt_link, synthesized audio: synt_link

[2] prompt: prompt_link, ground truth: gt_link, synthesized audio: synt_link

[3] prompt: prompt_link, synthesized audio: synt_link

[4] prompt: prompt_link, ground truth: gt_link, synthesized audio: synt_link

The model I trained has worse quality than the original VALL-E because of the smaller dataset. However, it shows promising quality on clean audio. I'm not sure whether I can share my pre-trained LibriTTS model; if I can, I would like to share it.

sjoon2455 commented 1 year ago

Is that 100 epochs for the AR and NAR models each? The code has changed since then, so I was wondering :) I have reproduced the training, but it seems to perform a bit differently (and mine took about 1.5 days to train 100 epochs each, on 8 A100 GPUs!).

JonathanColetti commented 1 year ago

@sjoon2455 can you share your tensorboard?

KeiKinn commented 1 year ago

Many of us encountered the missing-keys problem when loading the pretrained model. If anyone wants to use the pretrained model provided by @dohe0342, the main trick is to check out the right commit (or any commit with the same VALL-E model definition) and then reinstall valle with pip uninstall valle; pip install -e . This matters because, when a new model is initialized, Python uses the valle package installed in the environment rather than the source code.
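A quick way to confirm which copy of the package Python is actually importing (a minimal sketch, not part of the repo; it assumes valle is installed as a regular package) is to print its path, which should point into your source checkout after the editable install:

```python
import valle

# After `pip uninstall valle; pip install -e .` inside the checked-out commit,
# this path should point into your local source tree, not into site-packages.
print(valle.__file__)
```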

raikarsagar commented 1 year ago

@dohe0342 Thanks for sharing the pretrained model trained for 100 epochs. When we say 100 epochs, is it 100 each for AR and NAR, or a combined number where we start with an AR model (probably 50 epochs)? Please clarify. I have trained a model for 100 epochs, but the quality isn't as good as what you shared here at the beginning.

Thanks in advance, Sagar

nathanodle commented 1 year ago

Have you or has anyone else done further training? Also, which Libri dataset (size) was it? Thanks!

RoyandZoe commented 1 year ago

@dohe0342 I'm interested in your pretrained model. Could you share it with me? Thank you! My email is xlwj_sd@163.com.

Abdulk084 commented 1 year ago

How does epoch-100.pt work with the inference code provided in this repo, given that ar.pt and nar.pt are needed?
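One way to see what a single epoch-100.pt actually contains, and how it relates to the separate ar.pt and nar.pt files the inference code expects, is to inspect its top-level keys (a minimal sketch assuming a standard PyTorch training checkpoint; the key names are assumptions, not confirmed by the repo):

```python
import torch

# Load the shared checkpoint on CPU and list its top-level entries.
ckpt = torch.load("exp/epoch-100.pt", map_location="cpu")
print(list(ckpt.keys()))

# For a full training checkpoint, the weights usually sit under a "model" key;
# the parameter names there show which stages (AR / NAR) the file covers.
state = ckpt.get("model", ckpt)
for name in list(state.keys())[:20]:
    print(name)
```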

AI-ctrl commented 1 year ago

@dohe0342 I'm interested in your pretrained model. Could you share it with me? Thank you! My email is ajeet9698@gmail.com.

And will it work if I want to train it on a specific set of voices, say 10 or 15 people's voices?

liuyuhualilith commented 8 months ago

@thangnvkcn @jieen1 @LorenzoBrugioni @UncleSens @Zhang-Xiaoyi @lqj01 @UESTCgan @yiwei0730 @hackerxiaobai

Sorry for the late reply. This is the model that I trained. Google Drive link: link

Infer with a command like this: python bin/infer.py --output-dir ./ --model-name valle --norm-first true --add-prenet false --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --text-prompts "KNOT one point one five miles per hour." --audio-prompts ./prompts/8463_294825_000043_000000.wav --text "To get up and running quickly just follow the steps below." --checkpoint exp/epoch-100.pt

@hardik7 I shared my pre-trained model, so you can try synthesizing the cartoon audio. But I trained my model using LibriTTS, which is composed of 550 hours of human audiobooks, while the original VALL-E was trained on LibriLight, which has 60k hours of audio.

So my pre-trained model cannot synthesize cartoon audio, due to the lack of a cartoon training set and the limited amount of data.

Hello! I am interested in your pretrained model. The pretrained weights you posted seem to be invalid. Can you share your pretrained model with me? Thank you!

RafaelJCruz commented 7 months ago

@thangnvkcn @jieen1 @LorenzoBrugioni @UncleSens @Zhang-Xiaoyi @lqj01 @UESTCgan @yiwei0730 @hackerxiaobai

Sorry for the late reply. This is the model that I trained. Google Drive link: link

Infer with a command like this: python bin/infer.py --output-dir ./ --model-name valle --norm-first true --add-prenet false --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --text-prompts "KNOT one point one five miles per hour." --audio-prompts ./prompts/8463_294825_000043_000000.wav --text "To get up and running quickly just follow the steps below." --checkpoint exp/epoch-100.pt

@hardik7 I shared my pre-trained model, so you can try synthesizing the cartoon audio. But I trained my model using LibriTTS, which is composed of 550 hours of human audiobooks, while the original VALL-E was trained on LibriLight, which has 60k hours of audio.

So my pre-trained model cannot synthesize cartoon audio, due to the lack of a cartoon training set and the limited amount of data.

Thanks for sharing; however, this Google link has already expired. Could you upload a new version? Thanks a lot!

cad-audio commented 5 months ago

@dohe0342, could you please share the pre-trained model for VALL-E? The Google link has expired. If possible, please also share the training script that you used.

Thanks

zero-or-one commented 4 months ago

Hi,

I trained Vall-E on the LibriTTS dataset, but my model did not converge well. I am sharing the final checkpoint (checkpoint link) and training curves. Feel free to suggest further improvements.

[training curves: train, train2]

Quick example: prompt: prompt_link, synthesized audio: synt_link

RafaelJCruz commented 4 months ago

Dear SA:

Really appreciate that you re-uploaded your pretrained model and data. Thanks a lot!

Best wishes, Rafael J.

SA @.***> wrote on Thu, Jul 4, 2024, at 14:09:

Hi,

I trained Vall-E on the LibriTTS dataset, but my model did not converge well. I am sharing the final checkpoint (checkpoint link: https://drive.google.com/file/d/1DoaFjl6iJy4U2qrxVp0Z0QBPJ6lgVQQ0/view?usp=sharing) and training curves. Feel free to suggest further improvements.

train.png: https://github.com/lifeiteng/vall-e/assets/48153370/d71d0235-fba7-475a-b8b4-a15ea3b3d7e6 train2.png: https://github.com/lifeiteng/vall-e/assets/48153370/bbdcc87b-42a8-4b6b-8015-87c5c4e78a64

Quick example: prompt: https://drive.google.com/file/d/12NfYKrnTZpqj_v7ain39KVtepYNLyjRe/view?usp=sharing synthesized audio: https://drive.google.com/file/d/1NfySUeibqhA6RJDirrGmV1c_zOagg7vK/view?usp=sharing


oush7 commented 3 months ago

@hdmjdp

I ran last week's vall-e version, which has no prefix option, and I found that prefix 0 is the same as last week's version.

Here is my tensorboard image. I actually ran 177 epochs, but the 100-epoch checkpoint was used to generate the audio samples. [tensorboard image]

I'll upload the tensorboard image soon. Please wait.

Hi, what kind of loss reduction is plotted in the tensorboard graphs? The default value is reduction="sum", but the loss looks very small for a sum reduction.
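For context on the reduction question, here is an illustrative sketch (not the repo's training code; shapes are made up) of how much reduction="sum" and reduction="mean" differ for the same logits:

```python
import torch
import torch.nn.functional as F

# Dummy shapes for illustration only: 8 tokens over a 1024-way codebook.
logits = torch.randn(8, 1024)
targets = torch.randint(0, 1024, (8,))

loss_sum = F.cross_entropy(logits, targets, reduction="sum")
loss_mean = F.cross_entropy(logits, targets, reduction="mean")

# The summed loss is larger than the mean by roughly the number of tokens,
# so a very small plotted value is more consistent with a mean-style reduction.
print(loss_sum.item(), loss_mean.item())
```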

seokilee0412 commented 3 months ago

Hi,

I trained Vall-E on the LibriTTS dataset, but my model did not converge well. I am sharing the final checkpoint (checkpoint link) and training curves. Feel free to suggest further improvements.

[training curves: train, train2]

Quick example: prompt: prompt_link, synthesized audio: synt_link

Hi, thank you for sharing the checkpoint. But I think it is corrupted; I can't load the checkpoint you shared.

zero-or-one commented 2 months ago

Hi, it's trained with the default parameters. I don't know why the loss dropped so low.

@hdmjdp I ran last week's vall-e version, which has no prefix option, and I found that prefix 0 is the same as last week's version. Here is my tensorboard image. I actually ran 177 epochs, but the 100-epoch checkpoint was used to generate the audio samples. [tensorboard image] I'll upload the tensorboard image soon. Please wait.

Hi, what kind of loss reduction is plotted in the tensorboard graphs? The default value is reduction="sum", but the loss looks very small for a sum reduction.

zero-or-one commented 2 months ago

Hi, I trained Vall-E on the LibriTTS dataset, but my model did not converge well. I am sharing the final checkpoint (checkpoint link) and training curves. Feel free to suggest further improvements. [training curves: train, train2] Quick example: prompt: prompt_link, synthesized audio: synt_link

Hi, thank you for sharing the checkpoint. But I think it is corrupted; I can't load the checkpoint you shared.

Hi, sorry for the corrupted checkpoint. I compressed it recently to save space and something went wrong. I have changed the link in the original post; I hope it works now.