open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License

[Help]: Is it possible to finetune a pretrained VALL-E model instead of training from scratch? #243

Closed wjddd closed 1 month ago

jiaqili3 commented 1 month ago

Hi @wjddd, yes, it is possible to finetune. I'll post instructions in this thread after I re-examine the code, but in brief, there are some 'resume' args in the training script that let you do this: 1. pass the 'resume_type' arg as 'finetune'; 2. pass the checkpoint dir to it.

The above are just very brief and I'll provide more info later, thanks!

jiaqili3 commented 1 month ago

To finetune the model, you can add some args to the "accelerate launch" call in the training scripts. First create a folder for your pretrained model weights; the weights file inside it must be named "pytorch_model.bin".

Please try the following command (modify it from the original shell script; not thoroughly tested yet):

accelerate launch --main_process_port $port "${work_dir}"/bins/tts/train.py --config $exp_config \
--exp_name $exp_name --log_level debug $1 --resume --resume_type finetune --resume_from_ckpt_path "/path/to/checkpoint/folder"
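The folder-preparation step can be sketched as follows; the folder name below is a placeholder of my choosing, but the requirement that the weights file be named pytorch_model.bin comes from the thread (the `touch` line stands in for actually downloading the pretrained weights):

```shell
# Sketch: prepare a checkpoint folder for --resume_type finetune.
# Amphion loads the weights from a file named pytorch_model.bin inside
# the folder passed to --resume_from_ckpt_path.
touch valle_ar_mls_196000.bin   # placeholder for the downloaded pretrained weights
ckpt_dir=ckpts/valle_pretrained  # hypothetical folder name
mkdir -p "$ckpt_dir"
cp valle_ar_mls_196000.bin "$ckpt_dir/pytorch_model.bin"
```

Then pass `"$ckpt_dir"` as the value of `--resume_from_ckpt_path`.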
wjddd commented 1 month ago

I'll give it a try, thx~

LiangTing1 commented 1 month ago

Hello, I recently trained VALL-E 2 from scratch on the libritts_360 data. The loss has dropped to the e-4 range, but at inference time the AR model only outputs a single token, so inference produces no result. Have you run into this problem?

jiaqili3 commented 1 month ago

Hi, a loss that low seems unusual; you might be facing an overfitting problem, which could cause the issue you mention. I would suggest checking your dataset: make sure you use the full LibriTTS training set, and make sure the model actually goes over the whole training set (the progress bar shows this; training should not finish too quickly). In our experiments with the MLS dataset we haven't met similar problems yet. Welcome to share your progress.

LiangTing1 commented 1 month ago

I then tried the MLS dataset: 10,808,037 opus files in total, filtered to 3-15 s as in your script. After training for half an epoch the loss is down to 0.95 (the initial loss was 7.5). My setup is 6 A800 GPUs, batch size 8, gradient_accumulation_step 4. Also, to finetune from the valle_ar_mls_196000.bin you provided, do I also need the corresponding optimizer state?

jiaqili3 commented 1 month ago

Just uploaded the AR optimizer and random_states_0.pkl:
https://huggingface.co/amphion/valle/blob/main/optimizer_valle_ar_mls_196000.bin
https://huggingface.co/amphion/valle/blob/main/random_states_0.pkl

I think your model training is going fine. One note: MLS only has 10-20 s audios, so our filtering for MLS is 10-20 s.

LiangTing1 commented 1 month ago

Glad to get your reply. Following your suggestions, MLS now trains into audible speech. But I recently changed the g2p in the code: I use paddlespeech to convert the text into a phoneme sequence first, e.g. " | AY1 | W AH0 L | R IH0 T ER1 N | W IH1 TH | Y UW1 | AH0 V | K AO1 R S | ", and then map phones to ids directly in the dataset. The only difference from yours is that I removed the "B I E" position tags, but with the same data the trained model is sometimes unintelligible and sometimes reads the prompt_transcript text instead. Can removing "B I E" really have that big an impact? I also trained the AR model on wenetspeech4tts, with input sequences like " | n i3 | i4 | d ian3 | c uo4 | sh iii1 | b u2 | iong4 | m a5 | ", and the output is mostly unintelligible. Looking forward to your reply.
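For context, the "B I E" suffixes discussed in this thread tag each phoneme with its position within the word (begin/inside/end). A rough illustrative sketch, not Amphion's actual g2p_processor.py implementation (the handling of single-phoneme words here is my assumption):

```python
def tag_word(phones):
    """Suffix each phoneme of one word with its position: B(egin), I(nside), E(nd)."""
    if len(phones) == 1:
        return [phones[0] + "B"]  # assumption: lone phonemes get the B tag
    return ([phones[0] + "B"]
            + [p + "I" for p in phones[1:-1]]
            + [phones[-1] + "E"])

# e.g. the word "course" -> K AO1 R S
print(tag_word(["K", "AO1", "R", "S"]))  # ['KB', 'AO1I', 'RI', 'SE']
```

Dropping these suffixes shrinks the phone vocabulary roughly threefold, which is why the vocabulary-size settings need to be adjusted to match.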

jiaqili3 commented 1 month ago

本转成g2p序列,如“ | AY1 | W AH0 L | R IH0 T ER1 N | W IH1 TH | Y UW1 | AH0 V | K AO1 R S | ”,然后在da

Hi, could you specify what "B I E" is? BTW, for Chinese we haven't tested the SpeechTokenizer codec model, but we have working results with EnCodec (set cfg.use_speechtokenizer=False to use it). Also test your g2p carefully and make sure the vocabulary-size settings match. Note that both the AR and NAR models need to be retrained. gradient_accumulation_step=1 will work.

LiangTing1 commented 1 month ago

Hello, and thank you for the reply.
1. About "B I E": it appears in the PHONE_SET list at line 19 of models/tts/valle_v2/g2p_processor.py and in the process function at line 294. Your code seems to tag each phoneme with "B/I/E" according to its position within the word.
2. Using the MLS English corpus, I only replaced the g2p, which amounts to removing "B I E" from the PHONE_SET list: for example, "AA0B", "AA0E", "AA0I" become just "AA0". With this simplified g2p, the trained model is sometimes unintelligible and sometimes reads the prompt_transcript text.
3. By "vocabulary size setting match", you mean this part of the JSON config, right: "model": { "phone_vocab_size": 300, "target_vocab_size": 1024, "pad_token_id": 1324, "bos_target_id": 1325, "eos_target_id": 1326, "bos_phone_id": 1327, "eos_phone_id": 1328, "bos_prompt_id": 1329, "eos_prompt_id": 1330, "num_hidden_layers": 16 }? If my phone_vocab_size is 100, I should change it to "model": { "phone_vocab_size": 100, "target_vocab_size": 1024, "pad_token_id": 1124, "bos_target_id": 1125, "eos_target_id": 1126, "bos_phone_id": 1127, "eos_phone_id": 1128, "bos_prompt_id": 1129, "eos_prompt_id": 1130, "num_hidden_layers": 16 }. Is my understanding correct? Should phone_vocab_size stay close to the actual number of phonemes? Does anything else in the code need to change?
4. You mentioned training a Chinese model; could you share the phone_vocab_size used by its g2p?
5. Also, after removing the "B I E" tags from your code, the training results are still poor. Any suggestions? Sorry for the many questions, and thank you very much.
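The two configs quoted in this thread are consistent with a layout where phone ids come first, acoustic-token ids occupy the next target_vocab_size slots, and the special tokens follow consecutively. A sketch of that apparent layout (my inference from the numbers, so verify against the actual Amphion config code before relying on it):

```python
def special_token_ids(phone_vocab_size, target_vocab_size=1024):
    """Derive the special-token ids that appear after the phone and
    acoustic-token ranges, assuming they are allocated consecutively."""
    base = phone_vocab_size + target_vocab_size
    names = ["pad_token_id", "bos_target_id", "eos_target_id",
             "bos_phone_id", "eos_phone_id", "bos_prompt_id", "eos_prompt_id"]
    return {name: base + i for i, name in enumerate(names)}

print(special_token_ids(300))  # pad_token_id 1324 ... eos_prompt_id 1330
print(special_token_ids(100))  # pad_token_id 1124 ... eos_prompt_id 1130
```

Both outputs match the id blocks quoted above, which suggests the proposed phone_vocab_size=100 config is internally consistent.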

xinkez commented 1 month ago

Would it be convenient to connect on WeChat?

Hi @jiaqili3 , Do you have any thoughts on the above question? Thank you in advance.

jiaqili3 commented 1 month ago

Hi, I think changing the g2p should not have a very large impact; other code changes may be affecting the performance, so I recommend a careful check. If you have questions, please feel free to raise GitHub issues or reach me through the email in my profile, but a quick response is not guaranteed. Thanks for understanding!



jiaqili3 commented 1 month ago

Hi @LiangTing1, we're planning to improve the model and support Chinese; you're welcome to contribute together.