myshell-ai / MeloTTS

High-quality multi-lingual text-to-speech library by MyShell.ai. Supports English, Spanish, French, Chinese, Japanese, and Korean.
MIT License

Different Model training question #160

Open yiwei0730 opened 2 months ago

yiwei0730 commented 2 months ago

I saw that you have released training checkpoints for a variety of languages. Could you please tell me how to download the base checkpoint and use it to train a model for a new language?

jeremy110 commented 2 months ago

When you start training, the base model will be downloaded automatically. If you want to train a new language, you can refer to https://github.com/myshell-ai/MeloTTS/issues/120

yiwei0730 commented 2 months ago

Yes, I saw #120 about adding Thai. I tested this myself: for example, starting from the base checkpoint I was unable to train Korean, and even after merging the Korean checkpoint with Base-G.pth, training still failed. In the end, the only likely problem I found was emb_g, i.e. the training of the phoneme embedding. I'm a little curious how to train this part better for a new language.

Colinsnow1 commented 2 months ago

When you start training, the base model will be downloaded automatically. If you want to train a new language, you can refer to #120

Hello! Have you successfully trained a model for Thai? May I ask which base model you used, and could you share your training logs and the final model? Any feedback would be greatly appreciated 🙏

jeremy110 commented 2 months ago

Doesn't this model already support Korean?

Based on my experience, if you want to train a Korean language model with your own data, you can take a pre-trained Korean model and then train it with about 6 to 7 hours of data for nearly 300 epochs.

Moreover, if the trained speech is unintelligible, the issue usually occurs in processing phones, tones, or word2ph.
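As a rough illustration of those checks (the function and argument names are illustrative, not the MeloTTS API), a preprocessed utterance can be validated before it goes into train.list:

```python
# Minimal consistency check for one preprocessed utterance, based on the
# invariants mentioned above: one tone per phone, and word2ph covering
# every phone exactly once. Not MeloTTS code.

def check_utterance(phones, tones, word2ph):
    problems = []
    if len(phones) != len(tones):
        problems.append(
            f"{len(phones)} phones but {len(tones)} tones (must match 1:1)")
    if sum(word2ph) != len(phones):
        problems.append(
            f"word2ph sums to {sum(word2ph)} but there are {len(phones)} phones")
    return problems

# Example: a well-formed sample reports no problems.
assert check_utterance(["k", "a", "m", "s", "a"], [0, 1, 0, 0, 2], [3, 2]) == []
```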

jeremy110 commented 2 months ago

@Colinsnow1

Sorry, I haven't trained Thai, but I have successfully trained my own language using IPA with pretrained/G.pth. I have posted my log in that issue, so you might need to look for it.

Colinsnow1 commented 2 months ago

@jeremy110 Thanks a lot !

yiwei0730 commented 2 months ago

Doesn't this model already support Korean?

Based on my experience, if you want to train a Korean language model with your own data, you can take a pre-trained Korean model and then train it with about 6 to 7 hours of data for nearly 300 epochs.

Moreover, if the trained speech is unintelligible, the issue usually occurs in processing phones, tones, or word2ph.

I understand: you mean using the Korean checkpoint as the pre-training checkpoint.

But I want to start training from scratch (as if training a new language, including the phoneme embedding enc_p.emb.weight), and this seems to be very difficult. I don't know if there is any trick.

jeremy110 commented 2 months ago

Starting training from scratch can be quite challenging as it requires a substantial amount of training data. Additionally, may I ask if you have added new symbols? If so, please refer to this to ensure you are using the pre-trained weights.
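For the added-symbols case, one common workaround is to merge the pretrained weights tensor by tensor and skip, or partially copy, anything whose shape no longer matches, which is exactly what happens to enc_p.emb.weight when the symbol table grows. A sketch under the assumption that new symbols are appended to the end of symbols.py; this is not the actual MeloTTS loader:

```python
import torch

def load_pretrained_partial(model, ckpt_path):
    """Load a pretrained generator checkpoint, skipping tensors whose shape
    no longer matches the current model (e.g. enc_p.emb.weight after new
    symbols were appended). A sketch, not the MeloTTS loading code."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    pretrained = ckpt.get("model", ckpt)   # VITS-style checkpoints nest weights under "model"
    merged = {}
    for name, tensor in model.state_dict().items():
        if name in pretrained and pretrained[name].shape == tensor.shape:
            merged[name] = pretrained[name]
        elif name == "enc_p.emb.weight" and name in pretrained:
            # New symbols appended at the end: reuse the rows for the old
            # symbols, keep the freshly initialized rows for the new ones.
            new = tensor.detach().clone()
            n = min(new.shape[0], pretrained[name].shape[0])
            new[:n] = pretrained[name][:n]
            merged[name] = new
        else:
            merged[name] = tensor          # keep the model's own initialization
    model.load_state_dict(merged)
    return model
```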

yiwei0730 commented 2 months ago

I didn't change anything; I just used 30,000 utterances for training. But enc_p.emb.weight does seem difficult to train, so I wanted to ask whether there are any techniques for training it. I had hoped that once a method was found, it would be possible to train new languages. But now the question seems to be: when we add a new language's phones, how do we train their phoneme embedding parameters (enc_p.emb.weight)?

jeremy110 commented 2 months ago

Could you provide your loss curve and your train.list? 30,000 utterances should be enough to achieve good results.

yiwei0730 commented 2 months ago

[screenshot: loss curves]

train.list example:

data/KSS/1/10762.wav|KSS1|KR|치마 길이를 조금만 늘리고 싶어요.| ᄎ ᅵ ᄆ ᅡ ᄀ ᅵ ᄅ ᅵ ᄅ ᅳ ᆯ ᄌ ᅩ ᄀ ᅳ ᆷ ᄆ ᅡ ᆫ ᄂ ᅳ ᆯ ᄅ ᅵ ᄀ ᅩ ᄉ ᅵ ᄑ ᅥ ᄋ ᅭ . _|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|1 4 7 8 7 6 1 1
data/KSS/1/10667.wav|KSS1|KR|제 말을 믿으셔도 됩니다.| ᄌ ᅦ ᄆ ᅡ ᄅ ᅳ ᆯ ᄆ ᅵ ᄃ ᅳ ᄉ ᅧ ᄃ ᅩ ᄃ ᅬ ᆷ ᄂ ᅵ ᄃ ᅡ . _|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|1 2 5 4 4 7 1 1
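For readers following along, each train.list line has seven pipe-separated fields: wav_path|speaker|language|text|phones|tones|word2ph. A small sketch that pulls one line apart into those fields (not MeloTTS code):

```python
def parse_train_list_line(line):
    """Split one train.list line into its seven pipe-separated fields."""
    wav_path, speaker, language, text, phones, tones, word2ph = (
        line.rstrip("\n").split("|"))
    return {
        "wav_path": wav_path,
        "speaker": speaker,
        "language": language,
        "text": text,
        "phones": phones.split(),                     # phone symbols
        "tones": [int(t) for t in tones.split()],     # one tone id per phone
        "word2ph": [int(n) for n in word2ph.split()], # phones per word/character
    }
```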

jeremy110 commented 2 months ago

train.list looks normal

Regarding the losses, your g/mel, g/kl, and g/fm are almost the same as mine, but your g/total is higher than mine by more than ten. What is your g/dur?

Also, could you provide these two audio files for me?

yiwei0730 commented 2 months ago

[screenshot: loss curves] KSS.zip

Yes, the loss is high, but I think the reason is enc_p.emb.weight, because I successfully trained a Korean model starting from the released KR checkpoint, while training from the base checkpoint does not succeed. I want to train other languages in the future, so I wanted to ask whether there is a way to train a new language from the base model. I saw the Thai issue you mentioned before, but it seems that author ran into the same problem as me: even after adding the symbols, the model could not be trained successfully.

jeremy110 commented 2 months ago

Indeed, it has a lot to do with enc_p.emb because I am also unsure how KR's checkpoint was trained.

If your method is the same as mine, converting the language to IPA for training: I also added a few new symbols and made sure the phones, tones, and word-to-phone mappings were correct. With this approach, it is possible to train successfully.

As for Thai, I am not certain if it was ultimately trained successfully, but a lot of time was spent on handling the IPA and word-to-phone mappings.

yiwei0730 commented 2 months ago

I want to confirm your approach and ask which language you used for training. For example, if I want to build a model for a new language and switch the text processing from the current method to an IPA-based one, so that the whole training pipeline follows the IPA text processing, would training then succeed?

jeremy110 commented 2 months ago

The language I am training is Hokkien, a language spoken in Taiwan.

  1. Fine-tune Chinese-wwm (the BERT model) on my own text.
  2. Use a G2P module to convert the text into IPA.
  3. Process phones, tones, and word2ph.
  4. During preprocessing, unknown symbols are identified.
  5. Add the new symbols to symbols.py, or replace some existing symbols.
  6. Rerun preprocessing.
  7. Add the number of tones and the language tag in symbols.py.
  8. Start training.

In my case, the training can be successful.
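To make steps 2 through 6 concrete, here is a rough sketch of that data-preparation flow. my_g2p and the other names are placeholders rather than MeloTTS functions, and it assumes new symbols are simply appended to symbols.py:

```python
# Rough sketch of the IPA-based preprocessing flow described above.
# `my_g2p` is a placeholder for your own grapheme-to-IPA converter and is
# assumed to return a list of IPA phones for one word.

def text_to_training_fields(text, my_g2p, known_symbols):
    phones, tones, word2ph = [], [], []
    unknown = set()
    for word in text.split():
        word_phones = my_g2p(word)            # step 2: G2P -> IPA phones
        phones.extend(word_phones)
        tones.extend([0] * len(word_phones))  # use real tone ids if the language is tonal
        word2ph.append(len(word_phones))      # step 3: word-to-phone mapping
        unknown.update(p for p in word_phones if p not in known_symbols)
    if unknown:
        # Steps 4-5: add these to symbols.py (appending keeps old indices
        # valid), then rerun preprocessing (step 6).
        print("Unknown symbols to add:", sorted(unknown))
    return phones, tones, word2ph
```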

yiwei0730 commented 2 months ago

I feel that this approach is indeed possible, but I don't know why training Korean causes enc_p.emb to fail.

Sorry, I hope this isn't too forward: could I ask you for the code of your Hokkien pipeline? If you can't provide it, that's okay. If you can, I'd be very grateful. Thank you.

jeremy110 commented 2 months ago

I might need some time to organize and then upload to GitHub. However, I won't be able to provide the BERT model and G2P parts, but you can still refer to the parts I've modified.

yiwei0730 commented 2 months ago

Thank you very much! By the way, is there anything special about Hokkien’s G2P?

jeremy110 commented 2 months ago

The G2P model was mainly written by other colleagues; I just used it. https://github.com/jeremy110/MeloTTS_hokkien If you see any SP symbols, you can ignore them. I was just experimenting with something else.

yiwei0730 commented 2 months ago

It looks normal! The main thing is to add the symbols and then use and train as usual. My impression is that the Korean acoustic model breaks in both Bert-VITS2 and Melo training, which is really strange. I am also considering doing Vietnamese, Hakka, and Hokkien in the future, but I have been stuck on the acoustic training part, so I am a little confused.

jeremy110 commented 2 months ago

Yes, I didn't modify the model part; I only did data processing. I'm also not sure how their acoustic models were trained. If you want to work on three languages simultaneously, I think it would be more convenient to use IPA uniformly, take a model that has already been trained on a large amount of speech data as the pre-trained model, and continue with fine-tuning.
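Purely as an illustration of the "use IPA uniformly" idea (none of these names exist in MeloTTS), each language's G2P output would be normalized into one shared symbol inventory so that every language indexes the same rows of enc_p.emb:

```python
# Illustrative only: route each language's text through its own G2P and
# normalize the output into one shared IPA inventory.

def _dummy_g2p(text):
    # Placeholder: real converters for Hokkien, Hakka, or Vietnamese would
    # return proper IPA phone lists here.
    return list(text.replace(" ", ""))

G2P_BY_LANGUAGE = {"NAN": _dummy_g2p, "HAK": _dummy_g2p, "VI": _dummy_g2p}

# Map language-specific variants onto one shared symbol where they are
# phonetically equivalent, keeping the shared symbol table small.
MERGE = {"ɡ": "g", "ʦ": "ts"}

def to_shared_ipa(text, language):
    phones = G2P_BY_LANGUAGE[language](text)
    return [MERGE.get(p, p) for p in phones]
```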

UltramanSleepless commented 2 months ago

When you start training, the base model will be downloaded automatically. If you want to train a new language, you can refer to #120

I have finished the training process, but when I run "python infer.py" I get this error:

RuntimeError: Error(s) in loading state_dict for SynthesizerTrn: size mismatch for emb_g.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([1, 256]).

jeremy110 commented 2 months ago

@UltramanSleepless It looks like you have modified the number of symbols. Please ensure that the number of symbols in your config.json matches the number of symbols in symbols.py used during training.
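One way to see what the checkpoint actually expects, before comparing against config.json and symbols.py, is to print the relevant shapes straight from the saved state_dict. A debugging sketch with a placeholder path:

```python
import torch

# Hypothetical path; point this at the checkpoint you are trying to load.
ckpt = torch.load("logs/example/G_latest.pth", map_location="cpu")
state = ckpt.get("model", ckpt)  # VITS-style checkpoints nest weights under "model"

# In VITS-style models, emb_g.weight is (n_speakers, gin_channels) and
# enc_p.emb.weight is (n_symbols, hidden_channels), so the first dimension
# of each tells you the speaker count / symbol count the checkpoint was
# trained with.
for name in ("emb_g.weight", "enc_p.emb.weight"):
    if name in state:
        print(name, tuple(state[name].shape))
```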