myshell-ai / MeloTTS

High-quality multi-lingual text-to-speech library by MyShell.ai. Supports English, Spanish, French, Chinese, Japanese, and Korean.
MIT License
3.98k stars · 476 forks

Request for PR review: Add support for Thai language #120

Open · jadechip opened this issue 2 months ago

jadechip commented 2 months ago

I have created the following PR to add support for Thai language. I am in the process of creating a dataset to train the model but would love a PR review of the code first to make sure I am on the right track.

Thank you!


jadechip commented 4 weeks ago

Thank you @jeremy110. To clarify: I have switched the tones for the "_" characters to zeroes. However, I am a bit confused about the format of the word2ph list. I should note that I have switched the tokenizer in the g2p method, so the word mapping might be a bit different: it now uses the same tokenizer as the get_bert_feature function, which I believe is closer to the implementation for the other languages. As an example, the phrase ใครเป็นผู้รับ is tokenized into the chunks ['▁ใคร', 'เป็นผู้รับ']. This results in a word2ph list of [1, 3, 8, 1], where the ones are the underscore characters, the 3 covers ใ ค ร, and the 8 covers เ ป็ น ผ้ ู ร ั บ. ...and there are 13 tones (one assigned to each phoneme).

tokenized ['▁ใคร', 'เป็นผู้รับ']
Final phs: ['_', 'kʰ', 'r', 'aj', 'p', 'e', 'n', 'pʰ', 'uː', 'r', 'a', 'p̚', '_']
Final tones: [0, 2, 2, 2, 2, 2, 2, 5, 5, 3, 3, 3, 0]
Final word2ph: [1, 3, 8, 1]
len(phones) 13
len(tones) 13

Is this correct?
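(As a quick self-check, one invariant that must hold regardless of how the chunks are grouped is that the word2ph counts sum to the total number of phones and tones. A minimal sketch using the values above:)

```python
# Values copied from the output above.
phones = ['_', 'kʰ', 'r', 'aj', 'p', 'e', 'n', 'pʰ', 'uː', 'r', 'a', 'p̚', '_']
tones = [0, 2, 2, 2, 2, 2, 2, 5, 5, 3, 3, 3, 0]
word2ph = [1, 3, 8, 1]

# Each word2ph entry is the phone count for one token, so together the
# entries must account for every phone (and its tone) exactly once.
assert sum(word2ph) == len(phones) == len(tones) == 13
```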

jeremy110 commented 4 weeks ago

Hello @jadechip I think you can refer to the French section. Below is an example in French; you can see that word2ph is calculated by converting words into their IPA phones. Here, sə- corresponds to 3, sɛʁvˈis to 7, ɡʁatyˈi to 7, and ɛt to 2.

French: Ce service gratuit est disponible en chinois simplifié et autres 123.
ipa: sə- sɛʁvˈis ɡʁatyˈi ɛt disponˈibl ɑ̃n ʃinwˈa sɛ̃plifjˈe e otʁz sˈɑ̃ vˈɛ̃ tʁwˈa.

phones: ['_', 's', 'ə', '-', 's', 'ɛ', 'ʁ', 'v', 'ˈ', 'i', 's', 'ɡ', 'ʁ', 'a', 't', 'y', 'ˈ', 'i', 'ɛ', 't', 'd', 'i', 's', 'p', 'o', 'n', 'ˈ', 'i', 'b', 'l', 'ɑ', '̃', 'n', 'ʃ', 'i', 'n', 'w', 'ˈ', 'a', 's', 'ɛ', '̃', 'p', 'l', 'i', 'f', 'j', 'ˈ', 'e', 'e', 'o', 't', 'ʁ', 'z', 's', 'ˈ', 'ɑ', '̃', 'v', 'ˈ', 'ɛ', '̃', 't', 'ʁ', 'w', 'ˈ', 'a', '.', '_']
tones: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
word2ph: [1, 3, 7, 7, 2, 10, 3, 6, 5, 5, 1, 4, 13, 1, 1]

In your case it should be:

Final phs: ['_', 'kʰ', 'r', 'aj',       'p', 'e', 'n', 'pʰ', 'uː', 'r', 'a', 'p̚', '_']
Final tones: [0, 2, 2, 2,       2, 2, 2, 5, 5, 3, 3, 3, 0]
Final word2ph: [1, 3,      3, 2, 3, 1]
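(A small sketch of how this corrected word2ph splits the flat phone list back into per-token groups; the helper name is hypothetical, not from the MeloTTS codebase:)

```python
def split_by_word2ph(phones, word2ph):
    # Group the flat phone list into the chunks described by word2ph:
    # each entry n takes the next n phones.
    groups, i = [], 0
    for n in word2ph:
        groups.append(phones[i:i + n])
        i += n
    return groups

phs = ['_', 'kʰ', 'r', 'aj', 'p', 'e', 'n', 'pʰ', 'uː', 'r', 'a', 'p̚', '_']
word2ph = [1, 3, 3, 2, 3, 1]
# → [['_'], ['kʰ', 'r', 'aj'], ['p', 'e', 'n'], ['pʰ', 'uː'], ['r', 'a', 'p̚'], ['_']]
print(split_by_word2ph(phs, word2ph))
```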

maryne-ii commented 2 weeks ago

@jadechip Hello there, how did your training go? Are you still struggling with the pronunciation?

jadechip commented 2 weeks ago

Hi @maryne-ii, I believe the pronunciation issues are resolved; however, I am having trouble getting distributed training to work. This is the error I am getting:

terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at ../third_party/gloo/gloo/transport/tcp/pair.cc:446] op.preamble.length <= op.nbytes. 874668 vs 80644

My understanding is that the gloo library is used by PyTorch for collective communication in distributed training, so the error presumably indicates some kind of mismatch between expected and actual sizes during TCP communication, but I am not sure what in my code, if anything, is causing it...

I should also note I have been using PyTorch > 2.x to train as I was getting other CUDA errors similar to #96.

Training on a single GPU seems to work though.
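(For context, the failing assertion inside gloo's TCP transport is a size guard: the payload length declared in a message preamble must not exceed the bytes the receiving op expects. Expressed in pure-Python terms, illustrative only, not the actual C++ implementation:)

```python
def preamble_ok(preamble_length: int, expected_nbytes: int) -> bool:
    # Mirrors the enforced condition `op.preamble.length <= op.nbytes`:
    # the sender's declared payload size must fit within what the
    # receiving operation expects.
    return preamble_length <= expected_nbytes

# The log above reports 874668 declared vs 80644 expected, so the check fails.
assert not preamble_ok(874668, 80644)
```

A common trigger for this class of error is ranks exchanging differently-sized data (for example, uneven tensor shapes across GPUs), though the log alone does not pinpoint the cause here.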

jadechip commented 3 days ago

I've had some time to continue working on this and was able to resolve the training issues. I believe the inconsistencies were caused by the tokenizer I was using; I have now switched to a tokenizer that aligns more closely with the format the codebase expects. The output is close to what @jeremy110 suggested, apart from the underscore characters. I am not sure whether I should remove the underscore characters from the tokenized text before calculating the phs, tones, and word2ph values. I am concerned that removing them might cause inconsistencies with the get_bert_feature function later in the pipeline.

The tokenized text ['▁', 'กง', 'ล้อ']
Final phs: ['_', '▁', 'k', 'o', 'ŋ', 'l', 'ɔː', '_']
Final tones: [0, 2, 2, 2, 2, 3, 3, 0]
Final word2ph: [1, 1, 3, 2, 1]
bert features shape torch.Size([768, 8])
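(If the '▁' placeholder is dropped, the corresponding tone and word2ph entries have to be dropped in lockstep so the lengths stay consistent. A hypothetical helper sketching that, not part of MeloTTS:)

```python
def strip_placeholder(phs, tones, word2ph, placeholder='▁'):
    # Remove placeholder phones together with their tones, and shrink
    # word2ph accordingly; a token that was only a placeholder vanishes.
    new_phs, new_tones, new_w2p = [], [], []
    i = 0
    for count in word2ph:
        chunk = list(zip(phs[i:i + count], tones[i:i + count]))
        i += count
        kept = [(p, t) for p, t in chunk if p != placeholder]
        if kept:
            new_phs += [p for p, _ in kept]
            new_tones += [t for _, t in kept]
            new_w2p.append(len(kept))
    return new_phs, new_tones, new_w2p

phs, tones, w2p = strip_placeholder(
    ['_', '▁', 'k', 'o', 'ŋ', 'l', 'ɔː', '_'],
    [0, 2, 2, 2, 2, 3, 3, 0],
    [1, 1, 3, 2, 1],
)
# word2ph becomes [1, 3, 2, 1] and all lengths stay aligned at 7.
assert sum(w2p) == len(phs) == len(tones) == 7
```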

jeremy110 commented 3 days ago

I think it can be removed because it duplicates the original underscores.

jadechip commented 3 days ago

Ok, thank you, I will give this a shot.