myshell-ai / MeloTTS

High-quality multi-lingual text-to-speech library by MyShell.ai. Support English, Spanish, French, Chinese, Japanese and Korean.
MIT License

Request for PR review: Add support for Thai language #120

Open jadechip opened 6 months ago

jadechip commented 6 months ago

I have created the following PR to add support for Thai language. I am in the process of creating a dataset to train the model but would love a PR review of the code first to make sure I am on the right track.

Thank you!

#117

tchayintr commented 6 months ago

Great job! I was planning to work on training a Thai TTS model using MeloTTS too.

Zengyi-Qin commented 6 months ago

Hi - Thanks for the contribution. We would suggest you first train on the Thai dataset to see if the code works. We haven't made any attempt to train on Thai.

jadechip commented 6 months ago

@Zengyi-Qin Sounds good, will report back once I have proper training results.

jadechip commented 6 months ago

Thank you @tchayintr, if you have any recommendations for Thai audio datasets, I would greatly appreciate it!

tchayintr commented 6 months ago

@jadechip Sure! There are several datasets such as TSync2, Lotus, etc. You can check several of them here: https://github.com/korakot/corpus/releases/tag/v1.0 with documentation at https://lexitron.nectec.or.th/KM_HL5001/file_HL5001/Document/krrn_14518.pdf.

There are also Thai dialects available at https://github.com/SLSCU/thai-dialect-corpus.

However, I recommend collecting clear voice clips and crafting their transcriptions with ASR tools like WhisperX. This way, you can generate a lot of samples, but you may need to fine-tune it for the Thai language 😄.

I am reviewing your commits too. They mostly look great 🎆 , but I found some points that need to be clarified. I will clarify and let you know if there is a point that may need to be adjusted in terms of Thai linguistic knowledge.

jadechip commented 6 months ago

@tchayintr this is super helpful and any feedback you have for my code will be greatly appreciated 🙏 I was also looking at this other nectec dataset: https://github.com/vistec-AI/dataset-releases/releases/tag/v1 I'll work on creating transcriptions next and report back.

jadechip commented 6 months ago

@Zengyi-Qin are there any additional steps or files needed before training? I am getting the following error:

output

⚡ add-thai ~/MeloTTS/melo torchrun --nproc_per_node=1 --master_port=10902 train.py --c data/thai/config.json --model thai
2024-05-07 15:24:58.152 | INFO     | data_utils:_filter:64 - Init dataset...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 141910/141910 [00:04<00:00, 32864.77it/s]
2024-05-07 15:25:02.475 | INFO     | data_utils:_filter:84 - min: 65; max: 987
2024-05-07 15:25:02.475 | INFO     | data_utils:_filter:85 - skipped: 327, total: 141910
buckets: [92994, 31326, 11604, 4350, 1068, 156, 84, 24]
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
2024-05-07 15:25:02.699 | INFO     | data_utils:_filter:64 - Init dataset...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 32832.13it/s]
2024-05-07 15:25:02.700 | INFO     | data_utils:_filter:84 - min: 164; max: 625
2024-05-07 15:25:02.700 | INFO     | data_utils:_filter:85 - skipped: 0, total: 4
Using noise scaled MAS for VITS2
Using duration discriminator for VITS2
(torch.Size([219, 192]), torch.Size([360, 192]))
(torch.Size([16, 192]), torch.Size([17, 192]))
(torch.Size([10, 192]), torch.Size([9, 192]))
(torch.Size([256, 256]), torch.Size([1, 256]))
list index out of range
  0%|                                                                                                                                                        | 0/23601 [00:01<?, ?it/s]
Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/teamspace/studios/this_studio/MeloTTS/melo/data_utils.py", line 194, in __getitem__
    return self.get_audio_text_speaker_pair(self.audiopaths_sid_text[index])
  File "/teamspace/studios/this_studio/MeloTTS/melo/data_utils.py", line 98, in get_audio_text_speaker_pair
    bert, ja_bert, phones, tone, language = self.get_text(
  File "/teamspace/studios/this_studio/MeloTTS/melo/data_utils.py", line 180, in get_text
    raise
RuntimeError: No active exception to reraise

...it seems to happen around line 200 in train.py


config.json

{
  "train": {
    "log_interval": 200,
    "eval_interval": 1000,
    "seed": 52,
    "epochs": 10000,
    "learning_rate": 0.0003,
    "betas": [
      0.8,
      0.99
    ],
    "eps": 1e-09,
    "batch_size": 6,
    "fp16_run": false,
    "lr_decay": 0.999875,
    "segment_size": 16384,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0,
    "skip_optimizer": true
  },
  "data": {
    "training_files": "data/thai/train.list",
    "validation_files": "data/thai/val.list",
    "max_wav_value": 32768.0,
    "sampling_rate": 44100,
    "filter_length": 2048,
    "hop_length": 512,
    "win_length": 2048,
    "n_mel_channels": 128,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 1,
    "cleaned_text": true,
    "spk2id": {
      "TH-default": 0
    }
  },
  "model": {
    "use_spk_conditioned_encoder": true,
    "use_noise_scaled_mas": true,
    "use_mel_posterior_encoder": false,
    "use_duration_discriminator": true,
    "inter_channels": 192,
    "hidden_channels": 192,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 6,
    "n_layers_trans_flow": 3,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [
      3,
      7,
      11
    ],
    "resblock_dilation_sizes": [
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ]
    ],
    "upsample_rates": [
      8,
      8,
      2,
      2,
      2
    ],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [
      16,
      16,
      8,
      2,
      2
    ],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "gin_channels": 256
  },
  "num_languages": 9,
  "num_tones": 17,
  "symbols": [
    "_",
    "\"",
    "(",
    ")",
    "*",
    "/",
    ":",
    "AA",
    "E",
    "EE",
    "En",
    "N",
    "OO",
    "Q",
    "V",
    "[",
    "\\",
    "]",
    "^",
    "a",
    "a:",
    "aa",
    "ae",
    "ah",
    "ai",
    "an",
    "ang",
    "ao",
    "aw",
    "ay",
    "b",
    "by",
    "c",
    "ch",
    "d",
    "dh",
    "dy",
    "e",
    "e:",
    "eh",
    "ei",
    "en",
    "eng",
    "er",
    "ey",
    "f",
    "g",
    "gy",
    "h",
    "hh",
    "hy",
    "i",
    "i0",
    "i:",
    "ia",
    "ian",
    "iang",
    "iao",
    "ie",
    "ih",
    "in",
    "ing",
    "iong",
    "ir",
    "iu",
    "iy",
    "j",
    "jh",
    "k",
    "ky",
    "l",
    "m",
    "my",
    "n",
    "ng",
    "ny",
    "o",
    "o:",
    "ong",
    "ou",
    "ow",
    "oy",
    "p",
    "py",
    "q",
    "r",
    "ry",
    "s",
    "sh",
    "t",
    "th",
    "ts",
    "ty",
    "u",
    "u:",
    "ua",
    "uai",
    "uan",
    "uang",
    "uh",
    "ui",
    "un",
    "uo",
    "uw",
    "v",
    "van",
    "ve",
    "vn",
    "w",
    "x",
    "y",
    "z",
    "zh",
    "zy",
    "~",
    "æ",
    "ç",
    "ð",
    "ø",
    "ŋ",
    "œ",
    "ɐ",
    "ɑ",
    "ɒ",
    "ɔ",
    "ɕ",
    "ə",
    "ɛ",
    "ɜ",
    "ɡ",
    "ɣ",
    "ɥ",
    "ɦ",
    "ɪ",
    "ɫ",
    "ɬ",
    "ɭ",
    "ɯ",
    "ɲ",
    "ɵ",
    "ɸ",
    "ɹ",
    "ɾ",
    "ʁ",
    "ʃ",
    "ʊ",
    "ʌ",
    "ʎ",
    "ʏ",
    "ʑ",
    "ʒ",
    "ʝ",
    "ʲ",
    "ˈ",
    "ˌ",
    "ː",
    "̃",
    "̩",
    "β",
    "θ",
    "ก",
    "ข",
    "ฃ",
    "ค",
    "ฅ",
    "ฆ",
    "ง",
    "จ",
    "ฉ",
    "ช",
    "ซ",
    "ฌ",
    "ญ",
    "ฎ",
    "ฏ",
    "ฐ",
    "ฑ",
    "ฒ",
    "ณ",
    "ด",
    "ต",
    "ถ",
    "ท",
    "ธ",
    "น",
    "บ",
    "ป",
    "ผ",
    "ฝ",
    "พ",
    "ฟ",
    "ภ",
    "ม",
    "ย",
    "ร",
    "ล",
    "ว",
    "ศ",
    "ษ",
    "ส",
    "ห",
    "ฬ",
    "อ",
    "ฮ",
    "ะ",
    "ั",
    "า",
    "ำ",
    "ิ",
    "ี",
    "ึ",
    "ื",
    "ุ",
    "ู",
    "เ",
    "แ",
    "โ",
    "ใ",
    "ไ",
    "ๅ",
    "็",
    "่",
    "้",
    "์",
    "๐",
    "๑",
    "๒",
    "๓",
    "๔",
    "๕",
    "๖",
    "๗",
    "๘",
    "๙",
    "ᄀ",
    "ᄁ",
    "ᄂ",
    "ᄃ",
    "ᄄ",
    "ᄅ",
    "ᄆ",
    "ᄇ",
    "ᄈ",
    "ᄉ",
    "ᄊ",
    "ᄋ",
    "ᄌ",
    "ᄍ",
    "ᄎ",
    "ᄏ",
    "ᄐ",
    "ᄑ",
    "ᄒ",
    "ᅡ",
    "ᅢ",
    "ᅣ",
    "ᅤ",
    "ᅥ",
    "ᅦ",
    "ᅧ",
    "ᅨ",
    "ᅩ",
    "ᅪ",
    "ᅫ",
    "ᅬ",
    "ᅭ",
    "ᅮ",
    "ᅯ",
    "ᅰ",
    "ᅱ",
    "ᅲ",
    "ᅳ",
    "ᅴ",
    "ᅵ",
    "ᆨ",
    "ᆫ",
    "ᆮ",
    "ᆯ",
    "ᆷ",
    "ᆸ",
    "ᆼ",
    "ㄸ",
    "!",
    "?",
    "…",
    ",",
    ".",
    "'",
    "-",
    "¿",
    "¡",
    "SP",
    "UNK"
  ]
}
jadechip commented 6 months ago

Nevermind, I was able to pinpoint the issue, I didn't realize you needed to add the language code here as well:

image
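
For readers hitting the same traceback: `RuntimeError: No active exception to reraise` is what Python raises when a bare `raise` runs outside an `except` block. That is consistent with a per-language dispatch falling through to an `else: raise` branch when the language code is not registered. A minimal sketch of that behaviour follows; `select_language`, `select_language_strict` and `KNOWN_LANGUAGES` are hypothetical names for illustration and are not copied from melo/data_utils.py.

```python
# Hypothetical stand-in for a per-language dispatch.
KNOWN_LANGUAGES = {"ZH", "JP", "EN"}


def select_language(language_str, known=KNOWN_LANGUAGES):
    if language_str in known:
        return language_str
    else:
        # A bare `raise` re-raises the exception currently being handled.
        # Nothing is being handled here, so Python raises
        # "RuntimeError: No active exception to reraise" instead.
        raise


def select_language_strict(language_str, known=KNOWN_LANGUAGES):
    # Raising an explicit exception names the actual problem.
    if language_str not in known:
        raise ValueError(f"unsupported language code: {language_str!r}")
    return language_str


try:
    select_language("TH")
except RuntimeError as err:
    message = str(err)
```

With an explicit exception, a missing language code would be reported by name rather than as a generic `RuntimeError`.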

I've updated my PR with the missing code. It seems to be training correctly now, although I am still getting some warnings/exceptions:

Using noise scaled MAS for VITS2
Using duration discriminator for VITS2
(torch.Size([219, 192]), torch.Size([360, 192]))
(torch.Size([16, 192]), torch.Size([17, 192]))
(torch.Size([10, 192]), torch.Size([9, 192]))
(torch.Size([256, 256]), torch.Size([1, 256]))
list index out of range
  0%|                                                                                                                                                        | 0/23601 [00:00<?, ?it/s][W reducer.cpp:1298] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [1, 9, 96], strides() = [99168, 96, 1]
bucket_view.sizes() = [1, 9, 96], strides() = [864, 96, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Evaluating ...
Evauate done
  0%|▍                                                                                                                                           | 74/23601 [03:24<11:00:36,  1.68s/it]min value is  tensor(-1.1265)

Will try to run the complete training loop on some H100s 🤞

acul3 commented 6 months ago

Hello @jadechip, let me know if it's working.

I'm training for Indonesian and Malay,

changing the phonemes and BERT as well.

After 10 epochs the model doesn't produce any intelligible words, only some noise and random vowels.

My data: ~200 hours, ~500 speakers.

jeremy110 commented 6 months ago

hello @jadechip @acul3

It seems there might be an issue with the training process. According to the current code, if your symbol size is not equal to the original 219, a new size will be used to initialize the TextEncoder. This means that you are not utilizing the base model, but rather retraining it. Based on my previous tests, this could lead to strange results where the model fails to properly generate text.

Solution: Similar to adding a new vocabulary to BERT, you should modify the loading process of the model. https://huggingface.co/transformers/v2.11.0/_modules/transformers/modeling_utils.html (_get_resized_embeddings function)

jadechip commented 6 months ago

Thank you @jeremy110. If I understand correctly, in melo/models.py we should first initialize the TextEncoder with the original 219 symbols in order to use the pretrained weights, like this:

# models.py
        self.enc_p = TextEncoder(
            219,  # Initialize with the original symbol size
            inter_channels,
            hidden_channels,
            filter_channels,
            n_heads,
            n_layers,
            kernel_size,
            p_dropout,
            gin_channels=self.enc_gin_channels,
            num_languages=num_languages,
            num_tones=num_tones,
        )

...then right after, add a check for whether n_vocab (len(symbols)) has a different size, and if so replace self.enc_p.emb with the resized embeddings?

if n_vocab != 219:
    old_embeddings = self.enc_p.emb
    new_num_tokens = n_vocab
    self.enc_p.emb = self.get_resized_embeddings(old_embeddings, new_num_tokens)

Does that look correct to you? Note: I've updated my PR to reflect this.
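
The resizing idea can be illustrated without any framework. The sketch below treats an embedding table as a list of rows, keeps the existing rows unchanged, and randomly initialises rows for the new symbols. `resize_embedding_rows` is a hypothetical helper written for illustration only; it is not the `get_resized_embeddings` method from the PR, and a real version would operate on the weight tensor of the embedding layer.

```python
import random


def resize_embedding_rows(old_rows, new_num_tokens, init_std=0.02, seed=0):
    """Return a table with new_num_tokens rows that reuses the old rows.

    Hypothetical helper: rows are plain lists of floats, and init_std is an
    arbitrary choice for the scale of the newly added rows.
    """
    if not old_rows:
        raise ValueError("old_rows must contain at least one row")
    dim = len(old_rows[0])
    rng = random.Random(seed)
    # Copy as many pretrained rows as fit in the new table.
    num_to_copy = min(len(old_rows), new_num_tokens)
    new_rows = [list(row) for row in old_rows[:num_to_copy]]
    # Rows for symbols the base model never saw start from small random values.
    for _ in range(new_num_tokens - num_to_copy):
        new_rows.append([rng.gauss(0.0, init_std) for _ in range(dim)])
    return new_rows


# Toy sizes: 4 "pretrained" symbols, 3 new ones, embedding width 2.
pretrained = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
resized = resize_embedding_rows(pretrained, new_num_tokens=7)
```

The copied rows only line up with the right symbols if every original symbol keeps its original index, so the order of the symbol list matters as much as its length.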

jeremy110 commented 6 months ago

hello~ @jadechip

Yes, it looks fine as it is.

However, in symbols.py, you'll need to make some modifications. If you place your new symbol inside the sorted list and then use the method above, it may result in some symbols having weights that don't match up with the original model. So, I suggest you do it like this.

# combine all symbols
normal_symbols = sorted(set(zh_symbols + ja_symbols + en_symbols + kr_symbols + es_symbols + fr_symbols + de_symbols + ru_symbols))
symbols = [pad] + normal_symbols + pu_symbols + new_symbols # add new symbols here 
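
To see why the position of the new symbols matters, here is a small sketch with toy symbol lists (the real lists are much longer, and `moved_symbols` is a hypothetical helper). Sorting the new symbols into the existing block shifts the indices of symbols that come after them, while appending leaves every original index unchanged.

```python
pad = "_"
pu_symbols = ["!", "?", ",", ".", "SP", "UNK"]  # toy subset of the punctuation symbols
base_symbols = ["a", "k", "n", "o"]             # toy stand-in for the existing languages
th_symbols = ["ŋ", "ɔː", "aj"]                  # toy stand-in for the new symbols

# Symbol list the pretrained weights were trained against.
original = [pad] + sorted(set(base_symbols)) + pu_symbols

# Option 1: mix the new symbols into the sorted block.
mixed = [pad] + sorted(set(base_symbols + th_symbols)) + pu_symbols

# Option 2: append the new symbols after everything else.
appended = [pad] + sorted(set(base_symbols)) + pu_symbols + th_symbols


def moved_symbols(old, new):
    """Symbols from `old` whose index differs in `new`."""
    return [s for s in old if old.index(s) != new.index(s)]


shifted_when_mixed = moved_symbols(original, mixed)
shifted_when_appended = moved_symbols(original, appended)
```

In this toy case, mixing shifts "k", "n", "o" and all punctuation symbols, whereas appending shifts nothing.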
jadechip commented 6 months ago

I see, thank you for the heads up @jeremy110 🙏 I've updated my code to reflect your suggestion, so now I have:

# combine all symbols
normal_symbols = sorted(set(zh_symbols + ja_symbols + en_symbols + kr_symbols + es_symbols + fr_symbols + de_symbols + ru_symbols))
symbols = [pad] + normal_symbols + pu_symbols + th_symbols
sil_phonemes_ids = [symbols.index(i) for i in pu_symbols]

# combine all tones
num_tones = num_zh_tones + num_ja_tones + num_en_tones + num_kr_tones + num_es_tones + num_fr_tones + num_de_tones + num_ru_tones + num_th_tones

# language maps
language_id_map = {"ZH": 0, "JP": 1, "EN": 2, "ZH_MIX_EN": 3, 'KR': 4, 'ES': 5, 'SP': 5, 'FR': 6, 'TH': 7}
num_languages = len(language_id_map.keys())

I'll try running a new training job to evaluate performance with these changes.

acul3 commented 6 months ago

thanks @jadechip and @jeremy110

I'll try it in my environment as well and see if it works.

jadechip commented 6 months ago

Ok, I was able to run a training job for around 9k steps yesterday. I tried running inference using the new checkpoint, but it seems to produce unintelligible sounds. I think the learning rate looks ok though? ...so I will try ramping up the batch size and training for longer on multiple GPUs and report back with my results 🤞 For reference here is my current config and Tensorboard metrics.

{
  "train": {
    "log_interval": 200,
    "eval_interval": 1000,
    "seed": 52,
    "epochs": 10000,
    "learning_rate": 0.0003,
    "betas": [
      0.8,
      0.99
    ],
    "eps": 1e-09,
    "batch_size": 16,
    "fp16_run": false,
    "lr_decay": 0.999875,
    "segment_size": 16384,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0,
    "skip_optimizer": true
  },
  "data": {
    "training_files": "../Data/locutor/train.list",
    "validation_files": "../Data/locutor/val.list",
    "max_wav_value": 32768.0,
    "sampling_rate": 44100,
    "filter_length": 2048,
    "hop_length": 512,
    "win_length": 2048,
    "n_mel_channels": 128,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 1,
    "cleaned_text": true,
    "spk2id": {
      "locutor": 0
    }
  },
  "model": {
    "use_spk_conditioned_encoder": true,
    "use_noise_scaled_mas": true,
    "use_mel_posterior_encoder": false,
    "use_duration_discriminator": true,
    "inter_channels": 192,
    "hidden_channels": 192,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 6,
    "n_layers_trans_flow": 3,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [
      3,
      7,
      11
    ],
    "resblock_dilation_sizes": [
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ]
    ],
    "upsample_rates": [
      8,
      8,
      2,
      2,
      2
    ],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [
      16,
      16,
      8,
      2,
      2
    ],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "gin_channels": 256
  },
  "num_languages": 1,
  "num_tones": 16,
  "symbols": [
...
jadechip commented 6 months ago

btw I am currently training on a subset of Thai Common Voice 13, converted to .wav with a sample rate of 48 kHz. Edit: Happy weekend everyone 🎉

jeremy110 commented 6 months ago

hello~ @jadechip

My config is basically the same as yours, except my batch size is 6. Perhaps you can increase your learning rate to 9e-4 and see how it performs. Also, I've added a constraint to the clip_grad_value in the code.

grad_norm_d = commons.clip_grad_value_(net_d.parameters(), 200)
grad_norm_g = commons.clip_grad_value_(net_g.parameters(), 500)

Finally, I'm attaching my tensorboard for reference. (https://drive.google.com/drive/folders/1xPNURmWsmJqwEDHVM8ZsK6CAbuv65ipI?usp=sharing)

Additionally, if the silence before and after your audio files is shorter, your g/dur will converge to a smaller value, which will also affect the length of the silence before and after the inference.

I'm not sure if the Thai CommonVoice 13 dataset is suitable for training. Also, there's no need to specifically convert it to 48kHz. I remember that the code will resample it. I think you can start by testing whether it can be trained with 10 hours of data from one person.

I hope this is helpful for you.

jadechip commented 6 months ago

Thank you for sharing! Your advice has been super helpful @jeremy110 🙏

jadechip commented 6 months ago

Hmm, I trained for longer with different hyperparameters, but so far the results are not much better. Something might be wrong with my code.

acul3 commented 6 months ago

Yeah, me too.

With longer training the voice is clearer and more similar, but it can't pronounce a single word.

Maybe it's a phonemizer problem, idk.

jeremy110 commented 6 months ago

hello @jadechip @acul3 I'd like to confirm something. Are all your tones set to 0? Because I made a similar mistake before where I treated tones like ˧ ˦ as phones, but they should correspond to tones. Here's an example of what I did before.

#error
phones: ['_', 'k', 'e', 'ʔ', '˧', 'p', 'i', 'a', 'ʔ', '˧', 'ʦ', 'ʰ', 'i', 'n', '˦', '˦', 'k', 'e', '˦', '˦', ',', 'l', 'e', '˥', '˧', 's', 'ɔ', '˨', '˩', 'g', 'u', 'a', 'n', '˩', '˧', 'ʦ', 'a', 'i', '˧', '˧', '.', '_']
tones: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
word2ph: [1, 4, 5, 6, 4, 1, 4, 4, 6, 5, 1, 1]
#correct
phones: 28 ['_', 'k', 'e', 'ʔ', 'p', 'i', 'a', 'ʔ', 'ʦ', 'ʰ', 'i', 'n', 'k', 'e', ',', 'l', 'e', 's', 'ɔ', 'g', 'u', 'a', 'n', 'ʦ', 'a', 'i', '.', '_']
tones: [0, 4, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 0, 2, 2, 3, 3, 5, 5, 5, 5, 7, 7, 7, 0, 0]
word2ph: [1, 3, 4, 4, 2, 1, 2, 2, 4, 3, 1, 1]
acul3 commented 6 months ago

@jeremy110 yes, all my tones are set to 0.

Now I'm wondering how I can fix this.

jeremy110 commented 6 months ago

hello~ @acul3 @jadechip Sorry, I spent some time looking at that, but since I can't read Thai, I did some online research. I wanted to ask about the symbols from line 266 to 339 in th_symbols. Are those symbols not IPA?

Also, I looked at the Wiktionary file and found several symbols that seem to represent tones: ˧, ˨˩, ˦˥, ˩˦, and ˥˩. It looks like there are five tones. So, you need to convert these symbols into tones and then add the corresponding number of tones to the 'tones' list based on the number of phones in your phone list.

But I'm confused about lines 5908 to 5910. Which one is correct?

jadechip commented 6 months ago

@jeremy110 you are absolutely right. My code was outputting zeroes for the tones list. I've pushed some changes to the g2p function which hopefully addresses this:

def g2p(norm_text):
    tokenized = tokenizer.tokenize(norm_text)
    phs = []
    word2ph = []
    current_word = []
    current_phonemes = []

    for token in tokenized:
        if token.startswith("▁"):  # Start of a new word
            if current_word:
                word_phonemes = " ".join(current_phonemes)
                phs.extend(word_phonemes.split())
                word2ph.append(len(current_phonemes))
                current_word = []
                current_phonemes = []
            current_word.append(token.replace("▁", ""))
        else:
            current_word.append(token)

        if token in punctuation or token in pu_symbols:
            phs.append(token)
            word2ph.append(1)
        else:
            phonemes = thai_text_to_phonemes(token.replace("▁", ""))
            current_phonemes.extend(phonemes.split())

    if current_word:
        word_phonemes = " ".join(current_phonemes)
        phs.extend(word_phonemes.split())
        word2ph.append(len(current_phonemes))

    # Distribute phonemes to match the number of tokens
    distributed_word2ph = []
    for i, group in enumerate(tokenized):
        if group.startswith("▁"):
            group = group.replace("▁", "")
        if group in punctuation or group in pu_symbols:
            distributed_word2ph.append(1)
        else:
            phonemes = thai_text_to_phonemes(group)
            distributed_word2ph.append(len(phonemes.split()))

    tone_markers = ['˥', '˦', '˧', '˨', '˩']
    phones = ["_"] + [re.sub(f'[{"".join(tone_markers)}]', '', p) for p in phs] + ["_"]  # Remove tone markers from phones
    tones = extract_tones(phs)  # Extract tones from the original phs list
    word2ph = [1] + distributed_word2ph + [1]

    assert len(word2ph) == len(tokenized) + 2

    return phones, tones, word2ph

def extract_tones(phones):
    tones = []
    tone_map = {
        "˥": 5,  # Extra-high level
        "˦": 4,  # High level
        "˧": 3,  # Mid level
        "˨": 2,  # Low level
        "˩": 1,  # Extra-low level
    }

    for phone in phones:
        tone = 0
        for marker, value in tone_map.items():
            if marker in phone:
                tone = value
                break
        tones.append(tone)

    return tones

TL;DR: I've also updated the following test case:

def test_g2p():
    text = "ฉันรักเมืองไทย"
    normalized_text = text_normalize(text)
    phones, tones, word2ph = g2p(normalized_text)
    assert phones == ['_', 't͡ɕʰ', 'a', 'n', '', 'r', 'a', 'k̚', '', 'm', 'ɯa̯', 'ŋ', '', 'tʰ', 'aj', '', '.', 'j', 'a', '', '.', '_']
    assert tones == [0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 5, 0]
    assert word2ph == [1, 0, 8, 12, 1]

I think this output makes sense as the output is now similar to yours.

The phones list contains the phonemes corresponding to the input text, excluding the tone markers. The mapping of tone markers to numeric values seems accurate (4 for ˩˩˦, 5 for ˦˥, 3 for ˧).

The word2ph list represents the number of phonemes for each word in the tokenized input. The values correspond to the number of phonemes for each word:

1: Start-of-sequence token
0: No phonemes for the first token (likely punctuation or special symbol)
8: Number of phonemes for the second token ("ฉันรัก")
12: Number of phonemes for the third token ("เมืองไทย")
1: End-of-sequence token
jadechip commented 6 months ago

About the Thai symbols, the characters from line 266 to 339 are the characters of the Thai alphabet, including numbers. The remaining lines (340 - 406) were characters that I copied from the Wiktionary file (which I got from here https://github.com/PyThaiNLP/thai-g2p-wiktionary-corpus/tree/main), I am not sure if I should include them in this file (symbols.py) but if I remember correctly I was getting an error if I didn't include them.

jadechip commented 6 months ago

About lines 5908 to 5910 in the Wiktionary file, that is a good question. I am not sure which one is correct to be honest 🤔 Maybe I should try looking for a different Grapheme to Phoneme dictionary...

tchayintr commented 6 months ago

I appreciate your hard work. 🥇

One of my concerns is that most Thai G2P tools are either rule-based or seq2seq, and their phoneme formats vary (e.g., haas, IPA, etc.):

In case you missed some of them. 😄

While rule-based tools offer more precise conversions, they may not always provide results for some graphemes. Seq2seq tools, on the other hand, offer more flexible conversions, but their CER or PER is still considered high, IMO. Of course, these factors can reduce the smoothness in TTS.

I am concerned about the current state of Thai G2P and am trying to survey how we can address the challenges with Thai G2P.

jeremy110 commented 6 months ago
text: 禮          數
ipa: l e ˥ ˧      s ɔ ˨˩
phones: ['_', 'l', 'e',     's', 'ɔ', '_']
tones: [0, 2, 2,       3, 3, 0]
word2ph: [1, 2,      2, 1]

Perhaps I misled you a bit. Let me clarify using an example. In my case, '˥ ˧' corresponds to 2, and the syllable has two phones, 'l' and 'e', so the tones list gets two 2s. Likewise, '˨˩' corresponds to 3, and with two phones, 's' and 'ɔ', the tones list gets two 3s.

jeremy110 commented 6 months ago

Because I don't know Thai at all, I can't help with the G2P part. Sorry.

tchayintr commented 6 months ago

@jeremy110 Don't worry, this is not your fault at all!

We are here to discuss and find a solution.

I will keep you updated if I find something. @jadechip @jeremy110

BankNatchapol commented 6 months ago

This is my implementation (not optimized 😞) of @jeremy110's tones. I am also trying to change @jadechip's phonemizer, which sometimes returns characters rather than phonemes. https://gist.github.com/BankNatchapol/1276e34dcb51c521536978859dd948cd But the problem now is that the phonemizer is not that good. It usually gives repeated phonemes for some reason.

jadechip commented 6 months ago

@jeremy110 if I understand correctly, I am extracting the tones from the original phs list and assigning them to the phones in a one-to-one manner, but you are saying a single tone marker should be assigned to multiple phones based on the number of phones associated with that tone marker?

I've pushed some code that tries to solve this issue and added new test cases. However, I am still experiencing some strange behavior and I'm afraid I am reaching the limits of my knowledge of the Thai language as well 😔

jeremy110 commented 6 months ago

@jadechip I spent some time finding a few words from the Wiktionary file to use as examples because I noticed that the processing of Thai is different from mine, so let me give another example.

กง  k o ŋ ˧    -> 3 phones + 1 tone
ล้อ l ɔː ˦˥        -> 2 phones + 1 tone
กงล้อ   k o ŋ ˧ . l ɔː ˦˥ 

suppose tone map: ˧ -> 2,  ˦˥ -> 3
text: กงล้อ -> กง   ล้อ
phones: [ '_', 'k', 'o', 'ŋ',    'l', 'ɔː', '_']
tones: [ 1, 2, 2, 2,      3, 3, 1]  # 2 copied three times (3 phones), 3 copied twice (2 phones)

Here, I'll ignore the "." for now because I can't figure out what it represents.

BankNatchapol commented 6 months ago

"." is just a delimiter between syllables. For example, สวัสดี is pronounced สะ-หวัด-ดี, so its phonemes are sa˨˩ (สะ) . wat̚˨˩ (หวัด) . diː˧ (ดี).

jeremy110 commented 6 months ago

@BankNatchapol Thank you for your explanation. So is the example I gave correct?

Additional: If "." is just a delimiter for each syllable, then it does not need to be included in the phones, as it could be confused with the period in English and affect pauses during sentence segmentation.

tchayintr commented 6 months ago

Note that the phoneme column in the Wiktionary file includes some graphemes with two phonetic transcriptions separated by a comma. This can happen when a word has multiple accepted pronunciations or when the pronunciation can change based on context or regional accents.

For example, ไอดอล (ไอ=i, ดอล=dol) means "idol," where its phonemes include ʔaj˧.dɔl˥˩, ʔaj˧.dɔn˥˩ --> ʔaj˧.dɔl˥˩ (i-dol) or ʔaj˧.dɔn˥˩ (i-don).
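One possible way to handle such entries is sketched below. It assumes the phoneme column is a plain string whose variants are separated by commas. The helper name split_pronunciations is hypothetical (it is not part of the PR), and keeping the first variant is an arbitrary policy chosen only for illustration.

```python
def split_pronunciations(phoneme_field):
    """Split a comma-separated phoneme field into its pronunciation variants."""
    return [v.strip() for v in phoneme_field.split(",") if v.strip()]


variants = split_pronunciations("ʔaj˧.dɔl˥˩, ʔaj˧.dɔn˥˩")
primary = variants[0]  # arbitrary policy: keep the first listed pronunciation
print(variants)
print(primary)
```

Whether the first variant is the best choice for training likely depends on how the speakers in the dataset actually pronounce the word.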

jadechip commented 6 months ago

Thank you all for the insightful feedback. I have pushed some changes and added another test case:

def test_g2p():
    # Test case for the word "กงล้อ"
    text = "กงล้อ"
    normalized_text = text_normalize(text)
    phones, tones, word2ph = g2p(normalized_text)

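    # Expected output based on the wiktionary entry
    expected_phones = ['_', 'k', 'o', 'ŋ', 'l', 'ɔː', '_']
    expected_tones = [0, 2, 2, 2, 3, 3, 0]
    expected_word2ph = [1, 5, 1]

    # Compare the g2p output against the expected values
    assert phones == expected_phones
    assert tones == expected_tones
    assert word2ph == expected_word2ph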

The TLDR is I now separate the syllables based on ".", then extract the tones and assign values based on this map:

tone_map = {
    "˧": 2,  # Mid tone
    "˨˩": 1,  # Low tone
    "˦˥": 3,  # High tone
    "˩˩˦": 4,  # Rising tone
    "˥˩": 5,  # Falling tone
}

The tones list is calculated similarly to the method @jeremy110 described above. For the word "กงล้อ", we get the following phonemes: ['k', 'o', 'ŋ', '˧', '.', 'l', 'ɔː', '˦˥']. The first 3 phonemes ('k', 'o', 'ŋ') have the associated tone '˧', which maps to 2, and the remaining 2 phonemes ('l', 'ɔː') have the tone '˦˥', which maps to 3, so the resulting list is [0, 2, 2, 2, 3, 3, 0]. Note that the zeroes at the beginning and end represent the "_" special character.

Another note: if no tone marker is found in a group, a default tone value of 2 (mid tone) is assigned to all the phonemes in that group, i.e. these two have the same tone:

กง  k o ŋ
กง  k o ŋ ˧
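The tone assignment described above can be sketched roughly as follows. This is a standalone illustration, not the code from the PR: the function name expand_tones is hypothetical, the input is assumed to be already split into phoneme, tone, and "." tokens, and the numeric values are copied from the tone map in this comment.

```python
TONE_MAP = {"˧": 2, "˨˩": 1, "˦˥": 3, "˩˩˦": 4, "˥˩": 5}
DEFAULT_TONE = 2  # used when a syllable carries no tone marker


def expand_tones(tokens):
    """Return (phones, tones) with one tone id per phone, padded with '_'."""
    phones, tones = ["_"], [0]
    syllable, tone = [], DEFAULT_TONE
    # A trailing "." is appended so the last syllable is flushed like the others.
    for tok in list(tokens) + ["."]:
        if tok == ".":
            phones.extend(syllable)
            tones.extend([tone] * len(syllable))
            syllable, tone = [], DEFAULT_TONE
        elif tok in TONE_MAP:
            tone = TONE_MAP[tok]
        else:
            syllable.append(tok)
    phones.append("_")
    tones.append(0)
    return phones, tones


print(expand_tones(["k", "o", "ŋ", "˧", ".", "l", "ɔː", "˦˥"]))
print(expand_tones(["k", "o", "ŋ"]))  # no marker, so the default tone applies
```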

As for the word2ph list, it represents the number of phonemes for each word in the input text, including special tokens. So using our previous example, we get [1, 5, 1] where the ones are special characters and the 5 represents 'k', 'o', 'ŋ', 'l', 'ɔː'. Note that the word2ph list has a length equal to the number of words plus 2 (for the special tokens).

jadechip commented 6 months ago

I should note however that there are surely edge cases and I am not entirely sure if the get_bert_feature function is correct. As always, any feedback and support is very much appreciated 🙏 Happy weekend everyone.

jeremy110 commented 6 months ago

Happy weekend~ @jadechip, there are still some things that need to be modified:

    # Expected output based on the wiktionary entry
    expected_phones = ['_', 'k', 'o', 'ŋ', 'l', 'ɔː', '_']
    expected_tones = [0, 2, 2, 2, 3, 3, 0]
    expected_word2ph = [1, 3, 2, 1]  # modified: 3 = the 3 phones of koŋ, 2 = the 2 phones of lɔː

wannaphong commented 6 months ago

I have an open-source Thai TTS dataset: https://huggingface.co/datasets/lunarlist/edited_common_voice

BankNatchapol commented 6 months ago

Any recommendation for fixing G2P? The pythainlp.transliterate.transliterate("สามารถ", engine="thaig2p") usually gives repeated phonemes.

wannaphong commented 6 months ago

Can you try https://huggingface.co/pythainlp/thaig2p-v2.0?

tchayintr commented 6 months ago

The issue has become a super long discussion! 😃

@jadechip

I did a quick check and test on your commits (https://github.com/jadechip/MeloTTS/commit/ffd8f4174cd1ff5d4cbe2ad8bc5e923de8455e75#diff-8f6f83dc5d5f83888cfad03f6835561fd38ec675fd0b5a07b3911ed38d786487).

I hope there was no mistake from my environment.

I found that it could not pass the assertion assert inputs["input_ids"].shape[-1] == len(word2ph), f"{inputs['input_ids'].shape[-1]}/{len(word2ph)}"

I followed chinese_mix.py and chinese_bert.py. They seem to use mostly character-based tokens for Chinese and word-based tokens (as far as I know) for English, so extracting phonemes causes no problems because the result aligns well with the tokenizer from hfl/chinese-roberta-wwm-ext-large.

For Thai, we need to be careful about the size of inputs and word2ph since each Thai BERT-based tokenizer can yield different tokens.

For example,

text = "กงล้อ"     

# Expected output based on the wiktionary entry
    expected_phones = ['_', 'k', 'o', 'ŋ', 'l', 'ɔː', '_']
    expected_tones = [0, 2, 2, 2, 3, 3, 0]
    expected_word2ph = [1, 5, 1]

The tokenizer should tokenize "กงล้อ" into 1 token (+2 special tokens).

On the other hand,

# Expected output based on the wiktionary entry
    expected_phones = ['_', 'k', 'o', 'ŋ', 'l', 'ɔː', '_']
    expected_tones = [0, 2, 2, 2, 3, 3, 0]
    expected_word2ph = [1, 3, 2, 1]  # modified: 3 = the 3 phones of koŋ, 2 = the 2 phones of lɔː

The tokenizer should tokenize "กงล้อ" into 2 tokens, กง and ล้อ (+2 special tokens)

Ultimately, I think if we handle it properly (e.g., using the tokenizer before G2P), the get_bert_feature would be fine.
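To make the alignment requirement concrete, here is a minimal sketch that groups phones by token and fails when word2ph and the tokenizer output disagree. The function name align_phones and the special token strings "[CLS]" and "[SEP]" are placeholders; the real tokens depend on which Thai BERT tokenizer is used.

```python
def align_phones(tokens, phones, word2ph):
    """Pair each BERT token with its phones; raise if the counts disagree."""
    if len(tokens) != len(word2ph):
        raise ValueError(f"{len(tokens)}/{len(word2ph)}")
    if sum(word2ph) != len(phones):
        raise ValueError(f"word2ph sums to {sum(word2ph)}, got {len(phones)} phones")
    aligned, start = [], 0
    for token, count in zip(tokens, word2ph):
        aligned.append((token, phones[start:start + count]))
        start += count
    return aligned


phones = ["_", "k", "o", "ŋ", "l", "ɔː", "_"]
tokens = ["[CLS]", "กง", "ล้อ", "[SEP]"]  # placeholder special tokens
print(align_phones(tokens, phones, [1, 3, 2, 1]))
```

With the same four tokens, word2ph = [1, 5, 1] raises a ValueError, which mirrors the failing assertion in get_bert_feature.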

Also, I am trying to write a rule that properly combines initial, vowel, and final phones, as done in chinese_mix.py. The result will look like:

text = "Today ฉันกินข้าว Good เป็นไง"
expected_phones = ['_', 't', 'ah', 'd', 'ey', 'ch', 'an', 'k', 'in', 'kh', 'aaw', 'g', 'uh', 'd', 'p', 'en', 'N', 'aj', '_']
expected_tones = [0, 7, 8, 7, 9, 5, 5, 1, 1, 3, 3, 7, 9, 7, 1, 1, 1, 1, 0]
rev_expected_tones = [0, 7, 8, 7, 9, 19, 19, 15, 15, 17, 17, 7, 9, 7, 15, 15, 15, 15, 0]  # language_tone_start_map['TH']
expected_word2ph = [1, 4, 2, 2, 2, 3, 2, 2, 1]

This is just an example; please ignore the English tones here.
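The relation between expected_tones and rev_expected_tones in this example can be reproduced with a small sketch. The offset of 14 is inferred from the numbers above (5 becomes 19, 1 becomes 15); it is not read from symbols.py, and the helper name shift_thai_tones is hypothetical. The special "_" tokens and the English tones are left unchanged, as in the example.

```python
TH_TONE_START = 14  # inferred from the example above, not taken from symbols.py


def shift_thai_tones(tones, is_thai, start=TH_TONE_START):
    """Add the Thai tone offset to Thai phones only; leave other tones as-is."""
    return [t + start if thai else t for t, thai in zip(tones, is_thai)]


expected_tones = [0, 7, 8, 7, 9, 5, 5, 1, 1, 3, 3, 7, 9, 7, 1, 1, 1, 1, 0]
# One flag per phone: '_' (1), Today (4), ฉันกินข้าว (6), Good (3), เป็นไง (4), '_' (1)
is_thai = [False] + [False] * 4 + [True] * 6 + [False] * 3 + [True] * 4 + [False]
print(shift_thai_tones(expected_tones, is_thai))
```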

I still need to deal with the alignment of word2ph and tokens tokenized from BERT. If my method works, I will update.

Update:

I convert each word (or a syllable) from text into a token id for encoding. However, this is not the best solution since an unseen word will become an UNK token id. By the way, I use my Thai BERT tokenizer/encoder containing around 100k words and it has been pre-trained for Thai token classification. At least, it should lessen the unseen word issue a bit.

jadechip commented 6 months ago

Thank you all for your valuable feedback! It's great to see such active collaboration, showcasing the strength of the open source community 💪

Apologies @tchayintr, the assertion was indeed failing. I've pushed some code which should resolve the issue, and the output format should now be correct:

BERT tokens: ['▁', 'กง', 'ล้อ']
Aligning word: กงล้อ
Word phonemes: ['k', 'o', 'ŋ', 'l', 'ɔː']
Word tones: [2, 2, 2, 3, 3]
Final phs: ['_', 'k', 'o', 'ŋ', 'l', 'ɔː', '_']
Final tones: [1, 2, 2, 2, 3, 3, 1]
Final word2ph: [1, 3, 2, 1]

Thank you for highlighting your proposed approach. Please feel free to run some tests or make changes to my code. I am still not 100% confident that everything is working perfectly, but I am eager to try another training job soon to see if the quality has improved 😄

@BankNatchapol ~~do you mean we should replace the g2p function with another model? I am not sure how much better https://huggingface.co/pythainlp/thaig2p-v2.0 would be as it is trained on the same dictionary file that I am using (wiktionary-23-7-2022-clean.tsv). Or just let me know if my understanding is incorrect.~~ Ah, never mind. Rereading your reply, I realize you are talking about the pythainlp.tokenize word_tokenize(norm_text, engine="newmm") tokenizer.

tiebay004 commented 6 months ago

Test results of fine-tuning the Thai text-to-speech model https://www.youtube.com/watch?v=7sApLg5l2Ps

tchayintr commented 6 months ago

Thank you for the example, @tiebay004

May I know more details about the libraries or models you used? Coqui?

Honestly, sorry to mention it, but the result doesn't seem as smooth as other languages published on MeloTTS. I wonder if it is related to MeloTTS? 😭

We are looking forward to the day when the smoothness of Thai TTS is comparable to major languages. However, I feel MeloTTS might have a chance.

tiebay004 commented 6 months ago

Since I had never heard of MeloTTS before, I chose to fine-tune with Coqui. I will try fine-tuning with MeloTTS, but the training takes a long time, and I am using my son's PC for training because he has a GPU with high VRAM.

Since it's the school holidays in Thailand, I should be able to experiment more. As I haven't worked in data before, I might not be able to answer many technical questions about the model. I used to be a developer several years ago.

jadechip commented 5 months ago

Hello everyone. I've been a bit busy lately, but I've pushed some changes that I think should fix a lot of the outstanding issues. However, I am running into CUDA issues when trying to run the training code, which wasn't happening before.

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I think this means the code is trying to access elements of a tensor that are outside its valid range, but I am not sure exactly where this is happening. Using pdb, I was able to narrow it down to the forward method of the TextEncoder class in models.py. Specifically, the line self.language_emb(language) seems to trigger the error, but I am still not sure why, as the language tensor here has a shape similar to that of the input x.
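A device-side assert at an embedding lookup is commonly caused by an id that is greater than or equal to the number of rows in the embedding table. Assuming that is the cause here, a CPU-side check like the sketch below can point at the offending values before the batch reaches the GPU. The function name and the sizes are hypothetical and only illustrate the idea.

```python
def out_of_range(ids, table_size):
    """Return (position, id) pairs that a table with `table_size` rows cannot index."""
    return [(pos, i) for pos, i in enumerate(ids) if not 0 <= i < table_size]


# Hypothetical values for illustration: a table with 5 rows only accepts ids 0..4
num_tones = 5
tone_ids = [1, 5, 5, 1]
print(out_of_range(tone_ids, num_tones))
```

The same check applies to language ids and phone ids against the sizes of their respective embedding tables.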

For added context, this is what my train.list file looks like right now.

/workspace/commonvoice/common_voice_th_25686299.wav|TH-default|TH|อนาคตของการทำงานคือมนุษย์หุ่นยนต์ที่เพิ่มขึ้น|_ ʔ a n aː kʰ o t̚ kʰ ɔː ŋ ก า ร ท ำ ง า น kʰ ɯː m a n u t̚ h u n j o n tʰ iː เ พ ิ ่ ม ข ึ ้ น _|1 1 1 2 2 3 3 3 4 4 4 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 1 1 1 2 2 2 5 5 2 2 2 2 2 2 2 2 2 1|1 10 8 2 5 6 2 9 1
/workspace/commonvoice/common_voice_th_26600610.wav|TH-default|TH|ผู้ประกอบอาชีพ|_ pʰ uː p r a k ɔː p̚ ʔ aː t͡ɕʰ iː p̚ _|1 5 5 1 1 1 1 1 1 2 2 5 5 5 1|1 8 5 1
/workspace/commonvoice/common_voice_th_25686917.wav|TH-default|TH|ฉันหวังว่าทักษะของคุณจะเกาขึ้น|_ t͡ɕʰ a n w a ŋ ว ่ า tʰ a k̚ s aʔ kʰ ɔː ŋ kʰ u n c aʔ k eː า kʰ ɯ n _|1 4 4 4 4 4 4 2 2 2 3 3 3 1 1 4 4 4 2 2 2 2 2 2 2 2 5 5 5 1|1 8 1 5 6 2 2 1 3 1
/workspace/commonvoice/common_voice_th_26665439.wav|TH-default|TH|อันหนึ่งอันเดียวกัน|_ ʔ a n n ɯ ŋ ʔ a n d ia̯w k a n _|1 2 2 2 1 1 1 2 2 2 2 2 2 2 2 1|1 6 3 5 1

jadechip commented 5 months ago

@jeremy110 should I update the number of tones in symbols.py to 5, since Thai has 5 tones? It is currently num_th_tones = 1. I am asking because most other languages have it set to 1, so I am not sure whether it has any impact.

jeremy110 commented 5 months ago

Hello @jadechip Yes, you should update your number of tones to 5.

There are some issues in your train.list:

  1. The _ at the beginning and end should correspond to tone 0.
  2. The tones and word2ph do not match up with our previous discussion. To illustrate with my example: three 8's, three 1's, three 2's, two 7's, one 5 ...
    /path/audio.wav|F1|TAI|一杯走味的咖啡,|_ ʦ i t p u e ʦ a u b i e k a p i , _|0 8 8 8 1 1 1 2 2 2 7 7 5 1 1 1 1 0 0|1 3 3 3 2 1 2 2 1 1
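The two points above can be checked mechanically. The sketch below assumes the seven-field, pipe-separated layout shown in the example line (path, speaker, language, text, phones, tones, word2ph); the function name check_train_line is hypothetical and not part of MeloTTS.

```python
def check_train_line(line):
    """Return a list of consistency problems found in one train.list line."""
    _path, _speaker, _lang, _text, phones, tones, word2ph = line.rstrip("\n").split("|")
    phones = phones.split()
    tones = [int(t) for t in tones.split()]
    word2ph = [int(n) for n in word2ph.split()]

    problems = []
    if len(phones) != len(tones):
        problems.append(f"{len(phones)} phones but {len(tones)} tones")
    if sum(word2ph) != len(phones):
        problems.append(f"word2ph sums to {sum(word2ph)} but there are {len(phones)} phones")
    # The '_' padding symbol is expected to carry tone 0.
    for pos, (phone, tone) in enumerate(zip(phones, tones)):
        if phone == "_" and tone != 0:
            problems.append(f"'_' at position {pos} has tone {tone}, expected 0")
    return problems


example = ("/path/audio.wav|F1|TAI|一杯走味的咖啡,|_ ʦ i t p u e ʦ a u b i e k a p i , _"
           "|0 8 8 8 1 1 1 2 2 2 7 7 5 1 1 1 1 0 0|1 3 3 3 2 1 2 2 1 1")
print(check_train_line(example))
```

Running it over every line of train.list before training should surface lines where '_' carries a non-zero tone or where the three sequences disagree in length.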