roedoejet / FastSpeech2

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"
MIT License

Error on adding a new language #4

Closed hadarishav closed 1 year ago

hadarishav commented 1 year ago

Hi, thanks for your great work.

I am using your code base to train the system for a low-resource Indian language written in the Devanagari script.

I follow all the steps you mention in “adding a new language”.

I build my own lexicon from the data and convert it to English Arpabet using G2P.

The lexicon looks like:

I use this lexicon to validate and generate TextGrids with MFA.

When I run the FastSpeech2 training script, I get the following error:

Training:   0%|          | 0/30000 [00:00<?, ?it/s]
Prepare training ...
  0%|          | 0/65 [00:00<?, ?it/s]
Number of FastSpeech2 Parameters: 35076161
Removing weight norm...
Traceback (most recent call last):
  File "/home/t-rishavhada/mundari/FastSpeech2/train.py", line 259, in <module>
    main(args, configs)
  File "/home/t-rishavhada/mundari/FastSpeech2/train.py", line 97, in main
    output = model(*(batch[2:]))
  File "/anaconda/envs/mundari/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/anaconda/envs/mundari/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/anaconda/envs/mundari/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/t-rishavhada/mundari/FastSpeech2/model/fastspeech2.py", line 116, in forward
    output = self.encoder(texts, src_masks)
  File "/anaconda/envs/mundari/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/t-rishavhada/mundari/FastSpeech2/transformer/Models.py", line 116, in forward
    enc_output = self.src_word_emb(src_seq) + self.position_enc[
  File "/anaconda/envs/mundari/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/anaconda/envs/mundari/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (16x2 and 36x256)

I think the error originates from the symbols.py script.

I added my language there in MAPPINGS and other places:

MAPPINGS = {
    "git": {"norm": make_g2p("git", "git-equiv"), "ipa": make_g2p("git", "git-ipa")},
    "moh": {"norm": make_g2p("moh", "moh-equiv"), "ipa": make_g2p("moh", "moh-ipa")},
    "str": {"norm": make_g2p("str", "str-equiv"), "ipa": make_g2p("str", "str-ipa")},
    "hin": {"hipa": make_g2p("hin", "hin-ipa"), "ipa": make_g2p("hin-ipa", "eng-ipa")},

}

I am unable to figure out the problem. Do you have any suggestions?

Thanks in advance.

roedoejet commented 1 year ago

Hi @hadarishav - very cool! So have you forked the https://github.com/roedoejet/g2p library and made your own mappings for hin -> hin-ipa? Or are you using some other g2p library? Is the hin-ipa to eng-ipa actually going to English Arpabet or to English IPA symbols?

If you are training with phonological features (in config/model.yaml this would be transformer.spe_features=true), then your mapping will need to convert to IPA and not Arpabet, as panphon uses IPA. You could instead set this to false and train on one-hot embeddings of the Arpabet representation.
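
For instance, a quick way to see which symbols panphon can actually featurize is a check like the minimal sketch below (this assumes panphon is installed; the example symbols are arbitrary):

import panphon

ft = panphon.FeatureTable()
# Symbols that panphon cannot segment as IPA yield no feature vectors,
# which is why an Arpabet-based lexicon won't work with spe_features=true.
for sym in ["AH0", "tʃ", "ɪ"]:
    vectors = ft.word_to_vector_list(sym, numeric=True)
    print(sym, "->", len(vectors), "feature vector(s)")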

Can you provide a link to your full symbols.py, or, even better, to your cloned version? You're right that it looks like the size of the text input tensor is not right.

roedoejet commented 1 year ago

Looking at the default config, I realized it's actually not optimal for easy adaptation to new languages, so I've updated it here: https://github.com/roedoejet/FastSpeech2/commit/b1140532d927d77cd707a3c8da624b4cf9db9a1e. You could try pulling this change, or making the same changes in your config/model.yaml, and training again, unless you really want to train with phonological features instead of character embeddings, in which case you'll have to follow my notes above. Let me know how it goes!

hadarishav commented 1 year ago

Thanks for your response.

Yes, I use https://github.com/roedoejet/g2p and add my hin to hin-ipa mappings.

The hin-ipa to eng-ipa mapping goes to English IPA symbols. Note, however, that the lexicon I created is from hin to English Arpabet, following the sample you provided.

Here’s a link to the cloned repo.

I made the changes in the config and tried with both spe_features=True and False. The training script still does not work.

I think there's something wrong in how I am adding the symbols in the symbols.py file.

roedoejet commented 1 year ago

Thanks for this - it's helpful to see your repo. So, the symbols that you use to do forced alignment must be the same symbols that you give as input to the model. For languages other than English, I actually recommend not using Arpabet and just using your language's IPA. To make the changes necessary in your repo:

  1. Change your lexicon to use the hin to hin-ipa mapping. For example, mine (for another language) looks like this (i.e. it uses IPA, not Arpabet):
g̲aniwila ɢ æ n ɪ w ɪ l æ
hagun h æ ɟ u n
  2. Run MFA on the data using a lexicon built from your hin to hin-ipa mapping.
  3. Change your MAPPINGS constant in symbols.py to the following (i.e. to the same hin to hin-ipa mapping):
MAPPINGS = {
    "git": {"norm": make_g2p("git", "git-equiv"), "ipa": make_g2p("git", "git-ipa")},
    "moh": {"norm": make_g2p("moh", "moh-equiv"), "ipa": make_g2p("moh", "moh-ipa")},
    "str": {"norm": make_g2p("str", "str-equiv"), "ipa": make_g2p("str", "str-ipa")},
    "hin": {"ipa": make_g2p("hin", "hin-ipa")},
}
  4. Change the CHARS constant in symbols.py to: CHARS = SYMBOLS['hin']
  5. Preprocess again and make sure your train.txt and val.txt files are using the same hin-ipa symbols. For example, my file looks like this (basename|language|speaker|symbols|raw text):
git0095|git|ap-git|{cʼ ɪ jˀ ɬ c æ p sp æ ɬ χ s b ɪ d ɪ s u m æ c s}|k'i'yhl kap ahl x̲sbidixsxum aks.
git0162|git|ap-git|{nˀ ɪ ɬ ɢ æ b iː t}|'nihl g̲abiit.

I would still set spe_features=False; otherwise, you might need to adjust the features.py file to suit your language.

These changes will make it so that you are using the same symbol set throughout the model. Also, just a heads-up: there is an expectation that you will have a normalization mapping (MAPPINGS['hin']['norm']), but if you don't, you will need to remove lines 105 and 109 in synthesize.py (the ones that reference g2p_norm) in order to do inference with a trained model later on.

hadarishav commented 1 year ago

This is great. Thanks a lot for going through my repo and taking the time to write this. I will try this out and hopefully it'll work.

hadarishav commented 1 year ago

Thanks for the suggestions. I made the changes, and it seems like the encoder is working now, but I am getting another error:

  File "/home/t-rishavhada/mundari/FastSpeech2/train.py", line 259, in <module>
    main(args, configs)
  File "/home/t-rishavhada/mundari/FastSpeech2/train.py", line 97, in main
    output = model(*(batch[2:]))
  File "/anaconda/envs/mundari/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/anaconda/envs/mundari/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/anaconda/envs/mundari/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/t-rishavhada/mundari/FastSpeech2/model/fastspeech2.py", line 149, in forward
    ) = self.variance_adaptor(
  File "/anaconda/envs/mundari/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/t-rishavhada/mundari/FastSpeech2/model/modules.py", line 140, in forward
    x = x + pitch_embedding
RuntimeError: The size of tensor a (367) must match the size of tensor b (368) at non-singleton dimension 1

Would you know what's leading to the size mismatch and how I can fix it?

roedoejet commented 1 year ago

Yes, that is usually an error caused by a character getting skipped. That is, there is likely a character that was present in your alignment process but was not defined in your symbols or mapping. I recommend either writing a script to check your data for the mismatch, or setting a try/except with a breakpoint at that spot and then inspecting the filename. Does that make sense? Sorry, I'm not at my computer and it's hard to type code snippets on my phone 😅
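
For illustration, such a check could look roughly like the sketch below (a sketch only; it assumes train.txt lines follow the basename|language|speaker|{symbols}|raw text format mentioned above, and the import path and train.txt location are assumptions to adjust for your setup):

from text.symbols import CHARS  # assumed import path; adjust to wherever symbols.py lives

known = set(CHARS)
missing = {}
# The train.txt path below is an assumption; point it at your preprocessed data.
with open("preprocessed_data/hin/train.txt", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        basename, lang, speaker, symbols, _raw = line.strip().split("|", 4)
        for sym in symbols.strip("{}").split():
            if sym not in known:
                missing.setdefault(sym, []).append(basename)

for sym, files in missing.items():
    print(f"{sym!r} missing from CHARS; first seen in {files[0]} ({len(files)} file(s))")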

roedoejet commented 1 year ago

Note that a common culprit is punctuation or incorrect Unicode normalization, which you might want to handle in your cleaners if you use a script that has many NFC/NFD Unicode ambiguities.
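
As a minimal sketch of what that could look like (a hypothetical helper, not code from this repo), a cleaner might simply NFC-normalize the text before anything else:

from unicodedata import normalize

def normalize_unicode(text: str) -> str:
    # Collapse NFC/NFD variants (a precomposed character vs. a base character
    # plus combining marks) into one canonical form so the same grapheme
    # always maps to the same symbol during preprocessing and alignment.
    return normalize("NFC", text)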

hadarishav commented 1 year ago

Thanks! I wrote a script to check that all characters present during the alignment process are also present in the symbols and/or the mappings. I also set a breakpoint at the spot to inspect the filenames causing the problem. All the characters in those files are present in the symbols or mapping. Maybe I am missing something; do you have any other ideas?

roedoejet commented 1 year ago

Could the problem be with tokenization? Is it possible to send me the text for the entry in your file list that's causing the problem, along with the preprocessed text and pitch tensors and a link to your g2p mapping? Sorry this isn't easier. We are working on a model that logs all of these types of common problems in a way that's easier to debug.

hadarishav commented 1 year ago

Thanks Aidan! I have emailed you the files.

roedoejet commented 1 year ago

Update for anybody who happens to come across this post:

We changed the loop that generates the symbol set from g2p to use c['out'].strip() instead of just c['out'], since on my version there appeared to be some extra whitespace in some of the symbols:

# In symbols.py: build the per-language IPA symbol set from the g2p mappings,
# stripping stray whitespace and NFC-normalizing each symbol.
for lang in MAPPINGS.keys():
    if isinstance(MAPPINGS[lang]['ipa'], CompositeTransducer):
        # For composite mappings, take the output side of the final transducer
        chars = MAPPINGS[lang]["ipa"]._transducers[-1].mapping.mapping
    else:
        chars = MAPPINGS[lang]["ipa"].mapping.mapping
    IPA[lang] = [normalize('NFC', c['out'].strip()) for c in chars]
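
To double-check the result, printing the generated set makes leftover whitespace or duplicate entries easy to spot (a quick sketch):

# Inspect the symbol set for a language to catch empty strings, duplicates,
# or symbols that still contain whitespace.
for sym in sorted(set(IPA["hin"])):
    print(repr(sym))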

This made the model train for longer (it used to fail on the first epoch), but we are still getting errors:

[screenshot of the error]

I'm bringing this conversation back on to GitHub instead of email because I imagine other people might run into the same or similar issues.

@hadarishav, I would be curious to see the tensorboard logs for this. I'm going to guess it's something related to this issue that Christoph Minixhofer does a good job of describing here: https://twitter.com/cdminix/status/1501148854560903170. It's a known bug of the original FastSpeech2 implementation that has not been fixed, which is itself a bug in the https://pypi.org/project/tgt/ library for processing TextGrid files. I believe https://github.com/dopefishh/pympi does not have the same problem, so a possible solution would be to swap the libraries, although I'm not sure I will have time to do this for a while. I'm guessing I never ran into this problem because my data was all fairly short utterances without much silence. Another possible fix, which some people have talked about, is splitting your dataset on silences into smaller segments, but this seems annoying to do. Out of curiosity, what is the average length of your utterances, and what is the longest one?
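
If it helps, utterance lengths can be computed quickly with something like this (a sketch using soundfile; the glob pattern is an assumption, so adjust it to your corpus layout):

import glob
import soundfile as sf

# Average and maximum utterance length, in seconds, over a set of wav files.
durations = [sf.info(path).duration for path in glob.glob("raw_data/hin/**/*.wav", recursive=True)]
print(f"{len(durations)} files, avg {sum(durations) / len(durations):.1f} s, max {max(durations):.1f} s")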

It looks like this is happening for other people too: see https://github.com/ming024/FastSpeech2/issues/176 (and a number of other similar issues).

roedoejet commented 1 year ago

OK - I actually just tried to fix this here: https://github.com/roedoejet/FastSpeech2/commit/59e041695eeeac43cb793d336a365bbe8a0bb8f2. I wonder whether it would work if you pull in the changes from that branch (while still keeping your symbols.py and all other language-specific things the same) and then run preprocessing again (no need to train MFA again). This should handle the silences in your dataset at least...

hadarishav commented 1 year ago

Hi, thanks for the suggestions and sorry for the late reply. I made the changes you did, but I am still running into the same problem. The max length of an utterance is 119 seconds and the average length is 8 seconds. I can't really understand why the error would come up only in later epochs: one epoch is one full pass over the data, so if one epoch is successful, shouldn't all epochs be successful? I will send you the log file via email (I'm unable to attach it here).

roedoejet commented 1 year ago

Hi @hadarishav

I've been away on holiday but am taking another look at this now. Your log looked kind of like this:

Step 100/30000, Total Loss: 9.1719, Mel Loss: 3.6087, Mel PostNet Loss: 3.1434, Pitch Loss: 1.7511, Duration Loss: 0.0000
Step 200/30000, Total Loss: 6.9898, Mel Loss: 2.9933, Mel PostNet Loss: 2.5558, Pitch Loss: 1.0089, Duration Loss: 0.0000
Step 300/30000, Total Loss: 6.2537, Mel Loss: 2.3606, Mel PostNet Loss: 2.1579, Pitch Loss: 1.4273, Duration Loss: 0.0000
Step 400/30000, Total Loss: 4.7442, Mel Loss: 1.8938, Mel PostNet Loss: 1.6926, Pitch Loss: 0.9366, Duration Loss: 0.0000

That log suspiciously has a duration loss of 0.0000 the whole time. Here's a sample from running it for 400 steps with my 2-hour Mohawk dataset:

Step 100/300000, Total Loss: 12.3155, Mel Loss: 3.5993, Mel PostNet Loss: 3.0231, Pitch Loss: 5.0587, Duration Loss: 0.6344
Step 200/300000, Total Loss: 7.8056, Mel Loss: 3.1922, Mel PostNet Loss: 2.8202, Pitch Loss: 1.4411, Duration Loss: 0.3521
Step 300/300000, Total Loss: 7.2452, Mel Loss: 1.9221, Mel PostNet Loss: 1.6909, Pitch Loss: 3.3842, Duration Loss: 0.2480
Step 400/300000, Total Loss: 4.4613, Mel Loss: 1.5643, Mel PostNet Loss: 1.4003, Pitch Loss: 1.2436, Duration Loss: 0.2531

If the duration loss isn't being calculated properly, the weights might be updated in a way that makes training unstable. You might be able to get through one epoch before this causes problems, or you might not, depending on the learning rate etc. So I think there is some problem with extracting the durations from your TextGrids. Your preprocessed duration files should be in preprocessed_data/<dataset>/duration; they should be .npy files, each containing a single numpy array whose length equals the length of your text input. The values should be long integers, for example: array([ 6, 5, 13, 8, 9, 9, 16, 7, 14, 26, 6, 3, 6, 7, 7, 8, 8, 3, 4, 6, 3, 4, 6, 8, 9, 2, 6, 13, 6, 12, 14]).
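
A quick way to sanity-check one of those files is to load it and look at its dtype, length, and total frame count (a rough sketch; the file name below is hypothetical, so substitute one of your own duration files):

import numpy as np

# Load one preprocessed duration file and check that it looks sane.
durations = np.load("preprocessed_data/hin/duration/speaker-duration-utt0001.npy")
print("dtype:", durations.dtype)        # expect an integer dtype
print("num symbols:", len(durations))   # should equal the length of the text input
print("total frames:", durations.sum()) # should match the mel spectrogram length
print("zero-length phones:", int((durations == 0).sum()))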

hadarishav commented 1 year ago

Hi @roedoejet, thanks for all your input, really appreciate it. So far I was using a publicly available Hindi dataset, which was giving these issues. We collected data in a low-resource Indian language, and on that data the code works absolutely fine. I will close the issue now. Thanks again!

roedoejet commented 1 year ago

No problem @hadarishav - glad to hear it! I'm developing a TTS toolkit that will hopefully make the process a lot easier (and sound better too!) so I'll let you know when that's public. In the meantime, don't hesitate to reach out if anything else comes up. Cheers.