roedoejet / FastSpeech2

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"
MIT License

Error adding a new language (tensor size not matched) #5

Closed jzhu709 closed 1 year ago

jzhu709 commented 1 year ago

Hi Aidan, I love the system!

I had successfully run the program back when I didn't have a complete lexicon (although it didn't produce a good result...). Now that I have access to the lexicon (which could do with some cleaning), I'm running into problems with training. I'm getting a `RuntimeError: The size of tensor a (66) must match the size of tensor b (67) at non-singleton dimension 1`. I read that this is usually due to missing symbols, but I think the symbols should all be there... (unless it is caused by the lack of cleaning, which I will do when I can). I also noticed I have a similar problem to https://github.com/roedoejet/FastSpeech2/issues/4#issue-1463589714, where the duration loss is stuck at 0.

What I've done:

  1. Put the .lab and .wav files in the raw_data folder.
  2. Created a lexicon in IPA using your G2P library.
  3. Ran MFA on the raw_data and the lexicon (I also noticed that when running MFA I was getting messages such as "no files were aligned, this likely indicates serious problems with the aligner").
  4. Updated the symbols file following https://github.com/roedoejet/FastSpeech2/issues/4#issuecomment-1327102443 (I didn't update the cleaners file, since it looks like using your G2P library covers the cleaning as it loops through?).
  5. Ran preprocess.py
  6. Ran train.py

Do you know a fix for this?

Thanks.

roedoejet commented 1 year ago

Hi @jzhu709 - thanks for this clear issue report! Unfortunately, as I mentioned in the email, I'm now away until the beginning of June, so I won't be able to help out until then. However, if you have double-checked the symbol sets, then my guess is that there are issues in your data around alignment (i.e. mismatches between the audio and your transcription).

This is kind of hard to diagnose when you don't have an aligner that is working for your language, but I have a couple of suggestions:

First, remove silence from your audio before training the aligner (MFA) and FastSpeech2. You will need to play with this a bit to make sure it doesn't cut too much out of the recording, but installing SoX and running `sox in.wav out.wav silence 1 0.1 1% -1 0.3 1%` on your audio is a good start. What this basically says is: "at the beginning of words, remove silence at a 1% threshold up to 0.1s before the speech starts; elsewhere, trim silence after 0.3 seconds of <1% silence".
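
If it helps, here's a rough, untested sketch of running that sox command over a whole folder from Python. The folder names are just placeholders, and sox needs to be installed and on your PATH:

```python
# Batch version of the sox silence-trim command above (sketch only).
# "raw_data" and "raw_data_trimmed" are placeholder folder names.
import subprocess
from pathlib import Path

in_dir = Path("raw_data")
out_dir = Path("raw_data_trimmed")
out_dir.mkdir(exist_ok=True)

for wav in sorted(in_dir.glob("*.wav")):
    subprocess.run(
        ["sox", str(wav), str(out_dir / wav.name),
         "silence", "1", "0.1", "1%", "-1", "0.3", "1%"],
        check=True,
    )
```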

Second, for each sample in your data, count the number of words/characters and measure the length of the corresponding audio (after you have removed silence). Then calculate the mean and standard deviation of this ratio and throw away any outliers, which are often the result of the transcription being off (for example, it's unlikely that a transcription claiming a 0.5s audio clip contains 20 words is correct).
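
As a starting point, something like this untested sketch would flag the outliers. It assumes each .wav in raw_data has a matching .lab transcription next to it, and uses a cutoff of two standard deviations, which you should tune for your data:

```python
# Sketch: flag samples whose words-per-second ratio is far from the corpus mean.
import statistics
import wave
from pathlib import Path

ratios = {}
for wav_path in sorted(Path("raw_data").glob("*.wav")):
    lab_path = wav_path.with_suffix(".lab")
    if not lab_path.exists():
        continue
    with wave.open(str(wav_path)) as w:
        duration = w.getnframes() / w.getframerate()
    words = len(lab_path.read_text(encoding="utf8").split())
    if duration > 0:
        ratios[wav_path.name] = words / duration

mean = statistics.mean(ratios.values())
stdev = statistics.stdev(ratios.values())
for name, ratio in ratios.items():
    if abs(ratio - mean) > 2 * stdev:  # arbitrary cutoff; adjust as needed
        print(f"possible transcription mismatch: {name} ({ratio:.1f} words/s)")
```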

Try running MFA and FS2 again with these potential problem datapoints removed and see if you still have the same issue. As a last resort you could try setting a breakpoint:

```python
try:
    x = x + pitch_embedding
except RuntimeError:
    # drop into the debugger to inspect the shapes of x and pitch_embedding
    breakpoint()
```

and then inspecting the data that caused it to fail for any irregular-looking values.

Anyways, I hope this helps, good luck, and I'll check back in in June to see if there are any updates here.

roedoejet commented 1 year ago

Hi @jzhu709 - any update? If you fixed it, I'll close this, but if you could let me know what the issue was that would be great. Thanks!

jzhu709 commented 1 year ago

Yep, we can close the issue! The tensor mismatch was due to MFA generating fewer TextGrids than there were files in the corpus. Making sure the number of files in raw_data matched the number in the TextGrid folder solved this.
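
In case it's useful to anyone else, a quick (untested) check along these lines shows which corpus files are missing TextGrids; the folder names are just what I used locally:

```python
# Sketch: list corpus files that MFA did not produce a TextGrid for.
from pathlib import Path

wav_stems = {p.stem for p in Path("raw_data").rglob("*.wav")}
tg_stems = {p.stem for p in Path("TextGrid").rglob("*.TextGrid")}

missing = sorted(wav_stems - tg_stems)
print(f"{len(missing)} of {len(wav_stems)} files have no TextGrid")
for stem in missing:
    print(stem)
```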

I also played around with the config files to fix the duration loss being stuck at 0: in model.yaml I changed decoder_layer to 6, set spe_features and use_energy_predictor to true, and set depthwise_convolutions to false. I also changed use_spe_features to true in preprocess.yaml and decreased val_size to a much smaller number since I have a small amount of data (the default value of 512 is so high for us low-resource language users!).

I have no idea what settings I'm playing around with but it seemed to get the job done 😄

roedoejet commented 1 year ago

Awesome, thanks @jzhu709 !