Closed: jzhu709 closed this issue 1 year ago.
Hi @jzhu709 - thanks for this clear issue report! Unfortunately, as I mentioned in the email, I'm away until the beginning of June, so I won't be able to help out until then. However, if you have double-checked the symbol sets, then my guess is that there are issues in your data around alignment (i.e., mismatches between the audio and your transcription).
This is kind of hard to diagnose when you don't have an aligner that is working for your language, but I have a couple of suggestions:
First, remove silence from your audio before training the aligner (MFA) and FastSpeech2. You will need to experiment with this a bit to make sure it doesn't cut too much out of the recordings, but installing SoX and running the following command on your audio is a good start:

`sox in.wav out.wav silence 1 0.1 1% -1 0.3 1%`

Roughly, this says: "remove leading silence until 0.1s of audio above a 1% threshold is detected, then trim any later stretch of silence longer than 0.3s, again at the 1% threshold".
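To apply that trimming to a whole corpus, you could script it. Below is a minimal sketch that builds one sox command per `.wav` file in a directory; the directory layout, output naming, and the `build_sox_commands` helper are my assumptions, not part of the project, while the sox arguments are exactly the ones above.

```python
from pathlib import Path

def build_sox_commands(wav_dir, out_dir):
    """Return one sox silence-trimming command (as an argv list) per .wav file."""
    out_dir = Path(out_dir)
    commands = []
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        commands.append([
            "sox", str(wav), str(out_dir / wav.name),
            # same silence parameters as the command suggested above
            "silence", "1", "0.1", "1%", "-1", "0.3", "1%",
        ])
    return commands
```

Each returned command can then be executed with `subprocess.run(cmd, check=True)`, which lets you spot-check a few trimmed files before committing to the settings.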
Second, for each sample in your data, count the number of words (or characters) and measure the length of the corresponding audio (after you have removed silence). Then calculate the mean and standard deviation of this ratio and throw away any outliers, which are often the result of the transcription being off; for example, it's unlikely that a transcription claiming a 0.5s audio clip contains 20 words is correct.
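The outlier filter described above can be sketched in a few lines. This is a minimal illustration, not project code: the `flag_outliers` helper, the sample data, and the 2-sigma cutoff are all my assumptions.

```python
import statistics

def flag_outliers(samples, n_sigma=2.0):
    """samples: list of (sample_id, n_words, audio_seconds) tuples.

    Returns the IDs whose words-per-second ratio deviates from the mean
    by more than n_sigma standard deviations.
    """
    ratios = [n_words / seconds for _, n_words, seconds in samples]
    mean = statistics.mean(ratios)
    stdev = statistics.pstdev(ratios)
    flagged = []
    for (sample_id, _, _), ratio in zip(samples, ratios):
        if stdev > 0 and abs(ratio - mean) > n_sigma * stdev:
            flagged.append(sample_id)
    return flagged

samples = [
    ("utt01", 12, 4.0),   # 3 words/s
    ("utt02", 9, 3.0),
    ("utt03", 15, 5.0),
    ("utt04", 6, 2.0),
    ("utt05", 9, 3.0),
    ("utt06", 12, 4.0),
    ("utt07", 15, 5.0),
    ("utt08", 20, 0.5),   # 40 words/s: almost certainly a bad transcript
]
print(flag_outliers(samples))  # -> ['utt08']
```

With very small or very skewed corpora a mean/stdev cutoff can be unstable, so it is worth eyeballing the flagged files rather than deleting them automatically.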
Try running MFA and FS2 again with these potential problem datapoints removed and see if you still have the same issue. As a last resort you could try setting a breakpoint:
```python
try:
    x = x + pitch_embedding
except RuntimeError:
    # drop into the debugger to inspect x.shape and pitch_embedding.shape
    breakpoint()
```
and then inspecting the data that caused the failure for any irregular-looking values.
Anyway, I hope this helps. Good luck, and I'll check back in June to see if there are any updates here.
Hi @jzhu709 - any update? If you fixed it, I'll close this, but if you could let me know what the issue was that would be great. Thanks!
Yep, we can close the issue! The tensor mismatch was due to MFA generating fewer TextGrids than there were files in the input corpus. Making sure the number of files in raw_data matched the TextGrid folder solved this.
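That file-count check can be automated. Here is a minimal sketch, assuming the utterances in raw_data are `.wav` files and MFA's output directory holds one `.TextGrid` per utterance with matching stems; the `missing_textgrids` helper is mine, not part of the repo.

```python
from pathlib import Path

def missing_textgrids(raw_data_dir, textgrid_dir):
    """Return utterance stems that have a .wav in raw_data but no .TextGrid."""
    wavs = {p.stem for p in Path(raw_data_dir).glob("*.wav")}
    grids = {p.stem for p in Path(textgrid_dir).glob("*.TextGrid")}
    return sorted(wavs - grids)  # utterances MFA failed to align
```

Any stems this returns are utterances MFA silently skipped (often because of out-of-lexicon words), and removing or fixing them before preprocessing avoids the tensor-size mismatch.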
I also played around with the config files to fix the duration loss being stuck at 0: in model.yaml I changed decoder_layer to 6, set spe_features and use_energy_predictor to true, and set depthwise_convolutions to false. I also changed use_spe_features to true in preprocess.yaml and decreased val_size to a much smaller number, since I have a small amount of data. (The default value of 512 is very high for us low-resource language users!)
I have no idea what settings I'm playing around with but it seemed to get the job done 😄
Awesome, thanks @jzhu709 !
Hi Aidan, I love the system!
I had successfully run the program back when I didn't have a complete lexicon (although it didn't produce a good result...). Now that I have access to the lexicon (which could do with some cleaning), I'm running into problems with training. I'm getting:

```
RuntimeError: The size of tensor a (66) must match the size of tensor b (67) at non-singleton dimension 1
```

I read that this can be due to missing symbols, but I think the symbols should all be there... (unless it is due to the lack of cleaning, which I will do when I can). I also noticed I have a similar problem to https://github.com/roedoejet/FastSpeech2/issues/4#issue-1463589714, where the duration loss is stuck at 0.

What I've done:
Do you know a fix for this?
Thanks.