ming024 / FastSpeech2

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"
MIT License

MFA version #225

Closed shreeshailgan closed 2 months ago

shreeshailgan commented 4 months ago

Hey @ming024, Could you specify the MFA version you used to generate the textgrids you have provided in your repo? Also, did you generate those textgrids by just aligning using a pre-existing acoustic model or by using the train-and-align step on the dataset itself?

Asking because I've been using the latest MFA version (3.0.0), and the TextGrid outputs I'm getting have alignment errors compared to the TextGrids you have provided. This is also causing problems in training: the model I trained using your provided TextGrids works fine, but the model I trained using my own generated TextGrids does not. Its synthesized audio is fine for the first 2-3 seconds, then degrades very quickly.

Thanks.

asarsembayev commented 2 months ago

Hi!

I'm also interested in this question. I've been training FS2 on a custom dataset. There's a pretrained MFA acoustic model for the language I've been training on (Kazakh), but that model was trained on a very small corpus, whereas my own corpus is fairly large (30 hours).

I trained MFA from scratch with mfa train, and the results were not consistent. There were some alignment errors, which led to problems when extracting phone durations.

I also assume that phonemizing the text would help improve the alignments. I've been training on graphemes.

shreeshailgan commented 2 months ago

Hi @asarsembayev, For me, the issue of degraded model outputs was not due to errors in MFA's alignment, but because the preprocessing script was probably written to work with older versions of MFA. Newer versions contain the empty string "" in place of the sp token, which was being ignored in the preprocessing script, leading to wrong alignments.

I had to make a couple of changes to resolve this:

1] I converted empty tokens to sp around here https://github.com/ming024/FastSpeech2/blob/d4e79eb52e8b01d24703b2dfc0385544092958f3/preprocessor/preprocessor.py#L262

2] I added the argument read_empty_intervals=True when reading the TextGrids. https://github.com/ming024/FastSpeech2/blob/d4e79eb52e8b01d24703b2dfc0385544092958f3/preprocessor/preprocessor.py#L163
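A minimal sketch of what those two changes amount to, not the repository's actual code. It assumes the intervals have already been read from the TextGrid with empty intervals kept, and represents them as plain (start, end, label) tuples so the example does not depend on any TextGrid library. The names normalize_label and get_alignment, and the sampling_rate and hop_length defaults, are illustrative assumptions. Also note that the keyword for keeping empty intervals may be spelled differently in your installed tgt version (possibly include_empty_intervals), so check the signature of the reader before relying on the name given above.

```python
# Phones treated as silence when trimming the start and end of an utterance.
SILENCE_PHONES = {"sil", "sp", "spn"}


def normalize_label(label):
    """Map the empty label written by newer MFA versions to the "sp" token."""
    return "sp" if label.strip() == "" else label


def get_alignment(intervals, sampling_rate=22050, hop_length=256):
    """Turn (start, end, label) intervals into phones and frame durations.

    sampling_rate and hop_length are assumed values; use the ones from
    your own preprocessing config.
    """
    phones, durations = [], []
    start_time = end_time = 0.0
    end_idx = 0
    for start, end, label in intervals:
        phone = normalize_label(label)
        if not phones:
            # Skip leading silence; the utterance starts at the first real phone.
            if phone in SILENCE_PHONES:
                continue
            start_time = start
        phones.append(phone)
        if phone not in SILENCE_PHONES:
            # Remember where the last real phone ends so trailing silence can be cut.
            end_time = end
            end_idx = len(phones)
        start_frame = round(start * sampling_rate / hop_length)
        end_frame = round(end * sampling_rate / hop_length)
        durations.append(int(end_frame - start_frame))
    # Drop trailing silence.
    return phones[:end_idx], durations[:end_idx], start_time, end_time


if __name__ == "__main__":
    # Hypothetical intervals in the newer MFA style: pauses carry an empty label.
    example = [(0.0, 0.1, ""), (0.1, 0.2, "HH"), (0.2, 0.3, ""), (0.3, 0.4, "AH0"), (0.4, 0.5, "")]
    print(get_alignment(example))
```

The point of the sketch is that the pause between two phones keeps its own duration entry. If the empty interval is dropped instead, the summed durations come out shorter than the trimmed audio, and every phone after the pause is shifted relative to the mel-spectrogram frames.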

asarsembayev commented 2 months ago

Have you worked out which older MFA version was used?

shreeshailgan commented 2 months ago

I think it was 1.0.1, since that was the latest version available when this repository first released its TextGrids for LJSpeech.