k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Type of phoneme units in conformer ctc phoneme-based set-up #195

Open armusc opened 2 years ago

armusc commented 2 years ago

Hi

I would like to try a phoneme-based setting with the conformer CTC architecture (encoder only, since the decoder cannot be used in this case due to the sos/eos symbols). I'm not sure whether you have tested this kind of model and, if so, what type of phoneme units you used. I tried monophones with no word-position information attached; the results do not look good. It seems strange to use no context at all with such short acoustic units. Any recommendations?

armusc commented 2 years ago

My phone-based trainings do not seem to converge at all. At first I trained a BPE 500 CTC conformer model and it works well; this is the training tensorboard: [tensorboard screenshot]

then two phone-based settings: one based on the conformer (encoder only, same parameters as the BPE 500 setup), and this is its training tensorboard: [tensorboard screenshot]

and then the tdnn-lstm model, also based on the CTC loss:

[tensorboard screenshot]

All of these use the default training parameters. I'm struggling to understand where the problem lies: as far as I know, only the lang directory changes. The lexicon is phone based, the tokens are monophones, and the preparation worked well (I have the words.txt, tokens.txt and L.pt files). The dataset is identical to the BPE 500 training, and so are the lhotse manifests and torch data loaders. Sorry to bother, but does anyone have tips on where I should focus?

thanks

danpovey commented 2 years ago

I would focus on the TDNN+LSTM first, because that one is generally easier to get to converge (transformers/conformers can be harder). I think Librispeech can be hard to converge initially, because the utterances are long. I'm a little unclear how you did the attention part, if you have a phone-based LM. If there are optional silences and multiple pronunciations, how did you decide what to make the supervision sequence for attention? I believe our example scripts may still have a part showing how you can do phone-based training-- certainly it shows how to prepare the phone lang dir, but I don't know what you have to change to train the system that way.
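To make that choice concrete, here is a purely hypothetical sketch (not icefall code): with a phone lexicon, turning a word transcript into a single supervision sequence forces you to pick one pronunciation per word and to decide whether optional silence appears. Below, the first listed pronunciation is taken and silence is simply dropped; any other convention is equally possible.

```python
# Hypothetical illustration only (not icefall code): building one phone
# supervision sequence from a word transcript when the lexicon allows
# several pronunciations per word and optional silence between words.
from collections import defaultdict

def load_lexicon(path):
    """Read a Kaldi-style lexicon.txt: <word> <phone1> <phone2> ..."""
    lexicon = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2:
                lexicon[fields[0]].append(fields[1:])
    return lexicon

def transcript_to_phones(transcript, lexicon, unk="<UNK>"):
    """One (lossy) convention: take the first pronunciation, insert no silence."""
    phones = []
    for word in transcript.split():
        prons = lexicon.get(word) or lexicon.get(unk, [[unk]])
        phones.extend(prons[0])  # arbitrary choice when multiple prons exist
    return phones
```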

armusc commented 2 years ago

Thanks for your reply

- I have been using the Librispeech scripts and setup for those training runs, but the dataset is a different one: it comprises about 40 hours of Dutch BN 16 kHz data, to which 6 additional copies were added via augmentation techniques (Kaldi multistyle training: two speed perturbations + reverb and the three MUSAN-based noise conditions); the Kaldi data files were imported into lhotse manifests but the features were re-computed.

1) The BPE 500 training was successful; the tensorboard logs are in the previous post. The WER is ~15% relative better than my best Kaldi WER on the same dataset (and augmentation techniques), i.e. a "7n" TDNN with chain left-biphones and i-vectors.
1a) I also tried reducing the encoder-decoder size to get a ~20M-parameter model and it worked very well, with a negligible performance loss with respect to the original ~100M-parameter model, while being faster to decode and using much less memory (good especially on GPUs with limited memory); btw this also seems to confirm the results of the original Conformer paper for different model sizes.

2) I then used the same data (i.e. asr_datamodule.py is the same) to try a phone-based system, because I believe phone-based phonetisation can still be very valuable for certain languages and for deciding the pronunciation of foreign words, etc. I did not use any decoder/attention loss in the conformer CTC setting: --num-decoder-layers 0 --att-rate 0.0, as already recommended in the train.py script. My data/lang_phone looks fine: the lexicon is monophone based, and words.txt, tokens.txt, L and L_disambig all exist.
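As I understand it (roughly paraphrasing what the recipe does rather than quoting icefall code), those two flags reduce the training objective to pure CTC:

```python
# Rough paraphrase of my understanding of the conformer_ctc recipe, not a copy
# of icefall code: with --att-rate 0.0 and --num-decoder-layers 0 the attention
# decoder is never used and the loss is pure CTC; otherwise the two losses are
# interpolated.
att_rate = 0.0          # value passed via --att-rate
num_decoder_layers = 0  # value passed via --num-decoder-layers

def combined_loss(ctc_loss, att_loss=None):
    if att_rate == 0.0 or att_loss is None:
        return ctc_loss  # encoder-only, CTC-only training
    return (1.0 - att_rate) * ctc_loss + att_rate * att_loss
```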

head data/lang_phone_train/lexicon.txt
'm  @@ mm
'm  EE mm
's  ss
't  tt
+BREATH+  +BREATH+
+CONV+  +CONV+
+NOISE+  +NOISE+
3D-printing  dd rr ii dd ee pp rr II nn tt II NN
3G  dd rr ii GG ee
3G-netwerk  dd rr ii GG ee nn EE tt VV EE rr kk

head -5 data/lang_phone_train/words.txt
<eps> 0
'm 1
's 2
't 3
+BREATH+ 4

head -5 data/lang_phone_train/tokens.txt
<eps> 0
+BREATH+ 1
+CONV+ 2
+NOISE+ 3
@@ 4

ll data/lang_phone_train/L.pt
-rw-rw---- 1 amuscariello users 5090727 janv. 26 13:20 data/lang_phone_train/L.pt

ll data/lang_phone_train/L_disambig.pt
-rw-rw---- 1 amuscariello users 5218343 janv. 26 13:20 data/lang_phone_train/L_disambig.pt

The model does not converge, as shown in the second tensorboard in the previous post.

3) I tried the same dataset and the same data/lang_phone with tdnn-lstm, since I have a reference WER provided by icefall on Librispeech (so, while there is a big performance loss w.r.t. conformer CTC BPE, it still works there). In my case the model again does not converge, as shown in the third tensorboard in the previous post.

Either I'm doing something silly that I cannot see at the moment, or might it be that monophone systems with end-to-end models are hard to converge on this dataset (40h)? It seems strange, but I don't have experience with this.
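For what it's worth, here is a quick sanity check one can run on such a lang dir. This is a hedged sketch: it assumes the usual k2/icefall layout in which L.pt stores an FSA saved as a dict and tokens.txt/words.txt are symbol tables; the paths simply follow the listing above.

```python
# Load the phone lang dir and print a few basic counts as a sanity check.
import torch
import k2

lang_dir = "data/lang_phone_train"

token_sym = k2.SymbolTable.from_file(f"{lang_dir}/tokens.txt")
word_sym = k2.SymbolTable.from_file(f"{lang_dir}/words.txt")
L = k2.Fsa.from_dict(torch.load(f"{lang_dir}/L.pt"))

print("num tokens:", len(token_sym.symbols))
print("num words:", len(word_sym.symbols))
print("L arcs:", L.num_arcs)
```
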
danpovey commented 2 years ago

Could it be because you removed the attention-decoder loss? That part of the loss helps the CTC part to converge. If your utterances are relatively long, that can be an impediment to convergence.


armusc commented 2 years ago

Could it be because you removed the attention-decoder loss? That part of the loss helps the CTC part to converge. If your utterances are relatively long, that can be an impediment to convergence.

I thought that might play a role; that's why I wanted to try the tdnn-lstm approach, where there is no decoder/attention loss and you have a reasonable result on Librispeech. The corpus is like this (~35 hours * 6 augmented copies):

Total duration (hours): 240.0
Speech duration (hours): 240.0 (100.0%)

Duration statistics (seconds):
mean  4.2
std   2.7
min   0.1
25%   2.2
50%   3.6
75%   5.4
max  39.8
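Those numbers look like the output of lhotse's CutSet.describe(); if so, something like the following reproduces them (the manifest path is a placeholder, not the actual file name used here):

```python
# Hedged sketch: compute the duration statistics above with lhotse.
from lhotse import CutSet

cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")  # placeholder path
cuts.describe()  # prints total/speech duration and per-cut duration percentiles
```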

danpovey commented 2 years ago

OK, the data doesn't seem excessively long. Could try a lower learning rate, maybe?


pzelasko commented 2 years ago

FWIW I contributed the changes that made it possible to train a phone conformer with pure CTC in Icefall again. I don't remember the exact numbers I got with the LibriSpeech phone-based system, but the WER wasn't very far from BPE+CTC+attention decoder, even though it didn't use the attention decoder at all. But since your data is much smaller, the training might behave differently...

BTW I just noticed in your tensorboard that, because the model doesn't have a decoder, your max LR automatically went down from ~2e-3 to ~7e-4. You might need to carefully investigate which hyperparameters are affected by the model having fewer params and tweak them.
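For reference, here is a sketch of a Noam-style schedule of the kind I believe this recipe uses; the constants (model size 512, 80k warm-up steps, lr-factor 5.0, exposed, if I remember correctly, as --lr-factor in train.py) are assumptions, but the formula shows what determines the peak LR and hence what you might retune.

```python
# Assumed Noam-style learning-rate schedule (constants are guesses, not taken
# from icefall): the peak LR scales with lr_factor * model_size**-0.5, so a
# different model configuration or a different --lr-factor shifts the maximum.
def noam_lr(step, model_size=512, lr_factor=5.0, warmup=80000):
    step = max(step, 1)
    return lr_factor * model_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

print(noam_lr(80000))   # peak LR, reached at the end of warm-up (~7.8e-4 here)
print(noam_lr(200000))  # decays as step**-0.5 afterwards
```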

armusc commented 2 years ago

Thanks, I'll let you all know. I have to share two GPUs, so I might not have results immediately, even though the training itself is fast with this dataset.

armusc commented 2 years ago

FWIW I contributed the changes that made it possible to train a phone conformer with pure CTC in Icefall again. I don't remember the exact numbers I got with the LibriSpeech phone-based system, but the WER wasn't very far from BPE+CTC+attention decoder, even though it didn't use the attention decoder at all. But since your data is much smaller, the training might behave differently...

BTW I just noticed in your tensorboard that, because the model doesn't have a decoder, your max LR automatically went down from ~2e-3 to ~7e-4. You might need to carefully investigate which hyperparameters are affected by the model having fewer params and tweak them.

Btw, when you said you tried with phones, did you mean monophones? No context at all?

pzelasko commented 2 years ago

Yes, CMUdict monophones with lexical stress markers (typical Libri setup), without positional phone markers (_{B,I,E,S}); I think there are about 70 tokens total.
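Just to make that token set concrete, a hypothetical example (the phone list below is illustrative, not the actual inventory used):

```python
# Illustrative only: CMUdict-style phones keep the lexical stress digit
# (AH0/AH1/AH2) but carry no word-position suffix. The helper strips
# Kaldi-style _B/_I/_E/_S markers in case a lexicon includes them.
EXAMPLE_PHONES = ["AH0", "AH1", "AH2", "IY0", "IY1", "K", "T", "S"]

def strip_position_marker(phone: str) -> str:
    """Map e.g. 'AH1_B' -> 'AH1'; phones without a marker are unchanged."""
    for suffix in ("_B", "_I", "_E", "_S"):
        if phone.endswith(suffix):
            return phone[: -len(suffix)]
    return phone

assert strip_position_marker("AH1_B") == "AH1"
assert strip_position_marker("K") == "K"
```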