k2-fsa / snowfall


Use MMI not CTC model for alignment #203

Open danpovey opened 3 years ago

danpovey commented 3 years ago

Below are some notes I made about the results. There is a modest improvement of around 0.3% absolute on test-other from using the MMI model, rather than the CTC model, for alignment.

  `mmiali` experiment, branch=mmiali.  Use the MMI TDNN+LSTM model, not the CTC model, for alignment;
  this requires retraining the MMI TDNN+LSTM model with subsampling-factor=4 to avoid mismatch.

The baseline for what's below (which was trained with mmi_att_transformer_train.py with --world-size=2 and --full-libri=False) can be taken to be: 6.82%, 18.00%, 5.78%, 15.46% (test-clean, test-other, then the same two with 4-gram LM rescoring), taken from /ceph-dan/snowfall/egs/librispeech/asr/simple_v1/exp-conformer-noam-mmi-att-musan-sa-vgg-rework (the checked-in result with the vgg frontend in RESULTS.md is with 1 job, not 2).

2021-06-02 10:49:26,220 INFO [common.py:380] [test-clean] %WER 6.81% [3583 / 52576, 496 ins, 284 del, 2803 sub ]
2021-06-02 10:51:41,617 INFO [common.py:380] [test-other] %WER 17.64% [9234 / 52343, 1024 ins, 848 del, 7362 sub ]
[with 4-gram LM rescoring]:
2021-06-02 12:11:52,226 INFO [common.py:391] [test-clean] %WER 5.72% [3009 / 52576, 566 ins, 158 del, 2285 sub ]
2021-06-02 12:18:23,522 INFO [common.py:391] [test-other] %WER 15.18% [7946 / 52343, 1176 ins, 538 del, 6232 sub ]

  # Below are results for the alignment model itself, from exp-lstm-adam-mmi-bigram-musan-dist-s4/epoch-9.pt:
  this experiment (with subsampling-factor=4):
     2021-06-01 21:08:15,043 INFO [mmi_bigram_decode.py:261] %WER 10.66% [5604 / 52576, 718 ins, 587 del, 4299 sub ]
  baseline (with subsampling-factor=3):
     2021-06-01 12:06:43,106 INFO [mmi_bigram_decode.py:261] %WER 10.38% [5455 / 52576, 713 ins, 510 del, 4232 sub ]
danpovey commented 3 years ago

Results after training with 1 job only (and uncommenting ali_model.eval(), which I doubt matters) were:

2021-06-02 19:53:43,152 INFO [common.py:391] [test-clean] %WER 6.85% [3604 / 52576, 530 ins, 278 del, 2796 sub ]
2021-06-02 19:56:17,121 INFO [common.py:391] [test-other] %WER 17.57% [9195 / 52343, 1081 ins, 787 del, 7327 sub ]
and with LM rescoring:
2021-06-02 19:55:17,350 INFO [common.py:391] [test-clean] %WER 5.83% [3065 / 52576, 612 ins, 158 del, 2295 sub ]
2021-06-02 20:02:43,266 INFO [common.py:391] [test-other] %WER 15.30% [8006 / 52343, 1268 ins, 488 del, 6250 sub ]

vs. the checked-in results from @zhu-han, which were:

# average over last 5 epochs (LM rescoring with whole lattice)
2021-05-02 00:36:42,886 INFO [common.py:381] [test-clean] %WER 5.55% [2916 / 52576, 548 ins, 172 del, 2196 sub ]
2021-05-02 00:47:15,544 INFO [common.py:381] [test-other] %WER 15.32% [8021 / 52343, 1270 ins, 501 del, 6250 sub ]

# average over last 5 epochs
2021-05-01 23:35:17,891 INFO [common.py:381] [test-clean] %WER 6.65% [3494 / 52576, 457 ins, 293 del, 2744 sub ]
2021-05-01 23:37:23,141 INFO [common.py:381] [test-other] %WER 17.68% [9252 / 52343, 1020 ins, 858 del, 7374 sub ]

... so according to this, it does not really make a difference which model we use for alignment.

pzelasko commented 3 years ago

Would it make sense to use a pure TDNN/TDNNF/CNN model for alignments? I was investigating alignments from the conformer recently and my feeling was that they weren't perfect (even though the test-clean WER is ~4%) -- i.e., they seem a bit warped/shifted sometimes, but not in a consistent way. I think the self-attention layers allow the model to "cheat" on the alignments to some extent; I don't know if the same happens with RNNs, but I doubt it would happen with local-context models. Unfortunately, I don't have any means to provide a more objective evaluation than showing a screenshot (look closely at the boundaries with silences).
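A minimal sketch of the kind of visualization being described, assuming we already have framewise log-posteriors for one utterance and reference segment boundaries in frames (both inputs and the function name are hypothetical here, not snowfall's actual API):

```python
import matplotlib.pyplot as plt
import torch


def plot_alignment(log_posts: torch.Tensor, boundaries: list) -> None:
    """log_posts: (num_frames, num_phones) log-posteriors for one utterance.
    boundaries: reference segment boundaries, as frame indices."""
    _, ax = plt.subplots(figsize=(12, 4))
    # Transpose so time runs along the x-axis and phone ids along the y-axis.
    ax.imshow(log_posts.exp().T.numpy(), aspect="auto", origin="lower")
    for b in boundaries:
        # Overlay reference boundaries; warped/shifted alignments show up as
        # posterior mass spilling across these lines, especially at silences.
        ax.axvline(b, color="white", linestyle="--", linewidth=0.8)
    ax.set_xlabel("frame")
    ax.set_ylabel("phone id")
    plt.show()
```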

danpovey commented 3 years ago

That's interesting; how did you obtain that plot? I think it may be hard to prevent the conformer model from doing this kind of thing with the current alignment method, since the alignment model's influence is only present early in training and is not really a constraint.

I am thinking it might be possible, though, if we had a model that was good for alignment, to save 'constraints' derived from it, similar to what we do with Kaldi's LF-MMI training. That is: get (say) the one-best path from it, save it as a tensor of int32_t, e.g. as a .pt file indexed by utterance-id, and load that when training; then extend the boundaries of the phones by a couple of frames and treat the result as a mask on the nnet output, masking all (non-blank) phones that are not allowed by the alignment by adding a negative number to them. The only thing is, this will tend to interact with data augmentation and batching; it might be a little complicated to pass that information through those transforms.
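A rough sketch of the masking idea, assuming the one-best alignment has been saved per utterance as an int32 tensor of phone ids, one per output frame; all names here are illustrative, not snowfall's actual API:

```python
import torch


def alignment_mask(ali: torch.Tensor, num_phones: int, tolerance: int = 2) -> torch.Tensor:
    """ali: (num_frames,) int32 one-best phone id per frame, e.g. loaded via
    torch.load('ali.pt')[utt_id].  Returns a (num_frames, num_phones) bool
    mask of phones allowed at each frame."""
    num_frames = ali.shape[0]
    mask = torch.zeros(num_frames, num_phones, dtype=torch.bool)
    for t in range(num_frames):
        # Extend each phone's boundaries by `tolerance` frames on either side:
        # a phone is allowed at frame t if it occurs within that window.
        lo = max(0, t - tolerance)
        hi = min(num_frames, t + tolerance + 1)
        mask[t, ali[lo:hi].long()] = True
    mask[:, 0] = True  # assume id 0 is blank, which is always allowed
    return mask


def apply_alignment_penalty(nnet_output: torch.Tensor,
                            mask: torch.Tensor,
                            penalty: float = -10.0) -> torch.Tensor:
    """Add a negative number to the log-probs of phones the alignment disallows."""
    return nnet_output + (~mask).to(nnet_output.dtype) * penalty
```

Carrying the per-utterance `ali` tensors through augmentation and batching intact is exactly the interaction mentioned above.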

pzelasko commented 3 years ago

Later I'll submit a PR with code for computing alignments and visualizing them.

As to data augmentation of alignments, we could extend most transforms to handle it -- I'm pretty sure we can still do speed perturbation, noise mixing, and specaug masks (but probably not the time warping). We don't have reverb in Lhotse yet, but it is probably straightforward as well. Batching is possible too, but I think the alignments would need to be part of Lhotse rather than external to it, so we can process them together with everything else in the dataloader.
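For speed perturbation specifically, the transformation of a frame-level alignment is simple: with speed factor f, an event at time t lands at t / f, so frame indices scale the same way. A hypothetical sketch, not actual Lhotse code:

```python
import torch


def perturb_alignment_speed(ali: torch.Tensor, factor: float) -> torch.Tensor:
    """ali: (num_frames,) phone id per frame.  Returns the alignment
    resampled to the new frame count of the speed-perturbed utterance
    (factor > 1 makes the audio faster, hence shorter)."""
    num_frames = ali.shape[0]
    new_num_frames = int(round(num_frames / factor))
    # Each new frame reads the phone at the corresponding old time position.
    old_idx = (torch.arange(new_num_frames) * factor).long().clamp(max=num_frames - 1)
    return ali[old_idx]
```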

danpovey commented 3 years ago

cool!


pzelasko commented 3 years ago

Regarding this: it's actually weird that the CTC and MMI alignment models would not make a difference. Some time ago I looked at both CTC and MMI posteriors, and they are quite different -- the CTC posteriors are spiky and the MMI posteriors are not (i.e., MMI tends to recognize repeated phone ids, whereas CTC tends to recognize one phone id followed by blank). Given the way the alignment model's posteriors are added to the main model's posteriors, I'd think that would matter.
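Schematically, the combination being referred to might look like the following; the linear decay schedule and all names are assumptions, not snowfall's exact code:

```python
import torch


def combine_with_ali_model(nnet_output: torch.Tensor,
                           ali_output: torch.Tensor,
                           batch_idx: int,
                           warmup_batches: int = 4000) -> torch.Tensor:
    """Both inputs: (batch, num_frames, num_phones) log-posteriors."""
    if batch_idx >= warmup_batches:
        # The alignment model only guides the early part of training.
        return nnet_output
    scale = 1.0 - batch_idx / warmup_batches  # assumed decay schedule
    # A spiky CTC alimdl puts most of its mass on blank between spikes, while
    # an MMI alimdl spreads mass over repeated phone ids, so the summed scores
    # can differ frame-by-frame even when both imply the same one-best path.
    return nnet_output + scale * ali_output
```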

danpovey commented 3 years ago

Mm, I would have expected the MMI one to be better; but since we're just using this at the start of training to guide the model towards plausible alignments, it could be that the difference gets lost by the end.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/snowfall/pull/203#issuecomment-878455204, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZFLO3JJUCLR6JWKDUK6H3TXMQAXANCNFSM45565RSQ .