danpovey opened 3 years ago
Results after training with 1 job only (and uncommenting ali_model.eval(), though I doubt that matters) were:
2021-06-02 19:53:43,152 INFO [common.py:391] [test-clean] %WER 6.85% [3604 / 52576, 530 ins, 278 del, 2796 sub ]
2021-06-02 19:56:17,121 INFO [common.py:391] [test-other] %WER 17.57% [9195 / 52343, 1081 ins, 787 del, 7327 sub ]
and with LM rescoring:
2021-06-02 19:55:17,350 INFO [common.py:391] [test-clean] %WER 5.83% [3065 / 52576, 612 ins, 158 del, 2295 sub ]
2021-06-02 20:02:43,266 INFO [common.py:391] [test-other] %WER 15.30% [8006 / 52343, 1268 ins, 488 del, 6250 sub ]
vs. the checked-in results from @zhu-han which were:
# average over last 5 epochs (LM rescoring with whole lattice)
2021-05-02 00:36:42,886 INFO [common.py:381] [test-clean] %WER 5.55% [2916 / 52576, 548 ins, 172 del, 2196 sub ]
2021-05-02 00:47:15,544 INFO [common.py:381] [test-other] %WER 15.32% [8021 / 52343, 1270 ins, 501 del, 6250 sub ]
# average over last 5 epochs
2021-05-01 23:35:17,891 INFO [common.py:381] [test-clean] %WER 6.65% [3494 / 52576, 457 ins, 293 del, 2744 sub ]
2021-05-01 23:37:23,141 INFO [common.py:381] [test-other] %WER 17.68% [9252 / 52343, 1020 ins, 858 del, 7374 sub ]
... so according to this, it does not really make a difference which model we use for alignment.
Would it make sense to use a pure TDNN/TDNNF/CNN model for alignments? I was investigating alignments from the conformer recently, and my feeling was that they weren't perfect (even though the test-clean WER is ~4%) -- i.e., they sometimes seem a bit warped/shifted, but not in a consistent way. I think the self-attention layers allow the model to "cheat" to some extent with the alignments; I don't know if the same happens with RNNs, but I doubt it would happen with local-context models. Unfortunately, I don't have any means of providing a more objective evaluation than showing a screenshot (look closely at the boundaries with silences).
That's interesting, how did you obtain that plot? I think it may be hard to prevent the conformer model from doing this kind of thing using the current alignment method, since it's only present early in training and is not really a constraint.
I am thinking it might be possible, though, if we had a model that was good for alignment, to save 'constraints' derived from it, similar to what we do with Kaldi's LF-MMI training. That is: get (say) the one-best path from it, save it as an int32 tensor (e.g. as a .pt file indexed by utterance-id), and load it when training; then extend the boundaries of the phones by a couple of frames and treat the result as a mask on the nnet output, masking all (non-blank) phones that are not allowed by the alignment by adding a negative number to them. The only thing is, this will tend to interact with data augmentation and batching; it might be a little complicated to pass that information through those transforms.
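For concreteness, here is a minimal sketch of that masking idea -- a hypothetical helper, not existing snowfall code -- which turns a saved per-frame one-best phone alignment into an additive mask on the nnet output, with phone boundaries extended by a couple of frames:

```python
import torch

def alignment_mask(ali, num_frames, num_phones, extend=2, penalty=-1000.0):
    """Build an additive mask from a one-best phone alignment.

    ali: 1-D int32 tensor of per-frame phone ids (the saved one-best path,
         e.g. loaded from a .pt file indexed by utterance-id).
    Returns a (num_frames, num_phones) float tensor that is 0 where a phone
    is allowed and `penalty` elsewhere; blank (id 0) is never masked.
    """
    mask = torch.full((num_frames, num_phones), penalty)
    mask[:, 0] = 0.0  # always allow blank
    for t in range(num_frames):
        # Allow any phone seen within +/- `extend` frames of frame t,
        # i.e. extend each phone's boundaries by a couple of frames.
        lo = max(0, t - extend)
        hi = min(num_frames, t + extend + 1)
        for p in ali[lo:hi].tolist():
            mask[t, p] = 0.0
    return mask

# Usage sketch: nnet_output is (num_frames, num_phones) log-probs;
# masked_output = nnet_output + alignment_mask(ali, T, C)
```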
I'll submit a PR with the code that allows computing alignments and visualizing them later.
As to data augmentation of alignments, we could extend most transforms to handle it -- I'm pretty sure we can still do speed perturbation, noise mixing, specaug masks (but probably not the warping). We don't have reverb in Lhotse yet, but probably it's straightforward as well. Batching is possible too, but I think the alignments would need to be a part of Lhotse rather than external to it, so we can process them properly with everything else in the dataloader.
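For speed perturbation specifically, remapping a frame-level alignment is just index arithmetic. A hedged sketch (assuming a per-frame phone-id list, and the Kaldi/Lhotse convention that speed factor f scales the duration by 1/f):

```python
def perturb_alignment(ali_frames, factor):
    """Remap a per-frame phone alignment after speed perturbation.

    With speed factor `factor`, the perturbed audio has about
    len(ali_frames) / factor frames, and frame t of the perturbed
    utterance corresponds to frame round(t * factor) of the original.
    """
    num_frames = int(len(ali_frames) / factor)
    return [ali_frames[min(len(ali_frames) - 1, round(t * factor))]
            for t in range(num_frames)]
```

Noise mixing leaves the alignment unchanged, and specaug masks only touch the features, so those are trivially compatible; time warping is the one that would genuinely invalidate the alignment.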
cool!
Regarding this: it's actually weird that the CTC and MMI alimdl would not make a difference. Some time ago, I think I looked at both CTC and MMI posteriors, and they are quite different -- the CTC posteriors are spiky and the MMI posteriors are not (i.e., MMI tends to emit repeated phone ids, whereas CTC tends to emit one phone id followed by blank). Given the way the alimdl's posteriors are added to the main model's posteriors, I'd have thought that would matter.
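To make the spiky-vs-smooth distinction concrete, here is a toy, purely illustrative example (made-up numbers, not taken from either model) of how CTC-style and MMI-style posteriors over (blank, phone) can differ across the frames of a single phone, and how adding their log-posteriors to the main model's nudges different frames:

```python
import torch

# Four frames that all belong to one phone (columns: blank, phone).
# A CTC-style model is "spiky": one confident frame, blank elsewhere.
ctc = torch.tensor([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.9, 0.1]])
# An MMI-style model spreads mass over the phone's whole duration.
mmi = torch.tensor([[0.2, 0.8], [0.2, 0.8], [0.2, 0.8], [0.2, 0.8]])

# An undecided main model: uniform log-posteriors on every frame.
main = torch.full((4, 2), 0.5).log()

# Adding the ali-model's log-posteriors guides the main model:
guided_by_ctc = main + ctc.log()  # prefers the phone on one frame only
guided_by_mmi = main + mmi.log()  # prefers the phone on every frame
```

With a spiky guide, only one frame is pushed towards the phone and the rest towards blank, while the MMI guide pushes the whole phone duration, which is why one might expect the choice of alimdl to matter.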
Mm, I would have expected the MMI one to be better; but since we only use this at the start of training to guide the model towards plausible alignments, it could be that the difference gets lost by the end.
Below are some notes I made about results. There is a modest improvement of around 0.3% absolute on test-other from using the MMI model rather than the CTC model for alignment.