k2-fsa / snowfall

Moved to https://github.com/k2-fsa/icefall
Apache License 2.0

espnet-style attn_output_weight scaling and extra after-norm layer #204

Closed glynpu closed 3 years ago

glynpu commented 3 years ago

Conformer structure differences were identified by loading an espnet-trained model into snowfall: https://github.com/k2-fsa/snowfall/pull/201

  1. snowfall only scales q, while espnet scales attn_output_weights (see the sketch after this list).
  2. espnet's conformer has an extra layer_norm after the encoder.
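
A minimal, hypothetical sketch of the two differences (function and parameter names are illustrative, not the actual snowfall or espnet code):

    import torch
    import torch.nn as nn

    def attn_weights_scale_q(q, k, pos_bias):
        # snowfall-style: scale the query before the positional bias is added,
        # so the bias contribution to the scores is left unscaled.
        d = q.size(-1)
        scores = (q * d ** -0.5 + pos_bias) @ k.transpose(-2, -1)
        return torch.softmax(scores, dim=-1)

    def attn_weights_scale_scores(q, k, pos_bias):
        # espnet-style: compute the raw scores first, then scale
        # attn_output_weights, which scales the bias contribution as well.
        d = q.size(-1)
        scores = ((q + pos_bias) @ k.transpose(-2, -1)) * d ** -0.5
        return torch.softmax(scores, dim=-1)

    class EncoderWithAfterNorm(nn.Module):
        # espnet-style: one extra LayerNorm applied after the last encoder layer.
        def __init__(self, layers, d_model):
            super().__init__()
            self.layers = nn.ModuleList(layers)
            self.after_norm = nn.LayerNorm(d_model)

        def forward(self, x):
            for layer in self.layers:
                x = layer(x)
            return self.after_norm(x)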

With these two modifications and 30 epochs of training, the final result is a bit better (3.69 vs. 3.86 as reported in https://github.com/k2-fsa/snowfall/issues/154) than without them.

Could you help verify their effectiveness (maybe they are just training variance)? @zhu-han @pzelasko BTW, is there any mathematical background that explains when the scaling should be applied during the attn_output_weights computation? I read several papers but failed to find a clue about this.

Results WITH 4-gram LM lattice rescoring, with the modifications of this PR

avg epoch 16-20
2021-06-02 19:37:12,429 INFO [common.py:380] [test-clean] %WER 3.77% [1983 / 52576, 348 ins, 105 del, 1530 sub ]
2021-06-02 21:38:48,140 INFO [common.py:382] [test-other] %WER 7.86% [4116 / 52343, 704 ins, 260 del, 3152 sub ]
avg epoch 26-30
2021-06-02 19:25:40,616 INFO [common.py:380] [test-clean] %WER 3.69% [1938 / 52576, 386 ins, 96 del, 1456 sub ]
2021-06-02 21:45:22,304 INFO [common.py:382] [test-other] %WER 7.68% [4021 / 52343, 746 ins, 251 del, 3024 sub ]

Results of 4-gram lattice rescoring from #154

avg epoch 16-20
2021-05-21 09:46:26,814 INFO [common.py:380] [test-clean] %WER 3.87% [2036 / 52576, 334 ins, 116 del, 1586 sub ]
2021-05-21 09:53:26,347 INFO [common.py:380] [test-other] %WER 8.08% [4231 / 52343, 710 ins, 241 del, 3280 sub ]
avg epoch 26-30
2021-05-22 14:53:36,527 INFO [common.py:380] [test-clean] %WER 3.86% [2030 / 52576, 345 ins, 114 del, 1571 sub ]
2021-05-22 15:00:10,075 INFO [common.py:380] [test-other] %WER 8.07% [4223 / 52343, 708 ins, 254 del, 3261 sub ]

Results WITHOUT 4-gram LM lattice rescoring, with the modifications of this PR

avg epoch 16-20
2021-06-02 21:54:00,942 INFO [common.py:382] [test-clean] %WER 4.26% [2241 / 52576, 278 ins, 184 del, 1779 sub ]
2021-06-02 21:55:52,071 INFO [common.py:382] [test-other] %WER 8.61% [4505 / 52343, 602 ins, 386 del, 3517 sub ]
avg epoch 26-30
2021-06-02 21:49:51,271 INFO [common.py:382] [test-clean] %WER 4.14% [2179 / 52576, 296 ins, 177 del, 1706 sub ]
2021-06-02 21:51:30,037 INFO [common.py:382] [test-other] %WER 8.41% [4402 / 52343, 626 ins, 380 del, 3396 sub ]

Results from #154

avg epoch 16-20
2021-05-21 09:34:55,569 INFO [common.py:380] [test-clean] %WER 4.33% [2274 / 52576, 268 ins, 183 del, 1823 sub ]
2021-05-21 09:35:43,453 INFO [common.py:380] [test-other] %WER 8.96% [4690 / 52343, 584 ins, 389 del, 3717 sub ]
avg epoch 26-30
2021-05-22 14:45:39,709 INFO [common.py:380] [test-clean] %WER 4.31% [2267 / 52576, 293 ins, 182 del, 1792 sub ]
2021-05-22 14:46:36,179 INFO [common.py:380] [test-other] %WER 8.98% [4700 / 52343, 610 ins, 388 del, 3702 sub ]
danpovey commented 3 years ago

The scaling is just so that, assuming the input variance is about 1, the variance going into the softmax is about 1. But the difference between the two scaling methods only affects the bias parameters (effectively, the new way scales down the bias). It's surprising that it makes so much difference; perhaps this new version does not focus too much on nearby frames.
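
A quick numerical check of that point (a minimal sketch with hypothetical shapes, not the actual snowfall code): when a bias vector is added to q, the two placements of the 1/sqrt(d) factor give scores that differ exactly by the bias contribution times (1 - s).

    import torch

    head_dim = 64
    s = head_dim ** -0.5
    q = torch.randn(10, head_dim)
    k = torch.randn(10, head_dim)
    bias = torch.randn(head_dim)

    old = (q * s + bias) @ k.t()    # scale q only: the bias term stays unscaled
    new = ((q + bias) @ k.t()) * s  # scale attn_output_weights: the bias is scaled too

    # the two differ only through the bias: old - new == (1 - s) * (bias @ k.T)
    diff = old - new
    expected = (1 - s) * (bias @ k.t())
    print(torch.allclose(diff, expected.expand_as(diff), atol=1e-4))  # True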

danpovey commented 3 years ago

Maybe there will be more WER difference at worse WERs, e.g. before LM rescoring.

danpovey commented 3 years ago

.. don't you have the test-other results?

glynpu commented 3 years ago

Results before rescoring and for test-other are coming soon (being re-tested).

glynpu commented 3 years ago

Maybe there will be more WER difference at worse WERs, e.g. before LM rescoring.

The relative WER decrease shows no significant difference before and after LM rescoring.

| avg epoch 16-20 | no rescore, test-clean | no rescore, test-other | 4-gram lattice rescore, test-clean | 4-gram lattice rescore, test-other |
|---|---|---|---|---|
| before | 4.33 | 8.96 | 3.87 | 8.08 |
| current | 4.26 | 8.61 | 3.77 | 7.86 |
| relative decrease | 1.62% | 3.91% | 2.58% | 2.72% |

| avg epoch 26-30 | no rescore, test-clean | no rescore, test-other | 4-gram lattice rescore, test-clean | 4-gram lattice rescore, test-other |
|---|---|---|---|---|
| before | 4.31 | 8.98 | 3.86 | 8.07 |
| current | 4.14 | 8.41 | 3.69 | 7.68 |
| relative decrease | 3.94% | 6.35% | 4.40% | 4.83% |
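
(The relative decrease is presumably computed as (before - current) / before, e.g. (4.33 - 4.26) / 4.33 ≈ 1.62%.)
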
danpovey commented 3 years ago

still better though.. good..

danpovey commented 3 years ago

Can you make this an option passed in from the user code, like in your other branch, so that we can more easily decode with "old" models if we need to?

danpovey commented 3 years ago

..I'm just concerned it might be disruptive to make this change as-is.

glynpu commented 3 years ago

..I'm just concerned it might be disruptive to make this change as-is.

To stay compatible with previously trained models, maybe an optional config flag, e.g. is_espnet_structure (or a more proper name), which defaults to False, could be used. Like this:

    def __init__(self, num_features: int, num_classes: int, subsampling_factor: int = 4,
                 ....
                 is_espnet_structure: bool = False) -> None:
        ...
        self.is_espnet_structure = is_espnet_structure
        if self.normalize_before and self.is_espnet_structure:
            # extra layer-norm applied after the encoder stack, as in espnet
            self.after_norm = nn.LayerNorm(d_model)
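
A hypothetical usage sketch (the constructor arguments other than is_espnet_structure are illustrative and may not match the actual Conformer signature):

    # Defaulting to False keeps the old behaviour, so previously trained
    # checkpoints can still be decoded; current training scripts can pass True.
    model = Conformer(num_features=80, num_classes=5000,
                      subsampling_factor=4,
                      is_espnet_structure=True)
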
danpovey commented 3 years ago

Yes. We can change it to True in our current scripts; but it would at least make it possible to revert to False so we can test old models.

danpovey commented 3 years ago

Thanks a lot!