rwth-i6 / returnn_common

Common building blocks for RETURNN configs, such as models, training concepts, etc

Conformer frontend should fix dimensions, be more standard #219

Open albertz opened 1 year ago

albertz commented 1 year ago

The defaults we use for ConformerConvSubsample are wrong. Maybe also the structure of the layer is wrong. We should follow more standard code, e.g.:

https://github.com/espnet/espnet/blob/4138010fb66ad27a43e8bee48a4932829a0847ae/espnet/nets/pytorch_backend/transformer/subsampling.py#L162
https://github.com/espnet/espnet/blob/4138010fb66ad27a43e8bee48a4932829a0847ae/espnet2/asr/encoder/conformer_encoder.py#L164
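
For reference, the structure of the linked ESPnet Conv2dSubsampling frontend is roughly as follows (a paraphrased sketch, not verbatim from the linked code; the positional-encoding module that ESPnet appends after the linear projection is left out here):

```python
import torch


class Conv2dSubsampling(torch.nn.Module):
    """Paraphrased sketch of ESPnet's Conv2dSubsampling (4x time subsampling):
    two Conv2d layers with kernel 3 and stride 2 over (time, feature),
    then a linear projection to the model dimension."""

    def __init__(self, idim: int, odim: int):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(1, odim, kernel_size=3, stride=2),
            torch.nn.ReLU(),
            torch.nn.Conv2d(odim, odim, kernel_size=3, stride=2),
            torch.nn.ReLU(),
        )
        # remaining feature dim after two stride-2 convs without padding
        f = ((idim - 1) // 2 - 1) // 2
        self.out = torch.nn.Linear(odim * f, odim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, idim)
        x = self.conv(x.unsqueeze(1))  # -> (batch, odim, time', feat')
        b, c, t, f = x.size()
        return self.out(x.transpose(1, 2).reshape(b, t, c * f))  # -> (batch, time', odim)
```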

Also see relative positional encoding, #132.

albertz commented 1 year ago

Or maybe something in between. I think our pooling is fine?

Note that we should change the dropout as well. The current default dropout (0.1) is probably way too high for such small dimensions. Once we go to 256 dimensions, it is maybe ok and the dropout can stay the same. Not sure.

Maybe we need some experiments first?

albertz commented 1 year ago

I noticed, in ESPnet, when RelPositionalEncoding is used (default), it still scales the conv-prenet output by a factor, see here, specifically:

        self.xscale = math.sqrt(self.d_model)
...
        x = x * self.xscale

This is probably because such a scale is also applied to the word embeddings in the original Transformer.

I tried to find out about the motivation for this. I found this CV question. One answer states that it is there to bring the embeddings into a similar range as the positional encoding, or actually a larger range than the pos enc, which is important when you add them together. However, in our case here we do not add them, so I wonder whether the scale is really necessary or helpful, or maybe even hurtful. I guess we need to do an experiment.
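
To make the magnitude concrete, a small illustration (own sketch, not ESPnet code), assuming d_model = 256:

```python
import math

import torch

d_model = 256
xscale = math.sqrt(d_model)  # 16.0

x = torch.randn(1, 100, d_model)  # e.g. conv-prenet output, roughly unit scale
print(x.std().item(), (x * xscale).std().item())  # ~1.0 vs ~16.0
```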

albertz commented 1 year ago

As we discussed, it should be configurable to support both the RETURNN standard cases and also the ESPnet case (at least mostly, maybe except for the xscale).

For now, we would not set defaults, as it's not yet clear which are best.

albertz commented 1 year ago

So what else is missing after #219?

Striding is one thing. What else?

albertz commented 1 year ago

Ok I added striding. But I'm thinking about refactoring it a bit more. Specifically, current problems:

albertz commented 1 year ago

I changed num_heads to 4 by default.

albertz commented 1 year ago

I renamed the things as discussed, and changed the option to a single single_layer option.

albertz commented 1 year ago

In ESPnet, default cnn_module_kernel: int = 31. But we have conv_kernel_size = 32 as default. What's more reasonable?
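
Note: one argument for an odd kernel size (as in ESPnet) is that a depthwise conv with padding (k - 1) // 2 then gives an exact "same" output length and a symmetric context window around each frame; with an even kernel size, "same" padding is necessarily asymmetric. A minimal PyTorch-style sketch (not the actual RETURNN code):

```python
import torch

for k in (31, 32):
    # depthwise 1D conv over time, padded for "same" output length
    conv = torch.nn.Conv1d(256, 256, kernel_size=k, padding=(k - 1) // 2, groups=256)
    x = torch.zeros(1, 256, 100)  # (batch, channels, time)
    print(k, conv(x).shape[-1])  # k=31 -> 100 (exact "same"), k=32 -> 99
```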