albertz opened 1 year ago
Or maybe something in between. I think our pooling is fine?
Note that we should change the `dropout` as well. The current default dropout is also probably way too high (0.1 by default) for such small dimensions. Once we go to 256, 0.1 is maybe OK and the dropout can stay the same. Not sure. Maybe we need some experiments first?
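Just to make the idea concrete, one could tie the default dropout to the model dimension, something like this (purely an illustrative sketch; the scaling rule and the function name are assumptions, not anything we have verified experimentally):

```python
def default_dropout(model_dim: int) -> float:
    """Hypothetical heuristic: keep 0.1 for model_dim >= 256,
    scale it down linearly for smaller dimensions."""
    return 0.1 if model_dim >= 256 else 0.1 * model_dim / 256


assert default_dropout(256) == 0.1
assert default_dropout(64) == 0.025  # much lower dropout for small dims
```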
I noticed, in ESPnet, when `RelPositionalEncoding` is used (the default), it still scales the conv-prenet output by a factor, see here, specifically:

```python
self.xscale = math.sqrt(self.d_model)
...
x = x * self.xscale
```
This is probably because such a scale is also there for the word embeddings in the original Transformer.
I tried to find out the motivation for this. I found this CV question. One answer states that it is to get the embeddings into a similar range as the positional encoding, or actually larger than the pos enc, which is important when you add them together. However, in our case, we do not actually add them, so I wonder if the scale is really necessary or helpful, or maybe even harmful. I guess we need to do an experiment.
As we discussed, it should be configurable to support both the RETURNN standard cases and also the ESPnet case (at least mostly, maybe except for the xscale).
For now, we would not set defaults, as it's not yet clear which are best.
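To make this concrete, here is a minimal PyTorch sketch of such a configurable scale (the names `ScaledInput` and `input_scale` are hypothetical, not existing RETURNN options): `input_scale=None` would be the RETURNN standard case (no scaling), `math.sqrt(d_model)` the ESPnet case.

```python
import math
from typing import Optional

import torch


class ScaledInput(torch.nn.Module):
    """Optionally scales the conv-prenet output, like ESPnet's xscale."""

    def __init__(self, input_scale: Optional[float] = None):
        super().__init__()
        # ESPnet uses sqrt(d_model), e.g. sqrt(256) = 16; None means no scaling.
        self.xscale = input_scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x if self.xscale is None else x * self.xscale


# ESPnet-like case:
espnet_like = ScaledInput(input_scale=math.sqrt(256))
# RETURNN standard case:
returnn_like = ScaledInput()
```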
So what else is missing after #219?
Striding is one thing. What else?
Ok, I added striding. But I'm thinking about refactoring it a bit more. Specifically, current problems:

- The `dropout` option is not really useful, as this is neither the way we at i6 do it nor the way ESPnet does it.
- ESPnet's subsampling modules (e.g. `Conv2dSubsampling6`) also contain the final linear transformation, but in our `ConformerConvSubsample` we have that separate. I'm not sure what's better.
- The `ConformerConvSubsample` naming is inconsistent with the `subsample_conv_...` options (the word order is swapped around).
- The `subsample_conv_...` options: I think we should follow more the style of our Transformer, and also how ESPnet does it, i.e. allow passing the whole module as an arg (`conv_subsample_module` or so; see the sketch below). In ESPnet, it is a string which then gets translated to the module class. In Transformer (ours and the PyTorch standard), it is the module directly. Then you could easily pass different subsampling prenets. We also should stop extending `ConformerConvSubsample` more and more, and instead have separate subsampling modules, like ESPnet. I think this is much easier to follow.
- `num_heads=8` as default?

I changed `num_heads` to 4 by default.
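A minimal sketch of the module-as-arg style from the list above (hypothetical signature; `conv_subsample_module` is just the suggested name, and the class body is a placeholder, not the actual implementation):

```python
from typing import Optional

import torch


class ConformerEncoder(torch.nn.Module):
    """Sketch: the subsampling prenet is passed in as a complete module
    (like our/PyTorch's Transformer style), instead of being configured
    through many separate subsample_conv_... options."""

    def __init__(self, conv_subsample_module: Optional[torch.nn.Module] = None):
        super().__init__()
        self.conv_subsample = conv_subsample_module
        # ... the actual Conformer blocks would follow here ...

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.conv_subsample is not None:
            x = self.conv_subsample(x)  # any subsampling prenet
        # ... run the Conformer blocks ...
        return x


# Swapping in a different subsampling prenet is then just:
encoder = ConformerEncoder(conv_subsample_module=torch.nn.Identity())
```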
I renamed the things as discussed, and changed the option to a single `single_layer`.
In ESPnet, the default is `cnn_module_kernel: int = 31`, but we have `conv_kernel_size = 32` as default. What's more reasonable?
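One relevant difference: with same-style padding, an odd kernel like 31 pads symmetrically with (31 - 1) / 2 = 15 frames on each side and keeps the sequence length, while an even kernel like 32 cannot do that with symmetric padding. A quick PyTorch check (illustrative only):

```python
import torch

x = torch.randn(1, 256, 100)  # (batch, channels, time)

# Odd kernel (ESPnet default 31): padding (31 - 1) // 2 = 15 keeps the length.
y_odd = torch.nn.Conv1d(256, 256, kernel_size=31, padding=15)(x)
assert y_odd.shape[-1] == 100

# Even kernel (32): symmetric padding of 16 gives length 101, i.e. the length
# cannot be preserved without asymmetric padding or cropping.
y_even = torch.nn.Conv1d(256, 256, kernel_size=32, padding=16)(x)
assert y_even.shape[-1] == 101
```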
The defaults we use for `ConformerConvSubsample` are wrong. Maybe also the structure of the layer is wrong. We should follow more standard code, e.g.:
https://github.com/espnet/espnet/blob/4138010fb66ad27a43e8bee48a4932829a0847ae/espnet/nets/pytorch_backend/transformer/subsampling.py#L162
https://github.com/espnet/espnet/blob/4138010fb66ad27a43e8bee48a4932829a0847ae/espnet2/asr/encoder/conformer_encoder.py#L164

Also see relative positional encoding, #132.
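For reference, a condensed paraphrase of the ESPnet-style Conv2d subsampling structure linked above (simplified sketch, not the exact ESPnet code; the positional encoding and masking are omitted):

```python
import torch


class Conv2dSubsampling(torch.nn.Module):
    """Sketch of the common ESPnet structure: two stride-2 Conv2d+ReLU layers
    (total time subsampling x4), then a linear projection to the model dim."""

    def __init__(self, in_feat_dim: int, out_dim: int):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(1, out_dim, kernel_size=3, stride=2),
            torch.nn.ReLU(),
            torch.nn.Conv2d(out_dim, out_dim, kernel_size=3, stride=2),
            torch.nn.ReLU(),
        )
        # Remaining feature dim after two convs with kernel 3, stride 2, no padding.
        feat = ((in_feat_dim - 1) // 2 - 1) // 2
        self.out = torch.nn.Linear(out_dim * feat, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat) -> add channel dim -> (batch, 1, time, feat)
        x = self.conv(x.unsqueeze(1))
        b, c, t, f = x.size()
        # Merge channels and feature dim, then project to the model dim.
        return self.out(x.transpose(1, 2).reshape(b, t, c * f))


x = torch.randn(2, 100, 80)  # (batch, time, feat)
y = Conv2dSubsampling(80, 256)(x)
assert y.shape == (2, 24, 256)  # time 100 -> 49 -> 24
```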