albertz opened 1 year ago
Or maybe something in between. I think our pooling is fine?
Note that we should change the `dropout` as well. The current default dropout is also probably way too high (0.1 by default) for such small dimensions. Once we go to 256, 0.1 is maybe OK and the dropout can stay the same. Not sure. Maybe we need some experiments first?
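Just to make the idea concrete, one could tie the default dropout to the model dimension, something like this (purely an illustrative sketch; the scaling rule and the function name are assumptions, not anything we have verified experimentally):

```python
def default_dropout(model_dim: int) -> float:
    """Hypothetical heuristic: keep 0.1 for model_dim >= 256,
    scale it down linearly for smaller dimensions."""
    return 0.1 if model_dim >= 256 else 0.1 * model_dim / 256


assert default_dropout(256) == 0.1
assert default_dropout(64) == 0.025  # much lower dropout for small dims
```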
I noticed, in ESPnet, when `RelPositionalEncoding` is used (the default), it still scales the conv-prenet output by a factor, see here, specifically:

```python
self.xscale = math.sqrt(self.d_model)
...
x = x * self.xscale
```
This is probably because such a scale is also there for the word embeddings in the original Transformer.
I tried to find out the motivation for this. I found this CV question. One answer states that it is to get the embeddings into a similar range as the positional encoding, or actually larger than the pos enc, which is important when you add them together. However, in our case, we do not actually add them, so I wonder if the scale is really necessary or helpful, or maybe even harmful. I guess we need to do an experiment.
As we discussed, it should be configurable to support both the RETURNN standard cases and also the ESPnet case (at least mostly, maybe except for the xscale).
For now, we would not set defaults, as it's not yet clear which are best.
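To make this concrete, here is a minimal PyTorch sketch of such a configurable scale (the names `ScaledInput` and `input_scale` are hypothetical, not existing RETURNN options): `input_scale=None` would be the RETURNN standard case (no scaling), `math.sqrt(d_model)` the ESPnet case.

```python
import math
from typing import Optional

import torch


class ScaledInput(torch.nn.Module):
    """Optionally scales the conv-prenet output, like ESPnet's xscale."""

    def __init__(self, input_scale: Optional[float] = None):
        super().__init__()
        # ESPnet uses sqrt(d_model), e.g. sqrt(256) = 16; None means no scaling.
        self.xscale = input_scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x if self.xscale is None else x * self.xscale


# ESPnet-like case:
espnet_like = ScaledInput(input_scale=math.sqrt(256))
# RETURNN standard case:
returnn_like = ScaledInput()
```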
So what else is missing after #219?
Striding is one thing. What else?
Ok, I added striding. But I'm thinking about refactoring it a bit more. Specifically, current problems:

- The `dropout` option is not really useful, as this is neither the way we at i6 do it nor the way ESPnet does it.
- ESPnet's subsampling modules (e.g. `Conv2dSubsampling6`) also contain the final linear transformation, but in our `ConformerConvSubsample` we have that separate. I'm not sure what's better.
- The `ConformerConvSubsample` naming is inconsistent with the `subsample_conv_...` options (the word order is swapped around).
- The `subsample_conv_...` options: I think we should follow more the style of our Transformer, and also how ESPnet does it, i.e. allow passing the whole module as an arg (`conv_subsample_module` or so; see the sketch below). In ESPnet, it is a string which then gets translated to the module class. In Transformer (ours and the PyTorch standard), it is the module directly. Then you could easily pass different subsampling prenets. We also should stop extending `ConformerConvSubsample` more and more, and instead have separate subsampling modules, like ESPnet. I think this is much easier to follow.
- `num_heads=8` as default?

I changed `num_heads` to 4 by default.
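A minimal sketch of the module-as-arg style from the list above (hypothetical signature; `conv_subsample_module` is just the suggested name, and the class body is a placeholder, not the actual implementation):

```python
from typing import Optional

import torch


class ConformerEncoder(torch.nn.Module):
    """Sketch: the subsampling prenet is passed in as a complete module
    (like our/PyTorch's Transformer style), instead of being configured
    through many separate subsample_conv_... options."""

    def __init__(self, conv_subsample_module: Optional[torch.nn.Module] = None):
        super().__init__()
        self.conv_subsample = conv_subsample_module
        # ... the actual Conformer blocks would follow here ...

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.conv_subsample is not None:
            x = self.conv_subsample(x)  # any subsampling prenet
        # ... run the Conformer blocks ...
        return x


# Swapping in a different subsampling prenet is then just:
encoder = ConformerEncoder(conv_subsample_module=torch.nn.Identity())
```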
I renamed the things as discussed, and changed the option to a single `single_layer`.
In ESPnet, the default is `cnn_module_kernel: int = 31`, but we have `conv_kernel_size = 32` as default. What's more reasonable?
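One relevant difference: with same-style padding, an odd kernel like 31 pads symmetrically with (31 - 1) / 2 = 15 frames on each side and keeps the sequence length, while an even kernel like 32 cannot do that with symmetric padding. A quick PyTorch check (illustrative only):

```python
import torch

x = torch.randn(1, 256, 100)  # (batch, channels, time)

# Odd kernel (ESPnet default 31): padding (31 - 1) // 2 = 15 keeps the length.
y_odd = torch.nn.Conv1d(256, 256, kernel_size=31, padding=15)(x)
assert y_odd.shape[-1] == 100

# Even kernel (32): symmetric padding of 16 gives length 101, i.e. the length
# cannot be preserved without asymmetric padding or cropping.
y_even = torch.nn.Conv1d(256, 256, kernel_size=32, padding=16)(x)
assert y_even.shape[-1] == 101
```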
The defaults we use for `ConformerConvSubsample` are wrong. Maybe also the structure of the layer is wrong. We should follow more standard code, e.g.:
https://github.com/espnet/espnet/blob/4138010fb66ad27a43e8bee48a4932829a0847ae/espnet/nets/pytorch_backend/transformer/subsampling.py#L162
https://github.com/espnet/espnet/blob/4138010fb66ad27a43e8bee48a4932829a0847ae/espnet2/asr/encoder/conformer_encoder.py#L164

Also see relative positional encoding, #132.
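For reference, a condensed paraphrase of the ESPnet-style Conv2d subsampling structure linked above (simplified sketch, not the exact ESPnet code; the positional encoding and masking are omitted):

```python
import torch


class Conv2dSubsampling(torch.nn.Module):
    """Sketch of the common ESPnet structure: two stride-2 Conv2d+ReLU layers
    (total time subsampling x4), then a linear projection to the model dim."""

    def __init__(self, in_feat_dim: int, out_dim: int):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(1, out_dim, kernel_size=3, stride=2),
            torch.nn.ReLU(),
            torch.nn.Conv2d(out_dim, out_dim, kernel_size=3, stride=2),
            torch.nn.ReLU(),
        )
        # Remaining feature dim after two convs with kernel 3, stride 2, no padding.
        feat = ((in_feat_dim - 1) // 2 - 1) // 2
        self.out = torch.nn.Linear(out_dim * feat, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat) -> add channel dim -> (batch, 1, time, feat)
        x = self.conv(x.unsqueeze(1))
        b, c, t, f = x.size()
        # Merge channels and feature dim, then project to the model dim.
        return self.out(x.transpose(1, 2).reshape(b, t, c * f))


x = torch.randn(2, 100, 80)  # (batch, time, feat)
y = Conv2dSubsampling(80, 256)(x)
assert y.shape == (2, 24, 256)  # time 100 -> 49 -> 24
```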