laiguokun / Funnel-Transformer


Question related to pretraining with residual connection #10

Open hxu38691 opened 3 years ago

hxu38691 commented 3 years ago

Hi,

I wonder: since the proposed one-step decoder also takes the full-length encoded representation as input, is it possible that the decoder relies mostly on the unpooled representation, which has more representational power than the pooled one, and therefore depends less on the pooling operation to achieve good pretraining results (correct me if I'm wrong)? If so, how can we ensure the pooled encoder representation remains effective when it is used in downstream tasks?
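For reference, here is a minimal sketch (in PyTorch, based on my reading of the paper rather than this repo's actual code) of the decoder input construction I am referring to: the pooled encoder output is up-sampled back to full length and added, via a residual connection, to the full-length hidden states from the first encoder block. The class name, layer counts, and the `repeat_interleave` up-sampling here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FunnelDecoderSketch(nn.Module):
    """Sketch of the one-step decoder: up-sample the pooled encoder output,
    add a residual connection from the full-length first-block hidden states,
    then run a few Transformer layers for token-level MLM pretraining.
    Names and hyperparameters are illustrative, not the repo's code."""

    def __init__(self, d_model, n_layers=2, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, n_layers)

    def forward(self, pooled_hidden, first_block_hidden):
        # pooled_hidden:      [batch, seq_len // stride, d_model] (compressed)
        # first_block_hidden: [batch, seq_len, d_model] (full length, unpooled)
        stride = first_block_hidden.size(1) // pooled_hidden.size(1)
        # Up-sample by repeating each compressed position `stride` times.
        upsampled = pooled_hidden.repeat_interleave(stride, dim=1)
        # Residual connection: this is where the decoder "sees" the
        # unpooled representation my question is about.
        decoder_input = upsampled + first_block_hidden
        return self.layers(decoder_input)
```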

Thanks