joshyZhou / AST

Adapt or Perish: Adaptive Sparse Transformer with Attentive Feature Refinement for Image Restoration

about encoder design #8

Open zhaozhaoooo opened 1 week ago

zhaozhaoooo commented 1 week ago

Hello, I noticed a paragraph in the paper stating briefly that the attention operation is omitted in the encoder module, so the encoder consists only of FFN layers:

> Here, we omit the attention mechanism within the standard transformer block in the encoder, due to the fact that its low-pass filter nature can hinder learning desired local patterns, especially in the early stages.

I have some doubts about this design: fully connected layers can be seen as convolutional layers with a 1×1 kernel, and the FFN layers may include only a small number of 3×3 convolutional layers. Doesn't this make the model's receptive field too small?
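For concreteness, here is a minimal PyTorch sketch of the kind of conv-based FFN block I mean (the class name, expansion ratio, and exact structure are my assumptions for illustration, not the repository's actual implementation):

```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    """Hypothetical conv-based FFN block: 1x1 convs act as per-pixel
    fully connected layers, and a single 3x3 depthwise conv provides
    the only spatial mixing."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Conv2d(dim, hidden, kernel_size=1)   # pointwise: no spatial mixing
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3,
                                padding=1, groups=hidden)     # depthwise 3x3: the only spatial mixing
        self.act = nn.GELU()
        self.project = nn.Conv2d(hidden, dim, kernel_size=1)  # pointwise projection back to dim

    def forward(self, x):
        return x + self.project(self.act(self.dwconv(self.expand(x))))

x = torch.randn(1, 48, 64, 64)
print(ConvFFN(48)(x).shape)  # torch.Size([1, 48, 64, 64])
```

A single such block only mixes information within a 3×3 neighborhood, which is where my receptive-field concern comes from.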

joshyZhou commented 1 day ago

Hi,

Here are some insights into our design approach:

As noted by Xiao et al. [1], early layers of transformers tend to focus on learning local patterns, which somewhat undermines the advantage of self-attention’s large receptive field. To address this, incorporating convolutional layers can be an efficient strategy, as they excel at capturing local patterns. This approach is also evident in other transformer-based models, such as FFTformer [2] and FPro [3].
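As a rough back-of-the-envelope illustration of this point, here is a small sketch computing the effective receptive field of a purely convolutional encoder. The per-level block counts and 2× downsampling are assumptions for the example, not the paper's exact configuration:

```python
def receptive_field(layers):
    """Effective receptive field (in input pixels) of a sequential stack,
    where each layer is a (kernel_size, stride) pair, using the standard
    recursion: rf += (k - 1) * jump; jump *= stride."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Hypothetical 3-level encoder: two FFN blocks (each with one 3x3 conv)
# per level, with 2x downsampling between levels.
stage = [(3, 1), (3, 1)]
encoder = stage + [(2, 2)] + stage + [(2, 2)] + stage
print(receptive_field(encoder))  # 32
```

Because of the downsampling between levels, each 3×3 conv at a deeper level mixes over a much larger region of the original input, so the receptive field grows quickly with depth even without attention.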

References:
[1] T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollár, and R. Girshick. "Early Convolutions Help Transformers See Better." In NeurIPS, 2021.
[2] L. Kong, J. Dong, J. Ge, M. Li, and J. Pan. "Efficient Frequency Domain-based Transformers for High-Quality Image Deblurring." In CVPR, 2023.
[3] S. Zhou, J. Pan, J. Shi, D. Chen, L. Qu, and J. Yang. "Seeing the Unseen: A Frequency Prompt Guided Transformer for Image Restoration." In ECCV, 2024.