Do Vision Transformers See Like Convolutional Neural Networks?

reyllama commented 2 years ago

TL; DR

ViT feature representations are less hierarchical.
Early tr blocks learn both local and global dependencies provided with large enough dataset.
Skip connections play much more important roles for ViTs.
ViTs with token preserve spatial information better (better spatial localization)
ViTs benefit more from scales.

Motivation

Empirically study the internal representations of ViT and ResNet, hoping for insightful findings of the general visual reasoning mechanism of the former.

Prior Works

Vision Transformer

Method

Conduct wide range of experiments on relationships between internal representations of ViT and ResNet.
For feature level similarity computation, resort to Centered Kernel Alignment(CKA) which employs HSIC (Hilbert-Schmidt Independence Criterion)

Experiments

Representation Similarities across Blocks

ViT learns CNN inductive biase from data

Not as successful when dataset size is limited (JFT-300M vs. ImageNet)
Fails to learn local attention with ImageNet (no JFT-300M)

ResNet learns only local information in lower layers

Skip Connections are crucial in ViTs

In early layers, token repr. are free-passed since not much abstraction has been made for higher level reasoning (clever)
In later layers, spatial tokens are relativelyfree-passed and much manipulation is made to the token.
Removing skip connections in middle layers greatly affects learning dynamics.

token enables better spatial localization

ViT scales well with more data

Comments

Comprehensive studies, revealing valuable insights for the working mechanism of Vision Transformers (and ConvNets).
The finding that ViTs learn local attention with enough data was very interesting, partially explaining why Trs. outperform CNNs only when equipped with huge training dataset.
This also proves that the inductive bias encoded in CNN architecture is quite effective.
Feature similarity computation with Centered Kernel Alignment (CKA) is memory-worthy.
Also, the importance of skip connections in ViTs is quite surprising. For ResNet, skip connections are mainly for gradient propagation, enabling training of deeper models.
But for ViTs, skip connections allow delayed formulation of representation. Working on repr. without enough context modeling and abstraction clearly leads to degraded performance.
Plus, I personally thought that ConvNets would be better spatial localizers as they have much spatial inductive bias embedded, but explicit tokenization with positional embeddings leads to better preservation of spatial information.
But, does better spatial localization explained here leads to better downstream outcomes, e.g. in object detection or localization? I've heard some counter-evidences. Wonder if there are studies on these topics as well.

heng-yuwen commented 2 years ago

Hi @reyllama, your summary is fantastic! I wonder if you know the exact way that the authors use to compute the attention distance? They mentioned pixel distance, weighted by the attention weights. However, I am not sure about what the pixel distance means.

reyllama commented 2 years ago

@123mutourener Thanks for your comment. Inferring from the paper, the authors are trying to analyze the spatial pattern of each attentional layer. Pixel distance, in my understanding, would refer to either Euclidean (l2) or Manhattan (l1) distance between the query point and the points it is currently attending to. That is, if our query point coordinate is (x1, y1), and it attends to two points (x2, y2) and (x3, y3) with weights 0.6 and 0.4 respectively, the attention distance would be something like 0.6*sqrt((x1-x2)^2+(y1-y2)^2) + 0.4*sqrt((x1-x3)^2+(y1-y3)^2). For exact details, of course we should consult the code base the authors used, which I believe is not publicly open yet, unfortunately.

reyllama / paper-reviews