reyllama / paper-reviews

(very personal) Deep learning literature review

Do Vision Transformers See Like Convolutional Neural Networks? #149

Open reyllama opened 2 years ago

reyllama commented 2 years ago

TL;DR

Motivation

Prior Works

Method


Experiments

Representation Similarities across Blocks


ViT learns CNN-like inductive biases from data



ResNet learns only local information in lower layers


Skip connections are crucial in ViTs


token enables better spatial localization


ViT scales well with more data


Comments

heng-yuwen commented 2 years ago

Hi @reyllama, your summary is fantastic! I wonder if you know exactly how the authors compute the attention distance? They mention pixel distance weighted by the attention weights, but I am not sure what the pixel distance means.

reyllama commented 2 years ago

@123mutourener Thanks for your comment. Inferring from the paper, the authors are analyzing the spatial pattern of each attention layer. Pixel distance, in my understanding, refers to either the Euclidean (L2) or Manhattan (L1) distance between the query point and the points it attends to. That is, if our query point is at (x1, y1) and it attends to two points (x2, y2) and (x3, y3) with weights 0.6 and 0.4 respectively, the attention distance would be something like 0.6*sqrt((x1-x2)^2+(y1-y2)^2) + 0.4*sqrt((x1-x3)^2+(y1-y3)^2). For the exact details we would have to consult the code base the authors used, which I believe is unfortunately not publicly available yet.
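
For reference, here is a minimal NumPy sketch of that weighted-pixel-distance idea. This is my own illustration, not the authors' implementation; the function name, the square patch grid, and the omission of the CLS token are all simplifying assumptions.

```python
import numpy as np

def mean_attention_distance(attn, grid_size, patch_size):
    """Average pixel distance between each query patch and the patches it
    attends to, weighted by the attention weights (one head, CLS token omitted)."""
    # Pixel coordinates of each patch (top-left corner, raster order); the
    # constant offset to the patch center cancels out in pairwise distances.
    coords = np.stack(np.meshgrid(np.arange(grid_size), np.arange(grid_size),
                                  indexing="ij"), axis=-1).reshape(-1, 2) * patch_size
    # Pairwise Euclidean (L2) distances between all patches, shape (N, N)
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Weight each query->key distance by its attention weight, sum over keys,
    # then average over queries (and, in practice, over heads and images too)
    return (attn * dists).sum(axis=-1).mean()

# Toy check: 14x14 patch grid (e.g. ViT-B/16 at 224x224) with uniform attention
n = 14 * 14
uniform_attn = np.full((n, n), 1.0 / n)
print(mean_attention_distance(uniform_attn, grid_size=14, patch_size=16))
```

With uniform attention this reduces to the average pairwise patch distance; a head that attends mostly to nearby patches gives a much smaller value, which is how the paper separates "local" from "global" heads when comparing lower ViT layers to ResNet.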