reyllama opened this issue 2 years ago
Hi @reyllama, your summary is fantastic! Do you happen to know exactly how the authors compute the attention distance? They mention a pixel distance weighted by the attention weights, but I am not sure what the pixel distance means.
@123mutourener Thanks for your comment. Inferring from the paper, the authors are analyzing the spatial pattern of each attention layer. Pixel distance, in my understanding, refers to the Euclidean (L2) or Manhattan (L1) distance between the query point and the points it attends to. For example, if the query point is at (x1, y1) and it attends to (x2, y2) and (x3, y3) with weights 0.6 and 0.4 respectively, the attention distance would be roughly 0.6*sqrt((x1-x2)^2+(y1-y2)^2) + 0.4*sqrt((x1-x3)^2+(y1-y3)^2). For the exact details we would need to consult the authors' code, which unfortunately does not seem to be publicly available yet.
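For concreteness, here is a minimal sketch of how such an attention-weighted distance could be computed for a single head, assuming a square patch grid and an attention matrix whose rows sum to 1. The function name, the choice of Euclidean distance in patch units, and the exclusion of the class token are my assumptions, not the authors' actual implementation:

```python
import numpy as np

def mean_attention_distance(attn, grid_size):
    """
    attn: (N, N) attention weights for one head, rows sum to 1,
          where the N = grid_size * grid_size tokens are image patches
          (class token assumed to be excluded).
    Returns the attention-weighted mean Euclidean distance, in patch units;
    multiply by the patch size (e.g. 16 px) to get a pixel distance.
    """
    # (row, col) coordinates of each patch on the grid
    coords = np.stack(np.meshgrid(np.arange(grid_size),
                                  np.arange(grid_size),
                                  indexing="ij"), axis=-1)
    coords = coords.reshape(-1, 2).astype(float)                      # (N, 2)

    # pairwise Euclidean distances between every query and key patch
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :],
                           axis=-1)                                   # (N, N)

    # weight each distance by the attention it receives, then average over queries
    return float((attn * dists).sum(axis=-1).mean())

# toy usage: uniform attention on a 14x14 patch grid (e.g. ViT-B/16 at 224x224)
N = 14 * 14
uniform_attn = np.full((N, N), 1.0 / N)
print(mean_attention_distance(uniform_attn, grid_size=14))
```

With uniform attention this gives the average pairwise patch distance; a head that attends mostly to nearby patches would yield a much smaller value, which is how the paper distinguishes "local" from "global" heads.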
TL;DR
Motivation
Prior Works
Method
Experiments
Representation Similarities across Blocks
ViT learns CNN inductive biases from data
ResNet learns only local information in lower layers
Skip Connections are crucial in ViTs
ViT scales well with more data
Comments