Open Tsai-chia-hsiang opened 10 months ago
I don't think all the queries are using the same sampling results. The points (offset) sampling is the same as the one in the original deformable attention: using a linear projection to generate offsets for each query, see line 124.
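As a rough sketch of that offset branch (all shapes and names here are hypothetical, just to illustrate a per-query linear projection like the one around line 124):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, K = 5, 8, 4          # n queries, feature dim d, K sampled points (toy values)
queries = rng.standard_normal((n, d))

# A linear projection maps each query feature to K (x, y) offsets,
# analogous to an nn.Linear(d, 2 * K) layer in the actual code.
W = rng.standard_normal((d, 2 * K)) * 0.01
b = np.zeros(2 * K)

offsets = (queries @ W + b).reshape(n, K, 2)  # one offset set per query
print(offsets.shape)  # (5, 4, 2)
```

Because the projection is applied row-wise, every query gets its own offset set even though the weight matrix is shared.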
Thanks for the reply. I've seen the code. But what I really want to ask is the mathematical equation in the p.4 eq(5) of the paper.
the formula: $\textbf{DeformAttn}(C_I, P_I, Y_I)=[A_I\cdot V_I]W^{o}_I$ (5)
if $A_I\in R^{n\times n'}$ and $V_I\in R^{n'\times 2d}$, where I assume the $\cdot$ operator is the conventional matrix multiplication (take $n'=4$ as an example):
$$A_I = \begin{bmatrix} a_{q_{10}} & a_{q_{11}} & a_{q_{12}} & a_{q_{13}} \\ \vdots & \vdots & \vdots & \vdots \\ a_{q_{n0}} & a_{q_{n1}} & a_{q_{n2}} & a_{q_{n3}} \end{bmatrix}_{n\times 4} \quad (\text{row } q = \textbf{query } q)$$
$$V_I=\begin{bmatrix} v_{1,1} & v_{1,2} & \cdots & v_{1,2d} \\ v_{2,1} & v_{2,2} & \cdots & v_{2,2d} \\ v_{3,1} & v_{3,2} & \cdots & v_{3,2d} \\ v_{4,1} & v_{4,2} & \cdots & v_{4,2d} \end{bmatrix}_{4 \times 2d}$$
and $A_I \times V_I=$
$$\begin{bmatrix} a_{\textbf{query}_1} \cdot V_I \\ \vdots \\ a_{\textbf{query}_n} \cdot V_I \end{bmatrix}_{n\times 2d}$$
and it seems that every query $q\in\{1,2,\dots,n\}$ just gets attention from the same $V_I$?
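That shared-$V_I$ reading can be checked numerically; a minimal sketch with toy shapes (not the paper's real dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)

n, n_pts, two_d = 3, 4, 6          # n queries, n' = 4 points, 2d = 6 channels (toy values)
A = rng.random((n, n_pts))
A /= A.sum(axis=1, keepdims=True)  # normalized rows: attention weights per query
V = rng.standard_normal((n_pts, two_d))

out = A @ V                        # every query row mixes the SAME V
print(out.shape)  # (3, 6)
```

Here row $q$ of `out` is exactly `A[q] @ V`, so all queries attend over one shared value matrix.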
Thanks.
Yes, the same context V.
But shouldn't it be a different $V_I$ for each query $q\in\{1,2,\dots,n\}$?
The original equation of one head $m$ of deformable attention from DEFORMABLE DETR: DEFORMABLE TRANSFORMERS FOR END-TO-END OBJECT DETECTION is:
$\displaystyle\sum_{k=1}^{K}A_{mqk}W'_m x(p_q+\Delta p_{mqk})$, so the $V_I$ should be different.
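A minimal sketch of that single-head formula, assuming integer offsets with a nearest-neighbour lookup instead of the paper's bilinear interpolation (all sizes are toy values, all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

H = W_sp = 8                                  # toy feature-map height/width
C, K, n = 6, 4, 3                             # channels, K sampled points, n queries
x = rng.standard_normal((H, W_sp, C))         # feature map x
Wm = rng.standard_normal((C, C)) * 0.1        # value projection W'_m for one head m
A = rng.random((n, K))
A /= A.sum(axis=1, keepdims=True)             # attention weights A_mqk
p = rng.integers(0, H, size=(n, 2))           # reference points p_q (integer, for simplicity)
dp = rng.integers(-1, 2, size=(n, K, 2))      # offsets Delta p_mqk, one set PER QUERY

out = np.zeros((n, C))
for q in range(n):
    for k in range(K):
        yy, xx = np.clip(p[q] + dp[q, k], 0, H - 1)  # sample location, clamped to the map
        out[q] += A[q, k] * (x[yy, xx] @ Wm)         # A_mqk * W'_m x(p_q + Delta p_mqk)
print(out.shape)  # (3, 6)
```

The key point matches the thread: the inner sum gathers a different set of K values per query before weighting.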
$$\textbf{query}_q = \begin{bmatrix} a_{q_0} & a_{q_1} & a_{q_2} & a_{q_3} \end{bmatrix}_{1\times 4}$$
attends over a unique $V_{I_q}\in R^{n'\times 2d}$,
i.e. $V_I$ should have a shape like $(n \times n'\times 2d)$, where $n$ is the number of queries and each query has its own $(n'\times 2d)$ sampled points.
and maybe the deformable attention should be written as something like $A_{n \times n'} \cdot V_{I_{n \times n' \times 2d}}$ instead of $A_{n \times n'}\cdot V_{I_{n' \times 2d}}$?
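Under that reading, the contraction would be batched per query, e.g. with `einsum` (toy shapes, hypothetical names):

```python
import numpy as np

rng = np.random.default_rng(2)

n, n_pts, two_d = 3, 4, 6                    # toy sizes: n queries, n' points, 2d channels
A = rng.random((n, n_pts))                   # one weight row per query
V = rng.standard_normal((n, n_pts, two_d))   # a DIFFERENT V slice per query

# out[q] = sum_k A[q, k] * V[q, k, :]  -- the per-query sum over n' sampled points
out = np.einsum('qk,qkd->qd', A, V)
print(out.shape)  # (3, 6)
```

Each output row only touches its own slice `V[q]`, which is the $(n \times n' \times 2d)$ shape argued for above.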
Oh yes, if you take V to be the one before point sampling, it is the same V for all the queries. In deformable attention, each query then gets a different V based on its own sampled point set (so the Vs are different for each query).
"if you think the V is the one before points sampling, it is the same V for all the queries".
I think the $V_I$ in eq (5) is AFTER the point-sampling step.
So I guess this code uses the original Deformable DETR attention, where each query gets a unique $V_q$, and the meaning of eq (5):
$A_{n \times n'}\cdot V_{n'\times 2d}$
should be treated as
$A_{n \times n'}\cdot V_{n\times n'\times 2d}$
— does that make better sense?
Or does this paper intend to use the same $V_I$ for all $q \in \{1, 2, \dots, n\}$, unlike DEFORMABLE DETR: DEFORMABLE TRANSFORMERS FOR END-TO-END OBJECT DETECTION?
Yes, you are right: we use the same V scheme as the original Deformable DETR, so eq (5) should be written as a summation of dot products (over the $n'$ points of each query). Thank you for pointing this out!
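The corrected reading of eq (5) as a per-query summation can be sketched like this (toy shapes; `V` here stands for the per-query sampled values, a hypothetical name):

```python
import numpy as np

rng = np.random.default_rng(4)

n, n_pts, two_d = 3, 4, 6                    # toy sizes: n queries, n' points, 2d channels
A = rng.random((n, n_pts))
V = rng.standard_normal((n, n_pts, two_d))   # per-query sampled values, as in Deformable DETR

# eq (5) read as a summation of dot products over the n' points of each query:
out_sum = np.stack([sum(A[q, k] * V[q, k] for k in range(n_pts)) for q in range(n)])
print(out_sum.shape)  # (3, 6)
```

Writing it as an explicit sum makes clear that the plain $n \times n'$ by $n' \times 2d$ matrix product in the paper's notation only covers the shared-V case.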
Ok, I see. Thanks ~
Hi, I would like to ask about the Deformable Attention mechanism in the paper.
I went through the paper DEFORMABLE DETR: DEFORMABLE TRANSFORMERS FOR END-TO-END OBJECT DETECTION, and there the deformable attention computes different sampled points $k$ for each query.
However, this paper directly uses $A_I\times V_I$, and it seems that all queries use the same $n'$ sampling results.
I am a little bit confused about that.
Thank you for your answer.