Open Tsai-chia-hsiang opened 10 months ago
I don't think all the queries are using the same sampling results. The points (offset) sampling is the same as the one in the original deformable attention: using a linear projection to generate offsets for each query, see line 124.
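As a rough sketch of that offset branch (all shapes and names here are hypothetical, just to illustrate a per-query linear projection like the one around line 124):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, K = 5, 8, 4          # n queries, feature dim d, K sampled points (toy values)
queries = rng.standard_normal((n, d))

# A linear projection maps each query feature to K (x, y) offsets,
# analogous to an nn.Linear(d, 2 * K) layer in the actual code.
W = rng.standard_normal((d, 2 * K)) * 0.01
b = np.zeros(2 * K)

offsets = (queries @ W + b).reshape(n, K, 2)  # one offset set per query
print(offsets.shape)  # (5, 4, 2)
```

Because the projection is applied row-wise, every query gets its own offset set even though the weight matrix is shared.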
Thanks for the reply. I've seen the code. But what I really want to ask is the mathematical equation in the p.4 eq(5) of the paper.
the formula: $\textbf{DeformAttn}(C_I, P_I, Y_I)=[A_I\cdot V_I]W^{o}_I$ (5)
if $A_I\in R^{n\times n'}$ and $V_I\in R^{n'\times 2d}$, where I assume the $\cdot$ operator is the conventional matrix multiplication (take $n'=4$ as an example):
$$A_I = \begin{bmatrix} a_{q_{10}} & a_{q_{11}} & a_{q_{12}} & a_{q_{13}} \\ \vdots & \vdots & \vdots & \vdots \\ a_{q_{n0}} & a_{q_{n1}} & a_{q_{n2}} & a_{q_{n3}} \end{bmatrix}_{n\times 4} \quad (\text{row } q = \textbf{query } q)$$
$$V_I=\begin{bmatrix} v_{1,1} & v_{1,2} & \cdots & v_{1,2d} \\ v_{2,1} & v_{2,2} & \cdots & v_{2,2d} \\ v_{3,1} & v_{3,2} & \cdots & v_{3,2d} \\ v_{4,1} & v_{4,2} & \cdots & v_{4,2d} \end{bmatrix}_{4 \times 2d}$$
and $A_I \times V_I=$
$$\begin{bmatrix} a_{\textbf{query}_1} \cdot V_I \\ \vdots \\ a_{\textbf{query}_n} \cdot V_I \end{bmatrix}_{n\times 2d}$$
and it seems that every query $q\in\{1,2,\dots,n\}$ just gets attention from the same $V_I$?
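That shared-$V_I$ reading can be checked numerically; a minimal sketch with toy shapes (not the paper's real dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)

n, n_pts, two_d = 3, 4, 6          # n queries, n' = 4 points, 2d = 6 channels (toy values)
A = rng.random((n, n_pts))
A /= A.sum(axis=1, keepdims=True)  # normalized rows: attention weights per query
V = rng.standard_normal((n_pts, two_d))

out = A @ V                        # every query row mixes the SAME V
print(out.shape)  # (3, 6)
```

Here row $q$ of `out` is exactly `A[q] @ V`, so all queries attend over one shared value matrix.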
Thanks.
Yes, the same context V.
But shouldn't it be a different $V_I$ for each query $q\in\{1,2,\dots,n\}$?
The original equation of one head $m$ of deformable attention from DEFORMABLE DETR: DEFORMABLE TRANSFORMERS FOR END-TO-END OBJECT DETECTION is:
$\displaystyle\sum_{k=1}^{K}A_{mqk}W'_m x(p_q+\Delta p_{mqk})$, so the $V_I$ should be different.
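A minimal sketch of that single-head formula, assuming integer offsets with a nearest-neighbour lookup instead of the paper's bilinear interpolation (all sizes are toy values, all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

H = W_sp = 8                                  # toy feature-map height/width
C, K, n = 6, 4, 3                             # channels, K sampled points, n queries
x = rng.standard_normal((H, W_sp, C))         # feature map x
Wm = rng.standard_normal((C, C)) * 0.1        # value projection W'_m for one head m
A = rng.random((n, K))
A /= A.sum(axis=1, keepdims=True)             # attention weights A_mqk
p = rng.integers(0, H, size=(n, 2))           # reference points p_q (integer, for simplicity)
dp = rng.integers(-1, 2, size=(n, K, 2))      # offsets Delta p_mqk, one set PER QUERY

out = np.zeros((n, C))
for q in range(n):
    for k in range(K):
        yy, xx = np.clip(p[q] + dp[q, k], 0, H - 1)  # sample location, clamped to the map
        out[q] += A[q, k] * (x[yy, xx] @ Wm)         # A_mqk * W'_m x(p_q + Delta p_mqk)
print(out.shape)  # (3, 6)
```

The key point matches the thread: the inner sum gathers a different set of K values per query before weighting.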
$$\textbf{query}_q = \begin{bmatrix} a_{q_0} & a_{q_1} & a_{q_2} & a_{q_3} \end{bmatrix}_{1\times 4}$$
attends over a unique $V_{I_q}\in R^{n'\times 2d}$,
i.e. $V_I$ should have a shape like $(n \times n'\times 2d)$, where $n$ is the number of queries and each query has its own $(n'\times 2d)$ sampled points.
and maybe the deformable attention should be written as something like $A_{n \times n'} \cdot V_{I_{n \times n' \times 2d}}$ instead of $A_{n \times n'}\cdot V_{I_{n' \times 2d}}$?
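Under that reading, the contraction would be batched per query, e.g. with `einsum` (toy shapes, hypothetical names):

```python
import numpy as np

rng = np.random.default_rng(2)

n, n_pts, two_d = 3, 4, 6                    # toy sizes: n queries, n' points, 2d channels
A = rng.random((n, n_pts))                   # one weight row per query
V = rng.standard_normal((n, n_pts, two_d))   # a DIFFERENT V slice per query

# out[q] = sum_k A[q, k] * V[q, k, :]  -- the per-query sum over n' sampled points
out = np.einsum('qk,qkd->qd', A, V)
print(out.shape)  # (3, 6)
```

Each output row only touches its own slice `V[q]`, which is the $(n \times n' \times 2d)$ shape argued for above.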
Oh yes, if you take V to be the one before point sampling, it is the same V for all the queries. In deformable attention, each query then gets a different V based on its own sampled point set (so the Vs are different for each query).
"if you think the V is the one before points sampling, it is the same V for all the queries".
I think the $V_I$ in eq (5) is AFTER the point-sampling step.
So I guess this code uses the original Deformable DETR attention, where each query gets a unique $V_q$, and the meaning of eq (5):
$A_{n \times n'}\cdot V_{n'\times 2d}$
should be treated as
$A_{n \times n'}\cdot V_{n\times n'\times 2d}$
— does that make better sense?
Or does this paper intend to use the same $V_I$ for all $q \in \{1, 2, \dots, n\}$, unlike DEFORMABLE DETR: DEFORMABLE TRANSFORMERS FOR END-TO-END OBJECT DETECTION?
Yes, you are right: we use the same V scheme as the original Deformable DETR, so eq (5) should be written as a summation of dot products (over the $n'$ points of each query). Thank you for pointing this out!
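The corrected reading of eq (5) as a per-query summation can be sketched like this (toy shapes; `V` here stands for the per-query sampled values, a hypothetical name):

```python
import numpy as np

rng = np.random.default_rng(4)

n, n_pts, two_d = 3, 4, 6                    # toy sizes: n queries, n' points, 2d channels
A = rng.random((n, n_pts))
V = rng.standard_normal((n, n_pts, two_d))   # per-query sampled values, as in Deformable DETR

# eq (5) read as a summation of dot products over the n' points of each query:
out_sum = np.stack([sum(A[q, k] * V[q, k] for k in range(n_pts)) for q in range(n)])
print(out_sum.shape)  # (3, 6)
```

Writing it as an explicit sum makes clear that the plain $n \times n'$ by $n' \times 2d$ matrix product in the paper's notation only covers the shared-V case.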
Ok, I see. Thanks ~
Hi, I would like to ask about the Deformable Attention mechanism in the paper.
I went through the paper DEFORMABLE DETR: DEFORMABLE TRANSFORMERS FOR END-TO-END OBJECT DETECTION, and there the deformable attention computes different sampled points $k$ for each query.
However, this paper directly uses $A_I\times V_I$, and it seems that all queries use the same $n'$ sampling results.
I am a little bit confused about that.
Thank you for your answer.