hustvl / MapTR

[ICLR'23 Spotlight & IJCV'24] MapTR: Structured Modeling and Learning for Online Vectorized HD Map Construction
MIT License

Some clarifications on the architecture #4

Closed uriyapes closed 1 year ago

uriyapes commented 2 years ago

Hi, I've just read the paper, and the idea and results are impressive. I know you haven't released the source code yet, but since I want to present this paper at my workplace, I would like to better understand the architecture of your network. I have two questions:

  1. What are the dimensions of the queries in the map decoder? I understand there are Nv point-level queries and N instance-level queries, and that for each instance query i we add the different point queries to get the hierarchical queries, so in the end there are N × Nv hierarchical queries. What is the dimension of each hierarchical query q^{hie}_{ij}?
  2. In the map decoder you say that each query q^{hie}_{ij} predicts the 2-dimensional normalized BEV coordinate (x_{ij}, y_{ij}) of the reference point p_{ij}. Is this done as described in the Deformable DETR paper? That is, for each object query, the 2-d normalized coordinates of the reference point are predicted from its query embedding via a learnable linear projection followed by a sigmoid function.

Many thanks in advance.

LegendBC commented 2 years ago

Hi @uriyapes , thanks for your interest in our work and good questions!

What is the dimension of each hierarchical query q^{hie}_{ij}?

The dimension of each query is set to 256.
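
A minimal PyTorch sketch (sizes and names are illustrative, not taken from the released code): since the point-level and instance-level queries share the same dimension, the broadcast addition keeps each hierarchical query at 256-d.

```python
import torch

N, Nv, C = 50, 20, 256  # illustrative: instances, points per instance, query dim

# Instance-level and point-level queries share the same dimension C, so the
# hierarchical query is their broadcast sum and stays C-dimensional.
instance_q = torch.randn(N, C)
point_q = torch.randn(Nv, C)
hierarchical_q = instance_q[:, None, :] + point_q[None, :, :]  # (N, Nv, C)
print(hierarchical_q.shape)  # torch.Size([50, 20, 256])
```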

Is this done as described in the Deformable DETR paper?

Yes. We have not changed the intrinsic mechanism of Deformable Attention.
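
For reference, a hedged sketch of the Deformable-DETR-style reference point head (names are illustrative): each query embedding is mapped to a 2-d point by a learnable linear projection, and a sigmoid normalizes it into [0, 1] BEV coordinates.

```python
import torch
import torch.nn as nn

N, Nv, C = 50, 20, 256
# Learnable linear projection followed by a sigmoid, as in Deformable DETR.
reference_point_head = nn.Linear(C, 2)

queries = torch.randn(N * Nv, C)  # flattened hierarchical queries
reference_points = reference_point_head(queries).sigmoid()  # (N*Nv, 2), (x, y) in [0, 1]
```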

LengYu commented 1 year ago

I have two questions:

  1. “In each decoder layer, we adopt MHSA to make hierarchical queries exchange information with each other (both inter-instance and intra-instance)” So, do the inter-instance and intra-instance interactions happen in one MHSA layer, or in two separate layers?

  2. The initial queries are learnable parameters, right?

LegendBC commented 1 year ago

So, do the inter-instance and intra-instance interactions happen in one MHSA layer, or in two separate layers?

We perform both interactions in a single MHSA layer: all the hierarchical queries attend to each other in one self-attention pass, which covers intra-instance and inter-instance exchange at once.
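
A minimal sketch of this, assuming the N × Nv hierarchical queries are flattened into one sequence (illustrative shapes, not the exact released code):

```python
import torch
import torch.nn as nn

N, Nv, C = 50, 20, 256
queries = torch.randn(N * Nv, 1, C)  # (sequence, batch, dim) layout

# One self-attention layer over all N*Nv flattened queries: every query can
# attend to every other, so intra-instance and inter-instance exchange happen
# within the same layer.
mhsa = nn.MultiheadAttention(embed_dim=C, num_heads=8)
updated, _ = mhsa(queries, queries, queries)  # (N*Nv, 1, C)
```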

The initial queries are learnable parameters, right?

Yes, the initial queries are learnable embeddings.
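
As a hedged sketch (illustrative names), the learnable query embeddings could be declared like this, randomly initialized and updated during training:

```python
import torch.nn as nn

N, Nv, C = 50, 20, 256
# Learnable instance-level and point-level query embeddings; their weight
# matrices play the role of instance_q / point_q in the earlier sketch.
instance_embed = nn.Embedding(N, C)
point_embed = nn.Embedding(Nv, C)
```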

We have also released an initial version of MapTR; you can refer to the code for more details. I'm closing this issue, but let us know if you have further questions.