Hi,

Thanks for uploading the code and for a great paper. I have a few questions about the method: I've been reading the paper but found it difficult to work out the implementation from the codebase.
The STQI decoder has a `DynConv` layer in each of the `N_H` STQI heads. Is this `DynConv` layer within each STQI head the same as in QueryInst, i.e. `q_t <- DynConv_box(p_box, q_{t-1})`, where `p_box` are the RoI-pooled instance features?
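To make sure I'm asking about the right operation, here is a minimal sketch of what I understand `DynConv_box(p_box, q_{t-1})` to be, i.e. the parameter-generation style interaction from Sparse R-CNN / QueryInst. All names, dimensions and the final projection are my assumptions, not taken from your repo:

```python
import torch
import torch.nn as nn

class DynConvBox(nn.Module):
    """Sketch of a QueryInst-style dynamic box interaction.

    The instance query q generates the weights of two per-instance 1x1 convs,
    which are applied to the RoI-pooled feature p_box; the result is projected
    back to the query dimension to give the updated query q_t.
    """

    def __init__(self, dim=256, dim_dyn=64, roi_size=7):
        super().__init__()
        self.dim, self.dim_dyn = dim, dim_dyn
        # query -> parameters of the two dynamic layers
        self.param_gen = nn.Linear(dim, 2 * dim * dim_dyn)
        self.norm1 = nn.LayerNorm(dim_dyn)
        self.norm2 = nn.LayerNorm(dim)
        self.act = nn.ReLU(inplace=True)
        self.out_proj = nn.Linear(roi_size * roi_size * dim, dim)

    def forward(self, p_box, q):
        # p_box: (N, S*S, dim) RoI-pooled instance features
        # q:     (N, dim)      instance query from the previous stage, q_{t-1}
        params = self.param_gen(q)
        w1 = params[:, : self.dim * self.dim_dyn].view(-1, self.dim, self.dim_dyn)
        w2 = params[:, self.dim * self.dim_dyn:].view(-1, self.dim_dyn, self.dim)
        x = self.act(self.norm1(torch.bmm(p_box, w1)))  # (N, S*S, dim_dyn)
        x = self.act(self.norm2(torch.bmm(x, w2)))      # (N, S*S, dim)
        return self.out_proj(x.flatten(1))              # updated query q_t, (N, dim)
```

Is this roughly the operation inside each STQI head, or does your `DynConv` differ from this?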
In the STQI figure in the paper (Figure 1) there is only one dynamic conv per head, whereas QueryInst has both a dynamic mask and a dynamic box layer at every stage. Can you confirm that STQI uses only `DynConv_box`?
The features from either MsgShiftT or Swin are multi-scale. How are the multi-resolution features handled in `DynConv_box` / `DynConv_mask`? I couldn't find this information in the manuscript. Do you make predictions at every scale, as in an FPN?
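For context, what I would naively expect is the standard FPN heuristic that assigns each RoI to a single feature level before pooling (Lin et al. 2017), rather than per-scale predictions. A rough sketch of that alternative, with the usual constants as my assumption (not from your code):

```python
import torch

def map_rois_to_fpn_level(boxes, k0=4, canonical=224, k_min=2, k_max=5):
    """Assign each RoI to one FPN level by its scale: k = floor(k0 + log2(sqrt(wh)/224)).
    boxes: (N, 4) tensor in (x1, y1, x2, y2) format."""
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    levels = torch.floor(k0 + torch.log2(torch.sqrt(w * h) / canonical + 1e-6))
    return levels.clamp(min=k_min, max=k_max).long()
```

Is it something like this single-level assignment feeding `DynConv_box`, or do the queries interact with all scales?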
Do the `N_H` STQI heads replace the 6 stages you would have in QueryInst?
Many thanks!