Hi,

Thanks for uploading the code and for a great paper. I have a few questions about the method: I've been reading the paper but found it difficult to work out the implementation from the codebase.
The STQI decoder has a `DynConv` layer in each of the `N_H` STQI heads. Is this `DynConv` layer within each STQI head the same as in QueryInst, i.e. `q_t <- DynConv_box(p_box, q_{t-1})`, where `p_box` are the RoI-pooled instance features?
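To make sure I'm asking about the right operation, here is a minimal sketch of what I understand `DynConv_box(p_box, q_{t-1})` to be, i.e. the parameter-generation style interaction from Sparse R-CNN / QueryInst. All names, dimensions and the final projection are my assumptions, not taken from your repo:

```python
import torch
import torch.nn as nn

class DynConvBox(nn.Module):
    """Sketch of a QueryInst-style dynamic box interaction.

    The instance query q generates the weights of two per-instance 1x1 convs,
    which are applied to the RoI-pooled feature p_box; the result is projected
    back to the query dimension to give the updated query q_t.
    """

    def __init__(self, dim=256, dim_dyn=64, roi_size=7):
        super().__init__()
        self.dim, self.dim_dyn = dim, dim_dyn
        # query -> parameters of the two dynamic layers
        self.param_gen = nn.Linear(dim, 2 * dim * dim_dyn)
        self.norm1 = nn.LayerNorm(dim_dyn)
        self.norm2 = nn.LayerNorm(dim)
        self.act = nn.ReLU(inplace=True)
        self.out_proj = nn.Linear(roi_size * roi_size * dim, dim)

    def forward(self, p_box, q):
        # p_box: (N, S*S, dim) RoI-pooled instance features
        # q:     (N, dim)      instance query from the previous stage, q_{t-1}
        params = self.param_gen(q)
        w1 = params[:, : self.dim * self.dim_dyn].view(-1, self.dim, self.dim_dyn)
        w2 = params[:, self.dim * self.dim_dyn:].view(-1, self.dim_dyn, self.dim)
        x = self.act(self.norm1(torch.bmm(p_box, w1)))  # (N, S*S, dim_dyn)
        x = self.act(self.norm2(torch.bmm(x, w2)))      # (N, S*S, dim)
        return self.out_proj(x.flatten(1))              # updated query q_t, (N, dim)
```

Is this roughly the operation inside each STQI head, or does your `DynConv` differ from this?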
In the STQI figure in the paper (Figure 1) there is only one dynamic conv per head, whereas QueryInst has both a dynamic mask and a dynamic box layer at every stage. Can you confirm that STQI uses only `DynConv_box`?
The features from either MsgShiftT or Swin are multi-scale. How are the multi-resolution features handled in `DynConv_box` / `DynConv_mask`? I couldn't find this information in the manuscript. Do you make predictions at every scale, as in an FPN?
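For context, what I would naively expect is the standard FPN heuristic that assigns each RoI to a single feature level before pooling (Lin et al. 2017), rather than per-scale predictions. A rough sketch of that alternative, with the usual constants as my assumption (not from your code):

```python
import torch

def map_rois_to_fpn_level(boxes, k0=4, canonical=224, k_min=2, k_max=5):
    """Assign each RoI to one FPN level by its scale: k = floor(k0 + log2(sqrt(wh)/224)).
    boxes: (N, 4) tensor in (x1, y1, x2, y2) format."""
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    levels = torch.floor(k0 + torch.log2(torch.sqrt(w * h) / canonical + 1e-6))
    return levels.clamp(min=k_min, max=k_max).long()
```

Is it something like this single-level assignment feeding `DynConv_box`, or do the queries interact with all scales?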
Do the `N_H` STQI heads replace the 6 stages you would have in QueryInst?
Many thanks!