hustvl / TeViT

Temporally Efficient Vision Transformer for Video Instance Segmentation, CVPR 2022, Oral
https://arxiv.org/abs/2204.08412
MIT License

Question about STQI head method and implementation #11

Closed fbragman closed 1 year ago

fbragman commented 1 year ago

Hi,

Thanks for uploading the code and for a great paper. I have a few questions about the method: I have been reading the paper, but I found it difficult to work out from the codebase how it is implemented.

  1. The STQI decoder has a DynConv layer in each of the N_H STQI heads. Is the DynConv layer within each STQI head the same as in QueryInst, i.e. q_t <- DynConv_box(p_box, q_{t-1}), where p_box are the RoI-pooled instance features?
  2. In the STQI figure in the paper (Figure 1), there is just one Dynamic Conv per head. In QueryInst, each stage has both a dynamic mask layer and a dynamic box layer. Can you confirm that STQI uses only DynConv_box?
  3. The features from either MsgShiftT or Swin are multi-scale. How are the multi-resolution features handled in DynConv_box or DynConv_mask? I can't find this information in the manuscript. Do you make predictions at every scale, as in an FPN network?
  4. Do the N_H STQI heads replace the 6 stages that QueryInst would have?
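
To make question 1 concrete, here is a minimal NumPy sketch of the update q_t <- DynConv_box(p_box, q_{t-1}) as I understand it from QueryInst. It is my own simplified reading, not code from either repository: the names `dyn_conv_box`, `w_gen` and `w_out` are hypothetical, the weights are random, and the layer normalisation used in practice is left out for brevity.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def dyn_conv_box(p_box, q_prev, w_gen, w_out, d_dyn):
    """One query-conditioned update: q_t <- DynConv_box(p_box, q_{t-1}).

    p_box:  (S*S, d)         RoI-pooled features of one instance
    q_prev: (d,)             instance query from the previous stage/head
    w_gen:  (d, 2*d*d_dyn)   hypothetical weights that generate the dynamic filters
    w_out:  (S*S*d, d)       hypothetical output projection
    """
    d = q_prev.shape[0]
    # The query generates the parameters of two 1x1 "dynamic" filters.
    params = q_prev @ w_gen                        # (2*d*d_dyn,)
    k1 = params[: d * d_dyn].reshape(d, d_dyn)     # d -> d_dyn
    k2 = params[d * d_dyn:].reshape(d_dyn, d)      # d_dyn -> d
    # The RoI features are filtered by the query-specific kernels.
    h = relu(p_box @ k1)                           # (S*S, d_dyn)
    h = relu(h @ k2)                               # (S*S, d)
    # Flatten and project back to a single query vector.
    return relu(h.reshape(-1) @ w_out)             # (d,)


# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d, d_dyn, S = 8, 4, 3
p_box = rng.standard_normal((S * S, d))
w_gen = 0.1 * rng.standard_normal((d, 2 * d * d_dyn))
w_out = 0.1 * rng.standard_normal((S * S * d, d))

q = rng.standard_normal(d)
for _ in range(3):  # three successive stages/heads, each refining the query
    # Assumption: p_box is kept fixed here; in a real model it would be
    # re-pooled from the refined boxes at every stage.
    q = dyn_conv_box(p_box, q, w_gen, w_out, d_dyn)
```

In this sketch, each pass through the loop plays the role of one stage (or one STQI head, if question 4 is answered with yes), and the query that comes out of one stage is the input to the next.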

Many thanks!