mit-han-lab / temporal-shift-module

[ICCV 2019] TSM: Temporal Shift Module for Efficient Video Understanding
https://arxiv.org/abs/1811.08383
MIT License

Temporal shifts at multiple blocks / video object detection #137

Open alphadadajuju opened 4 years ago

alphadadajuju commented 4 years ago

First of all, thank you for releasing this amazing work + trained models that are not easily obtainable by many researchers!

I do have some conceptual questions regarding your work (the questions themselves may not be that closely related to each other). I hope they make sense:

  1. If the objective of TSM is to partially shuffle 2D spatial features for temporal cues, why not consider inserting the temporal shift only at the last block/layer of ResNet?

  2. (Related to 1.) Say at block 1 of ResNet a partially shifted feature is obtained (which in theory should already account for temporal cues from multiple frames); what then is the intuition behind shifting again at block 2 (3, 4, and so on)?

  3. In the case of TSM for video object detection, I read in another issue that 8 frames were used to train the model. I wonder: are the ground truths (class, box coordinates) of ALL 8 frames considered? Or did you perform some temporal pooling technique and only use the ground truth of a particular frame?
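For context on questions 1 and 2: the shift being discussed moves a small fraction of channels one step forward or backward along the time axis before each block's convolution, so stacking shifts across blocks progressively widens the temporal receptive field. A minimal NumPy sketch of that operation, following the paper's description (the `fold_div=8` ratio and zero-padded boundaries are from the paper; the exact repository implementation may differ):

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """Shift a fraction of channels along the time axis.

    x: feature tensor of shape (T, C, H, W) for one video clip.
    1/fold_div of the channels are shifted toward the past,
    another 1/fold_div toward the future, the rest untouched.
    Boundary frames are zero-padded, as described in the paper.
    """
    t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                      # shift backward in time
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]      # shift forward in time
    out[:, 2 * fold:] = x[:, 2 * fold:]                 # unshifted channels
    return out
```

Because only 2/fold_div of the channels move at each block, a single shift mixes information from immediately adjacent frames only; repeating it at every residual block lets later layers see temporal context several frames away, which speaks to why shifting solely at the last block would be much weaker.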

Thank you in advance!

ahamid123 commented 1 year ago

Hi, I am interested in point 3. Do you have an answer?