First of all, thank you for releasing this amazing work and the trained models, which are not easily obtainable by many researchers!
I do have some conceptual questions regarding your work (the questions themselves may not be that closely related ... ). I hope they make sense:
1. If the objective of TSM is to partially shift 2D spatial features along the temporal dimension to capture temporal cues, why not insert the temporal shift only at the last block/layer of ResNet?
2. (Related to 1.) Say a partially shifted feature is obtained at block 1 of ResNet (which in theory should already take into account temporal cues from multiple frames). What is the intuition behind shifting again at blocks 2, 3, 4, and so on?
3. In the case of TSM for video object detection, I read in another issue that 8 frames were used to train the model. Are the ground truths (class, box coordinates) of all 8 frames considered? Or did you apply some temporal pooling technique and use only the ground truth of a particular frame?
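To make the first two questions concrete, here is a minimal sketch of the shift operation as I understand it. It is not taken from your code: the function name `temporal_shift`, the `fold_div=8` default, and the simplification of each frame to a flat list of channel values (no spatial dimensions) are all my own assumptions for illustration.

```python
def temporal_shift(frames, fold_div=8):
    """Sketch of a partial temporal shift (assumed behaviour, not the official code).

    frames: list of T per-frame feature vectors, each a list of C channel values.
    The first C // fold_div channels are taken from the next frame, the second
    C // fold_div channels from the previous frame, and the rest are untouched.
    Positions with no source frame are zero-padded.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div

    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for ch in range(num_channels):
            if ch < fold:
                src = t + 1      # channel group 1: read from the next frame
            elif ch < 2 * fold:
                src = t - 1      # channel group 2: read from the previous frame
            else:
                src = t          # remaining channels: no shift
            if 0 <= src < num_frames:
                shifted[t][ch] = frames[src][ch]
    return shifted


# Toy input: 3 frames, 8 channels, value = 10 * frame_index + channel_index
frames = [[float(10 * t + ch) for ch in range(8)] for t in range(3)]
out = temporal_shift(frames)
```

In this sketch a single shift only mixes each frame with its immediate neighbours, which is the behaviour my questions about shifting in one block versus every block are based on.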
Thank you in advance!