Relation between CLIP-X and IFC (Nips 21), TEViT (cvpr22) and ReferFormer (cvpr 22)

XLHappy123 commented 2 years ago

Hi,

I would like to ask what is the relation between your proposed cross-frame attention and the one in IFC [1] and TEViT [2], I consider none of the above papers is cited. In addition, the text token to ReferFormer (cvpr 22).

As the cross-frame communication transformer is considered as a major contribution of the paper, I need to raise a AIV concern.

[1] Video Instance Segmentation using Inter-Frame Communication Transformers [2] Temporally Efficient Vision Transformer for Video Instance Segmentation

nbl97 commented 2 years ago

Thanks for your interest in our work. The three papers you mentioned are all about video segmentation tasks, but our work focuses on video classification, which is a little bit different. We did NOT notice these papers before the submission to ECCV'22 (March 2022). The difference and relationship to the specific papers you mentioned are provided as follows:

ReferFormer[1]:
- Difference: Using text information in the video domain is a common idea, many recent works [4,5,6] consider leveraging textual info for video modeling. The motivation and technical details are different from ReferFormer. Specifically, ReferFormer uses text as reference to find the most relevant regions for video object segmentation. In contrast, the text in our method can be treated as a classification head and our proposed video-specific prompting is to enhance the classification head for better discriminability by leveraging video content information.
IFC[2] and TeViT[3]:
- Timeline: We first try the idea of message token in 2021/5/4. IFC was available on arXiv on 2021/6/7(https://arxiv.org/abs/2106.03299), and TeViT was on 2022/4/18(https://arxiv.org/abs/2204.08412 , even after ECCV'22 submission deadline). It can be seen that our work is independent of these two works from the following figures.
  
  [Git commits of the idea of message token. The implementation in X-CLIP is an extension of v3, while IFC and TeViT employ v1.]
  
  [arXiv submission history of IFC[2]]
  
  [arXiv submission history of TeViT[3]]
- Relationship: Our method differs from the above works in two aspects: 1) The message tokens in our work are built in an online fashion in each layer, while IFC predefines the memory tokens and the memory tokens go through the whole network. 2) Compared to the Encode-Receive(attn+FFN) in IFC, our cross-frame fusion attention is not followed by an FFN layer and still works well. We attributed this superiority to the online generation of message tokens. The key idea of our message token is to leverage the strong representation ability of pretrained models while not destroying the input patterns as much as possible, which is essentially different from the other works.

If you have further interest, we can provide the detailed git commits, submission history, and experiment logs of our work, which is independent of the others you mentioned while having fundamental differences. My research work during my internship at Microsoft had a very detailed and complete record. Please e-mail me if you need more information (email: nibolin2019@ia.ac.cn). Anyway, thanks for your attention and reminder. We will consider discussing the papers you mentioned in future work. Thanks.

[1] Language as Queries for Referring Video Object Segmentation. In CVPR 2022. [2] Video Instance Segmentation using Inter-Frame Communication Transformers. In NeurIPS 2021. [3] Temporally Efficient Vision Transformer for Video Instance Segmentation. In CVPR 2022. [4] Multi-Scale 2D Temporal Adjacent Networks for Moment Localization with Natural Language. In T-PAMI 2021. [5] Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling. In CVPR 2021. [6] CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. arXiv preprint.

nbl97 commented 2 years ago

If there is no other question, I will close this issue. Pls feel free to ping me through email: nibolin2019@ia.ac.cn

microsoft / VideoX

Relation between CLIP-X and IFC (Nips 21), TEViT (cvpr22) and ReferFormer (cvpr 22) #56