microsoft / VideoX

VideoX: a collection of video cross-modal models

Relation between X-CLIP and IFC (NeurIPS 21), TEViT (CVPR 22), and ReferFormer (CVPR 22) #56

Closed: XLHappy123 closed this issue 2 years ago

XLHappy123 commented 2 years ago

Hi,

I would like to ask about the relation between your proposed cross-frame attention and the inter-frame mechanisms in IFC [1] and TEViT [2]; as far as I can tell, neither paper is cited. In addition, how does your text token relate to the one in ReferFormer (CVPR 22)?

Since the cross-frame communication transformer is presented as a major contribution of the paper, I need to raise an AIV concern.

[1] Video Instance Segmentation using Inter-Frame Communication Transformers. In NeurIPS 2021.
[2] Temporally Efficient Vision Transformer for Video Instance Segmentation. In CVPR 2022.
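For reference, my reading of the cross-frame communication design is roughly the following: each frame contributes a message token derived from its class token, and only these message tokens attend to each other across frames. This is a minimal PyTorch sketch of my own; the module and layer names are illustrative assumptions, not taken from this repo's code.

```python
import torch
import torch.nn as nn

class CrossFrameCommunication(nn.Module):
    """Sketch of message-token style cross-frame attention (illustrative only)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.msg_proj = nn.Linear(dim, dim)  # per-frame class token -> message token
        self.msg_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B, T, N, D) -- T frames, N tokens per frame (CLS at index 0)
        cls = frame_tokens[:, :, 0]              # (B, T, D) per-frame class tokens
        msg = self.msg_proj(cls)                 # (B, T, D) message tokens
        # Only the T message tokens attend to each other, so the cross-frame
        # step is cheap and the per-frame spatial attention is left untouched.
        fused, _ = self.msg_attn(msg, msg, msg)  # (B, T, D)
        msg = self.norm(msg + fused)
        # Append the fused message token to each frame's sequence for the
        # subsequent intra-frame attention block (not shown here).
        return torch.cat([frame_tokens, msg.unsqueeze(2)], dim=2)  # (B, T, N+1, D)

# usage: CrossFrameCommunication(768)(torch.randn(2, 8, 197, 768)).shape -> (2, 8, 198, 768)
```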

nbl97 commented 2 years ago

Thanks for your interest in our work. The three papers you mentioned are all about video segmentation, while our work focuses on video classification, which is a different task. We were NOT aware of these papers before the submission to ECCV'22 (March 2022). The differences and relationship to the specific papers you mentioned are as follows:

If you are interested, we can provide the detailed git commits, submission history, and experiment logs of our work, which was developed independently of the papers you mentioned and has fundamental differences from them. My research during my internship at Microsoft was recorded in detail and in full. Please e-mail me if you need more information (email: nibolin2019@ia.ac.cn). Anyway, thanks for your attention and the reminder. We will consider discussing these papers in future work. Thanks.

[1] Language as Queries for Referring Video Object Segmentation. In CVPR 2022.
[2] Video Instance Segmentation using Inter-Frame Communication Transformers. In NeurIPS 2021.
[3] Temporally Efficient Vision Transformer for Video Instance Segmentation. In CVPR 2022.
[4] Multi-Scale 2D Temporal Adjacent Networks for Moment Localization with Natural Language. In T-PAMI 2021.
[5] Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling. In CVPR 2021.
[6] CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. arXiv preprint.

nbl97 commented 2 years ago

If there are no other questions, I will close this issue. Please feel free to ping me by email: nibolin2019@ia.ac.cn