Closed: XLHappy123 closed this issue 2 years ago
Thanks for your interest in our work. The three papers you mention all target video segmentation, while our work focuses on video classification, which is a different task. We were NOT aware of these papers before the ECCV'22 submission (March 2022). The differences and relationship to each of the papers you mention are as follows:
ReferFormer[1]:
IFC[2] and TeViT[3]:
Timeline: We first tried the idea of message tokens on 2021/5/4. IFC became available on arXiv on 2021/6/7 (https://arxiv.org/abs/2106.03299), and TeViT on 2022/4/18 (https://arxiv.org/abs/2204.08412, which is even after the ECCV'22 submission deadline). The following figures show that our work was developed independently of these two works; a minimal sketch of the message-token idea is given after the figures.
[Git commits of the message-token idea. The implementation in X-CLIP is an extension of v3, while IFC and TeViT employ v1.]
[arXiv submission history of IFC[2]]
[arXiv submission history of TeViT[3]]
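For context, here is a minimal PyTorch sketch of the message-token idea being discussed: each frame contributes one "message" token, the message tokens attend to each other across frames, and each frame's patch tokens then read the result back. This is an illustrative sketch only; `MessageTokenBlock` and its exact structure are hypothetical and are not the actual X-CLIP, IFC, or TeViT implementation.

```python
import torch
import torch.nn as nn

class MessageTokenBlock(nn.Module):
    """Hypothetical sketch of message-token cross-frame communication."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        # attention among message tokens (across frames)
        self.msg_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # attention within each frame (patches + own message token)
        self.frame_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, msg):
        # x:   (B, T, N, D) patch tokens for T frames
        # msg: (B, T, D)    one message token per frame
        B, T, N, D = x.shape
        # 1) message tokens exchange information across frames
        msg, _ = self.msg_attn(msg, msg, msg)
        msg = self.norm(msg)
        # 2) each frame attends over [its patches; its message token]
        x = x.reshape(B * T, N, D)
        m = msg.reshape(B * T, 1, D)
        tokens = torch.cat([x, m], dim=1)
        out, _ = self.frame_attn(tokens, tokens, tokens)
        x, msg = out[:, :N], out[:, N]
        return x.reshape(B, T, N, D), msg.reshape(B, T, D)

# Example usage with made-up shapes: 2 clips, 8 frames, 7x7 patches, dim 512.
block = MessageTokenBlock(dim=512)
x = torch.randn(2, 8, 49, 512)
msg = torch.randn(2, 8, 512)
x, msg = block(x, msg)
```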
If you are further interested, we can provide the detailed git commits, submission history, and experiment logs of our work, which was developed independently of the papers you mention and differs from them fundamentally. My research during my internship at Microsoft was recorded in detail and in full. Please e-mail me if you need more information (email: nibolin2019@ia.ac.cn). Anyway, thanks for your attention and the reminder. We will consider discussing these papers in future work. Thanks.
[1] Language as Queries for Referring Video Object Segmentation. In CVPR 2022.
[2] Video Instance Segmentation using Inter-Frame Communication Transformers. In NeurIPS 2021.
[3] Temporally Efficient Vision Transformer for Video Instance Segmentation. In CVPR 2022.
[4] Multi-Scale 2D Temporal Adjacent Networks for Moment Localization with Natural Language. In T-PAMI 2021.
[5] Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling. In CVPR 2021.
[6] CLIP4Clip: An Empirical Study of CLIP for End-to-End Video Clip Retrieval. arXiv preprint.
If there are no other questions, I will close this issue. Please feel free to reach me by email: nibolin2019@ia.ac.cn
Hi,
I would like to ask about the relation between your proposed cross-frame attention and the mechanisms in IFC [1] and TeViT [2]; as far as I can tell, neither paper is cited. In addition, how does the text token relate to ReferFormer (CVPR 2022)?
Since the cross-frame communication transformer is presented as a major contribution of the paper, I need to raise an AIV concern.
[1] Video Instance Segmentation using Inter-Frame Communication Transformers. In NeurIPS 2021.
[2] Temporally Efficient Vision Transformer for Video Instance Segmentation. In CVPR 2022.