
Explore-And-Match

Official PyTorch implementation of "Explore-And-Match: Bridging Proposal-Based and Proposal-Free With Transformer for Sentence Grounding in Videos".

Getting Started

:warning: Dependencies:

Dataset Preparation

Dataset splits

Download ActivityNet

Merge 'v1-2' and 'v1-3' into a single folder 'videos'.
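
If helpful, the merge can be done with a few lines of Python. This is only a sketch; the 'activitynet/v1-2', 'activitynet/v1-3', and 'activitynet/videos' paths are assumptions and should be adjusted to wherever the videos were downloaded.

```python
# Sketch: move every clip from the two ActivityNet releases into one 'videos' folder.
# The directory names below are assumptions; adapt them to your download location.
import shutil
from pathlib import Path

src_dirs = [Path("activitynet/v1-2"), Path("activitynet/v1-3")]  # assumed download dirs
dst_dir = Path("activitynet/videos")
dst_dir.mkdir(parents=True, exist_ok=True)

for src in src_dirs:
    for video in src.rglob("*"):
        if video.is_file():
            shutil.move(str(video), str(dst_dir / video.name))
```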

Download Charades

Pre-trained features

Preprocess

Get 64/128/256 frames per video:

bash preprocess/get_constant_frames_per_video.sh
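
The script samples a fixed number of frames uniformly over time from each video. A minimal Python equivalent is sketched below for reference; the helper name, paths, and use of OpenCV are assumptions, not part of the repository.

```python
# Sketch of uniform temporal sampling: pick NUM_FRAMES evenly spaced frames per video.
# Illustrative only; the actual logic lives in preprocess/get_constant_frames_per_video.sh.
import cv2
import numpy as np
from pathlib import Path

NUM_FRAMES = 128  # 64 / 128 / 256, matching the options above

def sample_frames(video_path: Path, out_dir: Path, num_frames: int = NUM_FRAMES):
    cap = cv2.VideoCapture(str(video_path))
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, idx in enumerate(indices):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            cv2.imwrite(str(out_dir / f"{i:04d}.jpg"), frame)
    cap.release()
```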

Extract features with CLIP

Rename the 'val_1' CLIP encodings to 'val' and the 'val_2' encodings to 'test'.

bash preprocess/get_clip_features.sh
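
Conceptually, each sampled frame is encoded into one CLIP feature vector. The sketch below shows the idea with the openai CLIP package; the 'ViT-B/32' backbone and the frame-directory layout are assumptions, and the actual pipeline is the script above.

```python
# Sketch: encode sampled frames into per-frame CLIP features.
# Backbone and paths are assumptions; see preprocess/get_clip_features.sh for the real pipeline.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image
from pathlib import Path

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def encode_frames(frame_dir: Path) -> torch.Tensor:
    frames = [preprocess(Image.open(p)) for p in sorted(frame_dir.glob("*.jpg"))]
    batch = torch.stack(frames).to(device)
    return model.encode_image(batch)  # (num_frames, feature_dim)
```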

Train

Train on either dataset, where {dataset} is activitynet or charades (e.g., bash train_activitynet.sh):

bash train_{dataset}.sh

Evaluation

bash test_{dataset}.sh

Configurations

Refer to lib/configs.py for the available configuration options.

Citation

@article{woo2022explore,
  title={Explore and Match: End-to-End Video Grounding with Transformer},
  author={Woo, Sangmin and Park, Jinyoung and Koo, Inyong and Lee, Sumin and Jeong, Minki and Kim, Changick},
  journal={arXiv preprint arXiv:2201.10168},
  year={2022}
}