jayleicn / moment_detr

[NeurIPS 2021] Moment-DETR code and QVHighlights dataset
https://arxiv.org/abs/2107.09609
MIT License

Question Regarding Feature Extraction Discrepancy Between Training & Inference #26

Closed rsomani95 closed 1 year ago

rsomani95 commented 1 year ago

Hello. Firstly, congratulations and thank you for sharing this work, it's really cool!

I had a question regarding feature extraction. The paper and the training script train.sh suggest that two sets of video features are used -- SlowFast and CLIP. I confirmed that the shared moment_detr_features.tar.gz file has both the SlowFast and CLIP features available as well.

However, in the inference script run.py, only the ClipFeatureExtractor is used. Do we not need SlowFast features during inference? Or am I missing something?

jayleicn commented 1 year ago

Hi @rsomani95,

Thank you for your kind words. The run.py script is intended as an easy-to-run demo with as few dependencies as possible, so we removed the SlowFast features and rely only on the CLIP features, which are easier to deploy. For best performance, both SlowFast and CLIP features are needed.
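For illustration only (this helper and the 512/2304 feature dimensions are my assumptions, not code from the repo), the difference between the two paths boils down to whether the per-clip SlowFast features get concatenated onto the CLIP features before being fed to the model:

```python
import numpy as np

def build_video_features(clip_feats, slowfast_feats=None):
    """Concatenate per-clip feature streams along the channel axis.

    clip_feats:     (num_clips, d_clip) CLIP visual features
    slowfast_feats: (num_clips, d_sf) SlowFast features, or None for a
                    CLIP-only path like the run.py demo
    """
    if slowfast_feats is None:
        return clip_feats
    # Both streams must cover the same number of video clips.
    assert clip_feats.shape[0] == slowfast_feats.shape[0]
    return np.concatenate([clip_feats, slowfast_feats], axis=1)

# Assumed dimensions: 512 for CLIP ViT-B/32, 2304 for SlowFast.
clip_feats = np.random.randn(75, 512).astype(np.float32)
sf_feats = np.random.randn(75, 2304).astype(np.float32)

full = build_video_features(clip_feats, sf_feats)  # shape (75, 2816)
demo = build_video_features(clip_feats)            # shape (75, 512)
```

The model's input projection layer then just needs to match whichever combined dimensionality was used at training time.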

Best, Jie

rsomani95 commented 1 year ago

Hi @jayleicn,

Thank you for your response. I see, that's great to know. I'm interested in deploying this model, and as you correctly mentioned, it's much easier to only rely on CLIP features.

Do you have validation scores for the model without SlowFast features? This wasn't reported in the paper, but I'm curious if you ever tried training a model without SlowFast and just on CLIP features?

Thanks!

jayleicn commented 1 year ago

I don't have the exact numbers either, but as far as I remember, the CLIP-only model achieves at least 90-95% of the CLIP+SlowFast model's performance, so it is also a very decent model.

rsomani95 commented 1 year ago

Got it. Awesome! That's higher than I'd intuited.

That answers my original question, thank you.