Closed rsomani95 closed 1 year ago
Hi @rsomani95,
Thank you for your kind words. The run.py
script is used as an easy-to-run demo with as few dependencies as possible, thus we removed the SlowFast feature and only rely on the CLIP feature due to its easier deployment process. For best performance, both SlowFast and CLIP features are needed.
Best, Jie
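The setup described above can be sketched in a few lines: each video is split into clips, and for best performance the per-clip CLIP and SlowFast feature vectors are concatenated along the feature dimension, while the demo path uses the CLIP features alone. This is a minimal illustration with placeholder random arrays; the dimensions (512 for CLIP ViT-B/32, 2304 for SlowFast) are assumptions, not values taken from this thread.

```python
import numpy as np

# Hypothetical per-clip features for one video with 75 clips.
# Dims are assumed: CLIP ViT-B/32 -> 512, SlowFast -> 2304.
clip_feats = np.random.randn(75, 512).astype(np.float32)
slowfast_feats = np.random.randn(75, 2304).astype(np.float32)

# CLIP-only path (what the run.py demo relies on):
video_feats_demo = clip_feats

# Best-performance path: concatenate both feature sets per clip.
video_feats_full = np.concatenate([clip_feats, slowfast_feats], axis=1)
# video_feats_full has shape (75, 2816)
```

The trade-off Jie mentions falls out of this: the CLIP-only input is smaller and needs only one extractor at inference time, at some cost in accuracy.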
Hi @jayleicn,
Thank you for your response. I see, that's great to know. I'm interested in deploying this model, and as you correctly mentioned, it's much easier to only rely on CLIP features.
Do you have validation scores for the model without SlowFast features? This wasn't reported in the paper, but I'm curious if you ever tried training a model without SlowFast and just on CLIP features?
Thanks!
I don't have the exact number either, but as far as I remember, the CLIP-only model achieves at least 90-95% of the CLIP+SlowFast model's performance, so it is also a very decent model.
Got it. Awesome! That's higher than I'd intuited.
That answers my original question, thank you.
Hello. Firstly, congratulations and thank you for sharing this work, it's really cool!
I had a question regarding feature extraction. The paper and the training script
train.sh
suggest that there are two sets of video features being used -- SlowFast and CLIP. I confirmed that the shared moment_detr_features.tar.gz
file has both the SlowFast & CLIP features available as well. However, in the inference script
run.py
, only the ClipFeatureExtractor
is used. Do we not need SlowFast features during inference? Or am I missing something?