jayleicn / moment_detr

[NeurIPS 2021] Moment-DETR code and QVHighlights dataset
https://arxiv.org/abs/2107.09609
MIT License

CLIP or HERO feature extraction #30

Closed Rj-batista closed 1 year ago

Rj-batista commented 1 year ago

Hi,

I am a little confused about feature extraction. If I am correct, there are two kinds of features: OpenAI CLIP and HERO_VIDEO_FEATURE_EXTRACTOR. I wanted to know the difference between the two and the purpose of CLIP. Also, I have run HERO_VIDEO_FEATURE_EXTRACTOR and I am left with 4 files:

Thank you

jayleicn commented 1 year ago

HERO_VIDEO_FEATURE_EXTRACTOR is a simple codebase for extracting the video features (including CLIP and SlowFast) that we use in this codebase.

Rj-batista commented 1 year ago

Ok, the first question is clearer now, I appreciate your prompt response. Could you tell me which files from HERO_VIDEO_FEATURE_EXTRACTOR correspond to the ones you provide for download in the "Prepare feature files" section of your README? There is no indication of which files HERO_VIDEO_FEATURE_EXTRACTOR outputs. From HERO_VIDEO_FEATURE_EXTRACTOR with my own video:

From moment_detr_features.tar.gz

Thank you

jayleicn commented 1 year ago

All 3 of the listed features come from CLIP. clip_sub_feature is only used for pre-training, clip_feature contains the video frame features, and clip_text_feature contains the features for the user text queries. Besides, you can look at our inference example here to figure out which exact features are used: https://github.com/jayleicn/moment_detr/tree/main/run_on_video. Note that this is a simplified model which does not use SlowFast features for videos.
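To make the layout concrete, here is a minimal sketch of loading the downloaded feature files with numpy. The folder names, file names, and the "features" key are assumptions based on the released moment_detr_features.tar.gz; verify them against your own download before relying on them.

```python
# Sketch only: inspect the released QVHighlights feature files.
# Assumed layout (verify locally): one .npz per video / per query,
# each storing its array under the key "features".
import numpy as np

video_feat = np.load("features/clip_features/example_vid.npz")["features"]      # hypothetical file name
query_feat = np.load("features/clip_text_features/qid0.npz")["features"]        # hypothetical file name

print(video_feat.shape)  # e.g. (num_clips, 512): one CLIP ViT feature per video clip
print(query_feat.shape)  # e.g. (num_query_tokens, 512): CLIP text-encoder features for the query
```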

Rj-batista commented 1 year ago

I am sorry, it's still confusing for me because I want to train your model with my own data. How do you generate clip_feature, clip_sub_feature, and clip_text_feature from HERO_Video_Feature_Extractor? I presume that you need to use the "Image-text pre-trained CLIP features" section of the repo to generate those 3 files. The issue is that after running the docker image with my own data, I am left with only one npz file in a folder called clip-vit_feature. So how do you generate those folders in order to train your model? Thank you for your response.

jayleicn commented 1 year ago

clip_feature is the vision feature clip-vit_feature. clip_sub_feature and clip_text_feature are both text features; you will need to create your own script to extract them, following what is shown in this demo. clip_text_feature is the extracted CLIP text feature for the user queries, and clip_sub_feature is the extracted text feature for the video subtitles; both come from the same CLIP text encoder.
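As a starting point, here is a minimal sketch of extracting text features with the standard openai/CLIP package (https://github.com/openai/CLIP). This is not the exact script used for QVHighlights: the run_on_video demo uses token-level CLIP text features, whereas encode_text below returns one pooled embedding per sentence, so you would need to adapt it (or reuse the demo's text encoder) for per-token features. The output file name and "features" key mirror the assumed layout above.

```python
# Sketch only: pooled CLIP text features via the openai/CLIP package.
# The actual Moment-DETR pipeline uses token-level text features (see run_on_video).
import numpy as np
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

queries = ["a man is cooking dinner in the kitchen"]  # your user queries or subtitle lines
tokens = clip.tokenize(queries).to(device)            # shape: (num_queries, 77)

with torch.no_grad():
    text_feats = model.encode_text(tokens)            # shape: (num_queries, 512), pooled

# Save one .npz per query, mirroring the released feature layout (assumed).
np.savez("clip_text_features/qid0.npz", features=text_feats.cpu().numpy())
```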

Rj-batista commented 1 year ago

Ok, so clip_feature is generated from clip-vit_feature. For clip_sub_feature and clip_text_feature, should I use this script to generate them?

Thanks

Rj-batista commented 1 year ago

Sorry for the delay, but I have found out how it works now. Thanks for all of your answers, and great work btw!!

XiaohuJoshua commented 1 month ago

> Sorry for the delay, but I have found out how it works now. Thanks for all of your answers, and great work btw!!

Excuse me, would you be willing to share the code for extracting the QVHighlights text features?