simon-ging / coot-videotext

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Apache License 2.0
288 stars 55 forks

Video 100m feature extraction #17

Closed mireiahernandez closed 3 years ago

mireiahernandez commented 3 years ago

Hi, thank you for sharing your work, and congrats on the paper.

I am trying to extract HowTo100M video features using the video embedding network provided by Miech et al., 2020 (https://github.com/antoine77340/S3D_HowTo100M). In the paper you mention that you sample 0.6 frames/second; however, I can't figure out how to obtain the "frame features" using this network. Could you explain it in more detail? Thank you in advance.

simon-ging commented 3 years ago

Hi,

So this is what the S3D authors (Miech et al.) do and what we are currently doing:

We extract frames at 16 FPS, first cropping off the edges to make each frame square and then resizing to 256x256 px. We then feed a window of 32 frames at a time into the S3D model and save the resulting 512-dim vector as one "frame feature". We move the window forward by a stride of 16 frames and take the next 32, until the video is over. This results in about 1 FPS of frame features.

For the provided features and our paper, we used 10 FPS and 224x224 px, which results in about 0.6 FPS.

The first approach performs slightly better in our YouCook2 video retrieval experiments.
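For concreteness, here is a minimal sketch of that sliding-window extraction. Everything below is an illustration, not code from this repository: extract_frame_features is a hypothetical helper, model is assumed to be the S3D wrapper from https://github.com/antoine77340/S3D_HowTo100M (whose forward pass returns a dict containing a 512-dim "video_embedding"), and frames a float tensor of shape (3, num_frames, 256, 256) in [0, 1], sampled at 16 FPS:

import numpy as np
import torch

def extract_frame_features(model, frames, window=32, stride=16):
    # hypothetical sketch: slide a 32-frame window with stride 16 over the video
    model.eval()
    features = []
    with torch.no_grad():
        for start in range(0, frames.shape[1] - window + 1, stride):
            clip = frames[:, start:start + window].unsqueeze(0)  # (1, 3, 32, 256, 256)
            out = model(clip)  # assumed: dict with key "video_embedding"
            features.append(out["video_embedding"].squeeze(0).cpu().numpy())
    return np.stack(features)  # (num_feature_frames, 512)

At 16 FPS input with a stride of 16 frames, each feature covers one second, hence about 1 FPS of frame features; at 10 FPS input the same stride gives one feature per 1.6 seconds, i.e. about 0.6 FPS.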

Best

mireiahernandez commented 3 years ago

Thank you for your response!

Best

Maddy12 commented 2 years ago


Is the procedure you describe above the same as the one in VideoFeatureExtractor, or is it different?

simon-ging commented 2 years ago

The model they use for extraction is the same (s3d_howto100m.pth); as for the parameters/cropping, I don't know, since I have never used that repository.
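For reference, loading that checkpoint follows the README of the linked repository; the file names below are the ones distributed there:

import torch
from s3dg import S3D  # s3dg.py from the S3D_HowTo100M repository

# build the S3D wrapper with 512-dim output and load the HowTo100M weights
net = S3D("s3d_dict.npy", 512)
net.load_state_dict(torch.load("s3d_howto100m.pth"))
net.eval()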

Maddy12 commented 2 years ago

Ok. And how are the features stored in the H5 file? I cannot find a script that defines this either.

simon-ging commented 2 years ago

Given video_key as a str and data as a numpy array of shape (num_feature_frames, model_dim):

import h5py

# open the HDF5 file for writing; it closes automatically at the end of the block
with h5py.File("my_file.h5", "w") as h5:
    # loop over videos to write ...
    # each video key maps to its feature array
    h5[video_key] = data
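Reading the features back works the same way; a minimal sketch (the key "video_0001" is just a placeholder, not an actual key from the provided files):

import h5py
import numpy as np

# open the file read-only and load one video's feature array
with h5py.File("my_file.h5", "r") as h5:
    feats = np.array(h5["video_0001"])  # shape (num_feature_frames, model_dim)
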
simon-ging commented 2 years ago

Added the feature extraction code for Howto100m (S3D) features; see the readme chapter "Running your own video dataset on the trained models".

Kamino666 commented 2 years ago

Hello! I have received your email!