movienet / movienet-tools

Tools for movie and video research
http://movienet.github.io

Problems extracting frames for feature extraction #1

Closed. albertaparicio closed this issue 4 years ago.

albertaparicio commented 4 years ago

I am trying to extract action features for a video, but I am having trouble figuring out how the video frames are supposed to be extracted.

When running scripts/extract_action_feats.py, it expects frames to be saved in ${movienet_root}/frames/shot_number/frame_number, but I cannot find which module extracts the needed frames in this format.

The closest I have gotten is extracting a number of keyframes with demos/detect_shots.py, but the format is not the right one.

Could you point me to the right way, please?

ycxioooong commented 4 years ago

Hi there,

Currently, we only support videos that have already been decomposed into frames. One should also provide shot detection results obtained with our shot detector.

We will support extracting action features directly from video files soon.
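For reference, below is a minimal sketch of how a video could be pre-decomposed into per-shot frame folders. It assumes mmcv is installed and that the shot file contains one "start_frame end_frame" pair per line; the shot_XXXX/img_XXXXXX.jpg naming is only a guess, so check the 'twolevel' layout expected by VideoFileBackend for the exact convention.

# == sketch: pre-decompose a video into per-shot frame folders ==
# NOTE: the directory/file naming below is an assumption, not the verified
# layout; adjust it to whatever VideoFileBackend('twolevel', ...) expects.
import os
import os.path as osp

import mmcv

def video_to_shot_frames(video_path, shot_file, out_root):
    video = mmcv.VideoReader(video_path)  # frame-indexable video reader
    # each line of shot_file is assumed to hold "start_frame end_frame"
    shots = [tuple(map(int, line.split()[:2]))
             for line in open(shot_file) if line.strip()]
    for shot_id, (start, end) in enumerate(shots):
        shot_dir = osp.join(out_root, 'shot_{:04d}'.format(shot_id))
        os.makedirs(shot_dir, exist_ok=True)
        for frame_id in range(start, min(end + 1, len(video))):
            frame = video[frame_id]  # decoded frame as a BGR ndarray
            mmcv.imwrite(frame, osp.join(shot_dir, 'img_{:06d}.jpg'.format(frame_id)))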

ycxioooong commented 4 years ago

Hi there,

We now support extracting action features directly from videos. Please see the source code at https://github.com/movienet/movienet-tools/blob/master/scripts/extract_action_feats.py

Originally we used VideoFileBackend as the video frame reader, which requires users to first convert the video into frames. Now you can use VideoMMCVBackend, which extracts frames directly from the video file.

# == previous version ==
video = VideoFileBackend(
    'twolevel',
    osp.join(args.movienet_root, 'frame', movie_id),
    shot_file=shot_file)

# == new option ==
video = VideoMMCVBackend(
    osp.join(args.movienet_root, 'video', f"{movie_id}.mp4"))

You can customize this script to adapt it to any video source you want.

albertaparicio commented 4 years ago

I just tried out the new code and I have been able to extract action features for a video.

I do not, however, fully understand what features have been extracted and how I can use them. I was wondering if it would be possible to schedule a video call with you so we can discuss this in detail.

I am very interested in evaluating this model for my research on activity classification in video.

You can contact me directly at albert.aparicio.-nd@disneyresearch.com

Thank you

ycxioooong commented 4 years ago

Sure, if you want.

And for other users who are interested in the feature extractors, let me explain the action feature extractor in detail. The model is a spatio-temporal action detection model based on Fast R-CNN with a NonLocal-I3D-50 backbone, pre-trained on the AVA dataset. After running this action extractor, for each shot one obtains an action feature for every detected person instance (tracklet).

The workflow of the extractor on each shot is:

1. For a shot with N frames, it first uniformly samples one or multiple frame sequences of length M. For example, if N=80 and M=64, the sampled frame sequences would be [0, 63] and [16, 79].
2. If no tracklets are provided for the shot, the extractor calls a person detector to detect persons in the middle frame of each frame sequence, e.g., frame #31 for sequence [0, 63] and frame #47 for sequence [16, 79].
3. Once person bounding boxes are detected, the box IoUs between the two frames are computed and bipartite matching is performed over these boxes to form tracklets.
4. The frame sequences and the detected person bounding boxes are then fed into the model to obtain instance-wise action features.
5. Finally, the features extracted from each frame sequence are averaged per tracklet to produce the final action feature of each person instance appearing in the shot.
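For readers who want to prototype parts of this pipeline themselves, here is a minimal sketch of steps (1), (3) and (5): uniform window sampling, IoU-based bipartite matching of person boxes between the sampled frames, and per-tracklet feature averaging. The helper names, the IoU threshold and the use of SciPy's Hungarian solver are illustrative assumptions, not the repo's actual implementation.

# == sketch: window sampling, box matching, feature averaging ==
import numpy as np
from scipy.optimize import linear_sum_assignment

def sample_windows(num_frames, seq_len):
    # Step (1): uniformly sample window start indices over the shot.
    if num_frames <= seq_len:
        return [0]
    num_seqs = int(np.ceil(num_frames / seq_len))
    starts = np.linspace(0, num_frames - seq_len, num_seqs)
    return [int(round(s)) for s in starts]

def iou_matrix(boxes_a, boxes_b):
    # Pairwise IoU between two lists of [x1, y1, x2, y2] boxes.
    ious = np.zeros((len(boxes_a), len(boxes_b)))
    for i, a in enumerate(boxes_a):
        for j, b in enumerate(boxes_b):
            x1, y1 = max(a[0], b[0]), max(a[1], b[1])
            x2, y2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            union = ((a[2] - a[0]) * (a[3] - a[1])
                     + (b[2] - b[0]) * (b[3] - b[1]) - inter)
            ious[i, j] = inter / (union + 1e-6)
    return ious

def match_boxes(boxes_a, boxes_b, iou_thr=0.5):
    # Step (3): bipartite (Hungarian) matching on 1 - IoU; keep confident pairs.
    cost = 1.0 - iou_matrix(boxes_a, boxes_b)
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if 1.0 - cost[i, j] >= iou_thr]

def average_tracklet_features(window_feats):
    # Step (5): average the per-window instance features of one tracklet.
    return np.mean(np.stack(window_feats, axis=0), axis=0)

As a sanity check, sample_windows(80, 64) returns [0, 16], which corresponds to the [0, 63] and [16, 79] frame sequences in the example above.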