open-mmlab / mmaction2

OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
https://mmaction2.readthedocs.io
Apache License 2.0

Feature extraction using I3D and TimeSformer returns different number of feature vectors #2177

Closed: suzana-rita closed this issue 1 year ago

suzana-rita commented 1 year ago

Hi guys,

Recently, I started extracting features with the I3D and TimeSformer models that I fine-tuned on the UCFSports dataset (10 classes).

To build the extractors, I followed the SlowOnly example in this url, with some adaptations for I3D and TimeSformer, which are the models I am using right now.

The problem is: when I extract features for a single video with TimeSformer, I get a file of 768 (feature dimension) x 30 (number of feature vectors). With I3D, I always get 2048 x 1 (a single feature vector).

My questions are:

  1. Why is there this difference in the number of feature vectors extracted? Wasn't it supposed to be the same?
  2. Can I also set up I3D to extract the same number of feature vectors as TimeSformer?

Here are the training and testing pipelines I used for TimeSformer:

train_pipeline = [
    dict(type='DecordInit'),
    dict(type='SampleFrames', clip_len=8, frame_interval=32, num_clips=1),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='RandomResizedCrop'),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=8,
        frame_interval=32,
        num_clips=10,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='ThreeCrop', crop_size=224),
    dict(type='Flip', flip_ratio=0),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]

And these are the training and testing pipelines I used for I3D:

train_pipeline = [
    dict(type='DecordInit'),
    dict(type='SampleFrames', clip_len=32, frame_interval=2, num_clips=1),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='RandomResizedCrop'),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=32,
        frame_interval=2,
        num_clips=10,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='ThreeCrop', crop_size=224),
    dict(type='Flip', flip_ratio=0),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]

Some numbers differ between the two pipelines because I kept the values from the original config files of each network.

Dai-Wenxun commented 1 year ago

First, some common sense in the area of video understanding: since a video is a sequence of consecutive frames (i.e., 2D images), we always want our 3D models to leverage the pretrained weights of current 2D image classification models, such as ResNet and ViT, which already have a powerful capability for spatial modeling. Video understanding requires the model to have both spatial and temporal modeling capabilities, and initialization from 2D models benefits the former.

  1. I3D is the original ResNet inflated along the time dimension, so we can take the 2D pretrained ResNet weights to initialize the I3D model and boost performance. The variants of ResNet are as follows: [figure: table of ResNet variants and their output feature dimensions]

You are using ResNet-50, whose output dimension is 2048.

  2. TimeSformer likewise uses the 2D pretrained weights of ViT, whose output dimension is 768.

To leverage these pretrained weights, the feature dimensions of our 3D models must match those of the original 2D models; otherwise, the 2D weights cannot be loaded into the 3D models.
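To illustrate the point, here is a rough sketch of the standard I3D-style kernel inflation (variable names are illustrative, not MMAction2 internals): the 2D kernel is repeated along the time axis and rescaled, which only works if the channel counts of the 2D and 3D models are identical.

import torch

t = 3  # temporal size of the inflated conv kernel
w2d = torch.randn(64, 3, 7, 7)                    # 2D conv weight [out, in, H, W]
w3d = w2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t  # inflate to [out, in, T, H, W]
# This only works because the out/in channel counts match between the 2D and 3D
# models; change the 3D model's width and the 2D checkpoint no longer loads.
print(w3d.shape)  # torch.Size([64, 3, 3, 7, 7])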

Of course, you can ignore the pretrained weights. For example, if you want the two models to produce final features of the same dimension (say 512), you can set depth=34 and in_channels=512 for I3D, and embed_dims=512 and in_channels=512 for TimeSformer.
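A minimal sketch of just those fields (field names assumed from typical MMAction2 model configs, so verify them against your version; all other fields of your existing configs are omitted here and stay unchanged):

# I3D with 512-d features: a ResNet-34 backbone ends with 512 channels.
i3d_model = dict(
    backbone=dict(type='ResNet3d', depth=34, pretrained=None),  # no matching 2D weights
    cls_head=dict(type='I3DHead', num_classes=10, in_channels=512))

# TimeSformer with 512-d features: shrink the transformer embedding width.
timesformer_model = dict(
    backbone=dict(type='TimeSformer', embed_dims=512, pretrained=None),
    cls_head=dict(type='TimeSformerHead', num_classes=10, in_channels=512))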

Dai-Wenxun commented 1 year ago

The output shapes of I3D and TimeSformer are [batch_size * num_segs, 2048, T, H, W] and [batch_size * num_segs, 768], respectively. The batch_size is 1, and num_segs is 10 x 3 = 30 (i.e., num_clips=10 in SampleFrames and 3 crops from ThreeCrop in the test_pipeline). For the output of I3D, we average the features over the [num_segs, T, H, W] dimensions, so the final shape is 1 x 2048 in your case. However, we do not post-process the output of TimeSformer in this way. You can just average the I3D features over the HW dimensions so that the number of feature vectors of the two models becomes equivalent. See the code for feature extraction.
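A rough sketch of this shape bookkeeping (the tensor sizes are illustrative; T, H, W depend on the clip length and input resolution, and here the temporal dimension is also pooled per clip so that each of the 30 clip/crop segments keeps one vector):

import torch
import torch.nn as nn

batches, num_segs = 1, 30                                    # 10 clips x 3 crops
i3d_feat = torch.randn(batches * num_segs, 2048, 4, 7, 7)    # [N * num_segs, C, T, H, W]
tsf_feat = torch.randn(batches * num_segs, 768)              # [N * num_segs, C]

# Default post-processing for 3D CNNs: pool over T, H, W, then average the
# clips/crops, leaving a single 2048-d vector per video.
pooled = nn.AdaptiveAvgPool3d(1)(i3d_feat)                   # [30, 2048, 1, 1, 1]
pooled = pooled.reshape(batches, num_segs, -1).mean(dim=1)   # [1, 2048]

# Keeping one vector per clip/crop instead: pool within each segment but do not
# average over num_segs, matching TimeSformer's 30 vectors per video.
per_clip = i3d_feat.mean(dim=[2, 3, 4]).reshape(batches, num_segs, -1)  # [1, 30, 2048]
print(pooled.shape, per_clip.shape, tsf_feat.shape)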

suzana-rita commented 1 year ago

Sorry, my question was about the number of feature vectors, not about the dimension of the feature vectors after extraction.

So, I do not mind that their dimensions are different; I only care about the number of vectors produced by the feature extraction.

So I'll ask my questions again and try to explain them better:

1. Why is there this difference in the NUMBER of feature vectors extracted? Wasn't it supposed to be the same? As explained in my original post, I3D extracts only 1 vector (of dimension 2048), while TimeSformer extracts 30 vectors (of dimension 768).

2. Can I also set up I3D to extract the same number of feature vectors as TimeSformer? I tried changing num_clips in the test_pipeline, but I3D keeps extracting a single vector even with 10 clips x 3 crops. This information comes from the FAQ on testing and clip-level feature extraction.

Dai-Wenxun commented 1 year ago


Okay, I guess my first answer was just me practicing my poor English, haha.

suzana-rita commented 1 year ago

Thank you very much for the answer.

So, in this case, I only have to comment out this part of the code you linked?

if feat_dim == 5:  # 3D-CNN architecture
    # perform spatio-temporal pooling
    avg_pool = nn.AdaptiveAvgPool3d(1)
    if isinstance(feat, tuple):
        feat = [avg_pool(x) for x in feat]
        # concat them
        feat = torch.cat(feat, axis=1)
    else:
        feat = avg_pool(feat)
    # squeeze dimensions
    feat = feat.reshape((batches, num_segs, -1))
    # temporal average pooling
    feat = feat.mean(axis=1)

Besides, just to make sure, could you tell me from which layer the current framework extracts the features? Do you chop off the head and take the features from the layer immediately before it?

Dai-Wenxun commented 1 year ago

Just comment out these two lines.

Yes, the features are taken from the backbone output, right before the head.
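For reference, a hedged sketch of what the snippet above yields if the two lines in question are the final reshape and temporal mean (the exact downstream handling depends on the extraction script):

import torch
import torch.nn as nn

batches, num_segs = 1, 30                   # 10 clips x 3 crops
feat = torch.randn(batches * num_segs, 2048, 4, 7, 7)

feat = nn.AdaptiveAvgPool3d(1)(feat)        # [30, 2048, 1, 1, 1]
# With the reshape to (batches, num_segs, -1) and the mean over num_segs
# commented out, each clip/crop keeps its own 2048-d vector:
feat = feat.squeeze()                       # [30, 2048]
print(feat.shape)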

suzana-rita commented 1 year ago

I commented out the two lines a few minutes ago and it worked as I wanted. Now I get a 2048 (dim) x 30 (feature vectors) output.

Thank you very much for your help!