First, there is a piece of common sense in video understanding: since a video is a sequence of consecutive frames (i.e., 2D images), and strong 2D image classification models such as ResNet and ViT already exist, we always want our 3D models to leverage these pretrained 2D weights, which already provide powerful spatial modeling. Video understanding requires both spatial and temporal modeling capabilities, and initialization from 2D models benefits the former.
You are using ResNet-50, whose output dimension is 2048.
To leverage these pretrained weights, we must align the feature dimensions of our 3D models with those of the original 2D models; otherwise, we can't load the 2D weights onto the 3D models.
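As a side note, the usual trick for reusing 2D weights in a 3D network is to "inflate" each 2D convolution kernel along the temporal axis; here is a minimal plain-PyTorch sketch of the idea (the layer sizes are made up for illustration, and this is not the mmaction2 implementation):

import torch
import torch.nn as nn

# Hypothetical layers: a pretrained 2D conv and the 3D conv we want to initialize.
conv2d = nn.Conv2d(64, 128, kernel_size=3, padding=1)
conv3d = nn.Conv3d(64, 128, kernel_size=(3, 3, 3), padding=1)

# Inflate the 2D kernel: repeat it along the new temporal axis and divide by T
# so the activations keep roughly the same scale. This only works because the
# in/out channel dimensions of the two layers match.
t = conv3d.weight.shape[2]
with torch.no_grad():
    conv3d.weight.copy_(conv2d.weight.unsqueeze(2).repeat(1, 1, t, 1, 1) / t)
    conv3d.bias.copy_(conv2d.bias)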
Of course, you can also ignore the pretrained weights. For example, if you want the two models to have the same final feature dimension (say 512), you can set depth=34 and in_channels=512 for I3D, and embed_dims=512 and in_channels=512 for TimeSformer.
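In mmaction2 config terms, that would look roughly like the fragments below (illustrative only; other required fields are omitted, and the exact keys should be checked against your actual configs):

# Illustrative, incomplete config fragments (assumed keys; check your configs).
model_i3d = dict(
    backbone=dict(type='ResNet3d', depth=34),               # ResNet-34 ends in 512 channels
    cls_head=dict(type='I3DHead', in_channels=512, num_classes=10))

model_timesformer = dict(
    backbone=dict(type='TimeSformer', embed_dims=512),       # shrink the transformer width
    cls_head=dict(type='TimeSformerHead', in_channels=512, num_classes=10))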
The output shapes of I3D and TimeSformer are [batch_size x num_segs, 2048, T, H, W] and [batch_size x num_segs, 768], respectively. Here batch_size is 1 and num_segs is 10 x 3 = 30 (i.e., in the test_pipeline, num_clips=10 in SampleFrames and 3 crops from ThreeCrop). For the output of I3D, we average the features over the [num_segs, T, H, W] dimensions, so the final shape is 1 x 2048 in your case. However, we do not process the output of TimeSformer this way. You can simply average the I3D features only over the T, H, W dimensions (keeping num_segs) so that the number of feature vectors of the two models will be the same. The code for feature extraction is in the linked script.
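To make the shapes concrete, here is a standalone PyTorch sketch with dummy tensors (the T, H, W values are made up; this is not the mmaction2 code itself):

import torch

num_segs = 10 * 3  # 10 clips x 3 crops
i3d_feat = torch.randn(1 * num_segs, 2048, 4, 7, 7)  # [batch x num_segs, C, T, H, W]
tsf_feat = torch.randn(1 * num_segs, 768)            # [batch x num_segs, C]

# Current behaviour: pool over T, H, W, then also average over num_segs,
# which leaves a single 2048-d vector per video.
pooled = i3d_feat.mean(dim=(2, 3, 4)).reshape(1, num_segs, -1)  # (1, 30, 2048)
video_feat = pooled.mean(dim=1)                                 # (1, 2048)

# Keeping the per-clip features instead (skipping the last average)
# gives 30 vectors of 2048, matching TimeSformer's 30 vectors of 768.
print(video_feat.shape, pooled.shape, tsf_feat.shape)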
Sorry, my question was about the number of feature vectors, not about their dimension after feature extraction.
So I don't mind the dimensions being different; I only care about the number of vectors produced by the feature extraction.
So I'll ask my questions again and try to explain them better:
1. Why is there this difference in the NUMBER of features extracted? Wasn't it supposed to be the same? As explained in my original post, I3D extracts only 1 (one) vector of dimension 2048, while TimeSformer extracts 30 (thirty) vectors of dimension 768.
2. Can I also set I3D to extract the same number of features as TimeSformer? I tried changing num_clips in the test_pipeline (see the fragment below), but I3D keeps extracting a single vector even with 10 clips x 3 crops. I took this information from the FAQ on testing and clip-level feature extraction.
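For reference, the clip/crop settings I mean sit in the test pipeline roughly like this (fragment only; clip_len, frame_interval and crop_size here are just example values, and the other pipeline steps are omitted):

test_pipeline = [
    dict(type='SampleFrames', clip_len=32, frame_interval=2,
         num_clips=10, test_mode=True),    # 10 clips per video
    dict(type='ThreeCrop', crop_size=256),  # 3 crops per clip -> 30 segments
]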
Okay, I guess my first answer was just me practicing my poor English, haha.
Thank you very much for the answer.
So, in this case, I only have to comment out this part of the code from the link you sent?
if feat_dim == 5:  # 3D-CNN architecture
    # perform spatio-temporal pooling
    avg_pool = nn.AdaptiveAvgPool3d(1)
    if isinstance(feat, tuple):
        feat = [avg_pool(x) for x in feat]
        # concat them
        feat = torch.cat(feat, axis=1)
    else:
        feat = avg_pool(feat)
    # squeeze dimensions
    feat = feat.reshape((batches, num_segs, -1))
    # temporal average pooling
    feat = feat.mean(axis=1)
Besides, just to make sure, could you tell me from which layer the current framework extracts the features? Do you chop off the head and extract the features from the layer immediately before it?
Just comment out these two lines.
Yes, the features are extracted from the backbone, right before the head.
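Concretely (assuming the snippet above), that means commenting out the temporal average pooling at the end, so the script keeps the 30 per-clip vectors instead of collapsing them into one:

    # squeeze dimensions
    feat = feat.reshape((batches, num_segs, -1))  # shape (1, 30, 2048) here
    # # temporal average pooling
    # feat = feat.mean(axis=1)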
I commented out the two lines a few minutes ago and it worked as I wanted. Now I get 2048 (dim) x 30 (feature vectors).
Thank you very much for your help!
Hi guys,
Recently, I've started extracting features using the I3D and TimeSformer models that I fine-tuned on the UCFSports dataset (10 classes).
To build the extractors, I followed the SlowOnly example at this URL, with some adaptations for I3D and TimeSformer, which are the models I am using right now.
The problem is: when I extract features for a single video using TimeSformer, I get a file of 768 (feature dimension) x 30 (number of feature vectors), but with I3D I always get 2048 x 1.
My questions are:
Here are the training and testing pipelines I used for TimeSformer:
And these are the training and testing pipelines I used for I3D:
Some numbers are different because I also kept the values used in the original config files of both neural nets.