naraysa / gzsl-od

Out-of-Distribution Detection for Generalized Zero-Shot Action Recognition
MIT License

About testing code #6

Open GayatriPurandharT opened 4 years ago

GayatriPurandharT commented 4 years ago

Hello @naraysa, it would be really helpful if you could provide demo code for the gzsl-od case. Thanks a lot.

naraysa commented 4 years ago

There is no demo code as such. The OD setting can be run using the shell script that is already shared in the repo.

GayatriPurandharT commented 4 years ago

I did run the shell script. I want to test the model with my own video clip, and I think there is a lot of preprocessing to be done before I can feed it as input. It would be really helpful if you could describe the steps to extract spatio-temporal features from the test video. Thank you.

naraysa commented 4 years ago

There is no special preprocessing performed on the videos. The feature extraction steps are detailed in the experimental setup section of the paper. You can extract your own features using any of the publicly available repos, for example: https://github.com/facebookarchive/C3D or https://github.com/piergiaj/pytorch-i3d
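
For reference, a minimal sketch of extracting clip-level RGB features with piergiaj/pytorch-i3d, along the lines of that repo's extract_features.py (the checkpoint path is a placeholder and the exact preprocessing may differ, so double-check against the repo):

```python
import torch
from pytorch_i3d import InceptionI3d  # from piergiaj/pytorch-i3d

# Load the RGB I3D model (400 Kinetics classes) and a pre-trained checkpoint.
i3d = InceptionI3d(400, in_channels=3)
i3d.load_state_dict(torch.load('models/rgb_imagenet.pt'))  # placeholder path
i3d.eval()

# clip: (1, 3, T, 224, 224) float tensor of RGB frames,
# normalized the same way as in the pytorch-i3d repo.
clip = torch.randn(1, 3, 64, 224, 224)  # dummy clip for illustration

with torch.no_grad():
    feats = i3d.extract_features(clip)  # pooled Mixed_5c features
print(feats.shape)
```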

GayatriPurandharT commented 4 years ago

Thank you @naraysa, I used https://github.com/piergiaj/pytorch-i3d to extract per-segment RGB features of a video as numpy arrays. The shape of my RGB features is (65, 1, 1, 1024). Your self.train_feature.size() in util.py shows the train features to be of shape [5372, 8192]. I see that 5372 is the number of training samples, and I read in your paper that 8192 is the size of the concatenated appearance+flow features. I have two questions:

  1. Does your OD model accept RGB features as input? Or do appearance features mean RGB features?
  2. If yes, then can you please confirm from my feature shape (65, 1, 1, 1024) that 1024 is the size of the input feature vector I should be using? Thanks!
GayatriPurandharT commented 4 years ago

It is mentioned in the experimental setup section of your paper that:

For an input video, the Mixed_5c output of both networks are averaged across the temporal dimension and pooled by 4 in the spatial dimension and then flattened to obtain a vector, of size 4096, representing the appearance and flow features, respectively. The appearance and flow features are concatenated to obtain video features of size 8192.

So, one option I have is to adjust the input feature size from 8192 to 1024 (the size of the RGB features I obtained) and train the networks from scratch. To do this, can you please share how the i3d.mat file provided in the drive link was created, so that I can create video features and labels similar to the ones in i3d.mat?

Alternatively, to use the pre-trained model, I need to convert my test video features from their current size of 1024 to 8192 (as explained in the paper). The process explained in the paper (quoted above) seems difficult to implement on my own. It would be helpful if you could provide some means of obtaining features of size 8192. Any help would be greatly appreciated. Thank you.

naraysa commented 4 years ago

@saidwivedi or @vguptai Kindly take care of this.

saidwivedi commented 4 years ago

Where did you get the features of shape (65, 1, 1, 1024)?

For non-overlapping chunks of 64 frames, the Mixed_5c output of the I3D network is 1024x8x7x7. You then take the mean across the temporal dimension, which gives 1024x1x7x7. Applying 4x4 average pooling then gives 1024x1x2x2, which when flattened becomes 4096. This is for RGB, and the same logic applies to flow.

The concatenated RGB and flow features are 2 x 4096 = 8192 dimensional. You can take only the RGB features by slicing the first 4096 dimensions out of the 8192.
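
To make that concrete, here is a minimal sketch of the pooling described above, assuming you already have Mixed_5c outputs of shape (1024, T, 7, 7) from both the RGB and flow streams (the exact pooling stride/padding used for the repo's released features may differ slightly):

```python
import torch
import torch.nn.functional as F

def video_feature(mixed_5c_rgb, mixed_5c_flow):
    """Turn Mixed_5c outputs of shape (1024, T, 7, 7) into one 8192-d vector."""
    parts = []
    for x in (mixed_5c_rgb, mixed_5c_flow):
        x = x.mean(dim=1)                               # (1024, 7, 7): average over time
        x = F.avg_pool2d(x.unsqueeze(0), kernel_size=4,
                         stride=4, ceil_mode=True)      # (1, 1024, 2, 2): "pool by 4" spatially
        parts.append(x.flatten())                       # 1024 * 2 * 2 = 4096
    return torch.cat(parts)                             # appearance + flow = 8192

# Example with dummy tensors (T = 8 for a 64-frame clip):
rgb = torch.randn(1024, 8, 7, 7)
flow = torch.randn(1024, 8, 7, 7)
print(video_feature(rgb, flow).shape)  # torch.Size([8192])
```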

GayatriPurandharT commented 4 years ago

I used the repo piergiaj/pytorch-i3d to extract features of shape [65, 1, 1, 1024]. They come from the last layer, "Logits", of InceptionI3d. The Mixed_5c layer has features of shape [1, 101, 1024, 7, 7]. Thanks for explaining the process in detail.
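
In case it helps anyone else reading this: one generic way to grab the Mixed_5c activation (rather than the pooled/logits output) is a forward hook. A minimal sketch below, assuming the submodule is registered under the name 'Mixed_5c' in the pytorch-i3d model; inspect model.named_modules() if the name differs:

```python
import torch

def capture_mixed_5c(model, clip):
    """Run `clip` through `model` and return the captured Mixed_5c activation.

    Assumes a submodule named 'Mixed_5c' exists; check
    dict(model.named_modules()).keys() for the exact name.
    """
    captured = {}

    def hook(module, inputs, output):
        captured['mixed_5c'] = output.detach()

    handle = dict(model.named_modules())['Mixed_5c'].register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(clip)  # forward pass; we only need the hook's side effect
    finally:
        handle.remove()
    return captured['mixed_5c']  # expected (N, 1024, T', 7, 7) in the usual I3D layout
```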

GayatriPurandharT commented 4 years ago

I have a doubt about inference-time testing. It is mentioned in the paper, and I quote:

During inference, the test video is passed through a spatio-temporal CNN to compute the real features x_test and then sent to the OD detector. If the entropy of the output f_od(x_test) is less than a threshold ent_th, the feature x_test is passed through the seen-classes classifier f_s in order to predict the label of the test video.
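
In code, I read that inference rule roughly as the sketch below (my own paraphrase with placeholder callables f_od, f_s, f_u and threshold ent_th; not the repo's exact implementation):

```python
import torch
import torch.nn.functional as F

def gzsl_od_predict(x_test, f_od, f_s, f_u, ent_th):
    """Entropy-based routing between the seen and unseen classifiers.

    f_od, f_s, f_u are placeholder callables returning class logits for a
    single test feature x_test; ent_th is the entropy threshold from the paper.
    """
    probs = F.softmax(f_od(x_test), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    if entropy.item() < ent_th:
        return f_s(x_test).argmax(dim=-1)  # low entropy -> in-distribution -> seen classifier
    return f_u(x_test).argmax(dim=-1)      # high entropy -> out-of-distribution -> unseen classifier
```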

In the val_gzsl() function of classifier_entropy.py, the seen and unseen conditions are defined explicitly with a bool parameter that identifies seen/unseen classes. I want to know how to obtain the entropy threshold value 'ent_th'. Should it be calculated, or is it a fixed explicit value?

Thank you in advance.