noureldien / timeception

Timeception for Complex Action Recognition, CVPR 2019 (Oral Presentation)
https://noureldien.com/research/timeception/
GNU General Public License v3.0

Charades Dataset Loading #17

Open NemioLipi opened 4 years ago

NemioLipi commented 4 years ago

Hi, thanks for sharing your code. Have you sampled all the videos of the Charades dataset to 1024 frames before loading? This procedure may take a lot of memory. Isn't it possible to upsample the feature maps obtained by running the provided pretrained I3D on the original 25fps videos, to get (128, 7, 7, 1024) instead of e.g. (45, 7, 7, 1024)? Would it affect the performance of Timeception afterwards?

noureldien commented 4 years ago

Hello Nemio Lipi,

Sorry for the late reply; I didn't notice the GitHub notification. Yes, using 45 segments instead of 128 would affect the performance. To reproduce the results of the paper, you have to randomly sample new frames each epoch. Please note that you need to sample features before training each epoch: https://github.com/noureldien/videograph/blob/master/experiments/exp_epic_kitchens.py#L97

And look here to see how the frames are sampled. Uniform (equidistant) sampling is done for test videos only: https://github.com/noureldien/videograph/blob/master/datasets/ds_breakfast.py#L601

But random sampling is done for training videos: you have to sample new segments before each epoch. Sample only segments, but don't sample frames within each segment; each segment should contain 8 successive frames.
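
A minimal sketch of this per-epoch segment sampling (not the repo's actual code; the function name, signature, and defaults are just for illustration):

```python
import numpy as np

def sample_segment_starts(n_video_frames, n_segments=128, segment_len=8, is_training=True):
    # Training: draw new random segment starts every epoch.
    # Test: equidistant (uniform) starts, fixed across epochs.
    last_start = n_video_frames - segment_len
    if is_training:
        starts = np.sort(np.random.randint(0, last_start + 1, size=n_segments))
    else:
        starts = np.linspace(0, last_start, num=n_segments).astype(int)
    # Each segment is 8 successive frames starting at its sampled start index.
    return np.stack([np.arange(s, s + segment_len) for s in starts])  # (n_segments, segment_len)
```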

And here is how to extract the features: https://github.com/noureldien/videograph/blob/master/datasets/ds_breakfast.py#L765
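
A rough sketch of per-segment feature extraction, assuming the segment starts from the sampling step above; the `backbone` callable here is a placeholder, not the repo's actual I3D wrapper:

```python
import numpy as np

def extract_segment_features(video_frames, segment_starts, segment_len=8, backbone=None):
    # video_frames: preprocessed frames, shape (n_frames, 224, 224, 3).
    # backbone: placeholder for a pretrained I3D feature extractor that maps
    #           an (8, 224, 224, 3) clip to a (7, 7, 1024) feature map.
    feats = []
    for s in segment_starts:
        clip = video_frames[s:s + segment_len]   # 8 successive frames
        feats.append(backbone(clip))             # (7, 7, 1024) per segment
    return np.stack(feats)                       # (n_segments, 7, 7, 1024)
```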

So, to answer your question directly: yes, if you train on pre-defined features, the performance drops significantly, because the Timeception layers need to see features of new segments each training epoch.

However, there is a trick that might alleviate this overhead (see the sketch after the list). Do the following:

  1. Pre-train the backbone CNN on the dataset.
  2. Extract features for 1024 segments per video.
  3. When training the Timeception layers, you can then sample from these features, rather than having to sample segments and feed them forward through the backbone.
  4. This way, you do the backbone feedforward only once. The downside is that you have to extract and save a lot of features: 1024 features (segments) for each video.
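
A sketch of this trick, assuming features for all 1024 segments have already been extracted and saved per video, e.g. as a (1024, 7, 7, 1024) array; the names below are hypothetical:

```python
import numpy as np

def sample_epoch_features(feats_all, n_timesteps=128, is_training=True):
    # feats_all: pre-extracted backbone features for one video,
    #            e.g. shape (1024, 7, 7, 1024) for 1024 segments.
    n_total = feats_all.shape[0]
    if is_training:
        # New random subset of segment features every training epoch.
        idx = np.sort(np.random.choice(n_total, size=n_timesteps, replace=False))
    else:
        # Equidistant segments at test time.
        idx = np.linspace(0, n_total - 1, num=n_timesteps).astype(int)
    return feats_all[idx]  # (128, 7, 7, 1024), fed to the Timeception layers
```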
NemioLipi commented 4 years ago

Thanks a lot for the response. As the number of frames may be very large, wouldn't the last trick you mentioned cause OOM problems?

noureldien commented 4 years ago

What do you mean by OOM problem?

basavaraj-hampiholi commented 3 years ago

Hi @noureldien,

It's really nice work, and a good presentation at CVPR-19 by Efstratios Gavves. And thanks for sharing the code.

I have a couple of queries regarding the Timeception paper and data loading:

  1. The input to I3D for feature extraction is 3xTx224x224, where T is 1024, and I3D yields features with dimensions 1024x128x7x7. So do you sample the entire video of any length (say, 5268 frames) down to a fixed 1024-frame clip? And do all these 1024 frames in the input clip belong to the same class? (Timeception does not produce framewise probabilities.)

  2. If all of the frames belong to the same class, how are you learning complex actions (which consist of several one-actions) with different temporal extents using multi-scale temporal kernels? (mentioned in Section 4.2 of the paper)

Thanks, Raj