noureldien / timeception

Timeception for Complex Action Recognition, CVPR 2019 (Oral Presentation)
https://noureldien.com/research/timeception/
GNU General Public License v3.0

How to process 1024 frames as claimed in the paper and the cvpr talk? #3

Closed fandulu closed 5 years ago

fandulu commented 5 years ago

Excuse me, could I ask how to process 1024 frames by the timeception?

When I looked into the code, I found that the input to your model has N_TC_TIMESTEPS: 128 timesteps rather than N_INPUT_TIMESTEPS: 1024; 1024 seems to be the number of filters. In timeception_pytorch.py, you also have n_channels_in = input_shape[1], which puts 1024 at the channel position.

I am not sure whether I have misunderstood some part, and I would appreciate it if you could give some explanation.

noureldien commented 5 years ago

Hi Fan Yang,

Thanks for asking. If Timeception layers are used on top of I3D (as a backbone CNN), then the total number of input frames is 1024. I3D, the backbone CNN, takes a burst of 8 successive frames and represents them as a feature map of shape 1x7x7x1024 (time x height x width x channels). Timeception then takes 128 such features and models them using 4 Timeception layers. So the total is 128 x 8 = 1024 frames. I3D + Timeception can be trained end-to-end. Hope this helps.

I3D CNN + 2 layer Timeception => 256 frames
I3D CNN + 3 layer Timeception => 512 frames
I3D CNN + 4 layer Timeception => 1024 frames
I3D CNN + 5 layer Timeception => 2048 frames
I3D CNN + 6 layer Timeception => 4096 frames
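The frame arithmetic above can be sketched in a few lines of Python (the shapes follow the 1x7x7x1024 feature described above; the variable names are illustrative, not from the repository):

```python
# Each I3D burst of 8 successive frames becomes one feature map of
# spatial size 7x7 with 1024 channels; Timeception models 128 of them.
frames_per_burst = 8   # frames compressed into one I3D feature
n_features = 128       # timesteps seen by the Timeception stack

total_frames = n_features * frames_per_burst
print(total_frames)  # 1024

# Shape of the feature tensor fed to Timeception, per video:
# (time, height, width, channels)
feature_shape = (n_features, 7, 7, 1024)
```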

If you want to use a 2D CNN, like ResNet, you can use 6 layers of Timeception, which will model 512 frames. Here, the Timeception layers truly model 512 successive frames: since the backbone CNN is 2D, it does not compress time. Each Timeception layer expands the temporal footprint by a factor of two.

2D CNN + 2 layer Timeception => 32 frames
2D CNN + 3 layer Timeception => 64 frames
2D CNN + 4 layer Timeception => 128 frames
2D CNN + 5 layer Timeception => 256 frames
2D CNN + 6 layer Timeception => 512 frames
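The doubling pattern in both tables can be captured in a small helper. This is a sketch under the assumption, consistent with the numbers above, that the deepest layer sees 8 timesteps and each added layer doubles the footprint; the function name is mine, not from the repository:

```python
def temporal_footprint(n_layers, frames_per_feature=1):
    """Frames covered by a stack of n_layers Timeception layers.

    Assumes each layer halves the timesteps it passes upward (temporal
    max-pooling), so every added layer doubles the footprint. For a 2D
    backbone each feature is one frame (frames_per_feature=1); for I3D
    each feature summarizes a burst of 8 frames (frames_per_feature=8).
    """
    n_timesteps = 8 * 2 ** n_layers  # timesteps at the stack's input
    return n_timesteps * frames_per_feature

print(temporal_footprint(4, frames_per_feature=8))  # 1024 (I3D + 4 layers)
print(temporal_footprint(6))                        # 512  (2D CNN + 6 layers)
```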

A Timeception layer uses grouped convolution. I advise using groups=8 or 16 in the case of I3D, and groups=16 or 32 in the case of ResNet. The reason is that I3D has 1024 channels, while ResNet has 2048 channels.

Hope this answers your questions.

fandulu commented 5 years ago

Hi Noureldien,

Got it, thanks very much for your explanation.