jfzhang95 / pytorch-video-recognition

PyTorch implemented C3D, R3D, R2Plus1D models for video activity recognition.
MIT License

Downsample step #9

Open LeoniekevandenBulk opened 6 years ago

LeoniekevandenBulk commented 6 years ago

I was checking your PyTorch implementation of the R2Plus1D model against the Caffe2 implementation in the repository of the original paper (https://github.com/facebookresearch/VMZ). I was wondering why you chose to implement the downsample step as a SpatioTemporalConv layer, while the original implementation seems to use only a single Conv3D layer. They have coded it as follows:

    if (num_filters != input_filters) or down_sampling:
        shortcut_blob = self.model.ConvNd(
            shortcut_blob,
            'shortcutprojection%d' % self.comp_count,
            input_filters,
            num_filters,
            [1, 1, 1],
            weight_init=("MSRAFill", {}),
            strides=use_striding,
            no_bias=self.no_bias,
        )
        if spatial_batch_norm:
            shortcut_blob = self.model.SpatialBN(
                shortcut_blob,
                'shortcutprojection%d_spatbn' % self.comp_count,
                num_filters,
                epsilon=1e-3,
                is_test=self.is_test,
            )
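For comparison, here is a minimal PyTorch sketch of what the Caffe2 snippet above appears to do: a single 1x1x1 convolution followed by batch normalisation. The function name `make_shortcut`, the stride of 2 when downsampling, the missing bias, and the use of `kaiming_normal_` as a stand-in for `MSRAFill` are all assumptions made for illustration, not code from either repository.

```python
import torch
import torch.nn as nn


def make_shortcut(in_channels, out_channels, down_sampling):
    """Hypothetical helper: build a projection shortcut, or return None
    when the plain identity connection can be used."""
    if in_channels == out_channels and not down_sampling:
        return None

    # Assumption: downsampling halves time, height and width.
    stride = 2 if down_sampling else 1

    # One 1x1x1 convolution, no factorisation into spatial and temporal parts.
    conv = nn.Conv3d(in_channels, out_channels, kernel_size=1,
                     stride=stride, bias=False)
    # He-style initialisation, used here as an approximation of MSRAFill.
    nn.init.kaiming_normal_(conv.weight, mode="fan_out", nonlinearity="relu")

    # epsilon=1e-3 mirrors the value passed to SpatialBN in the snippet.
    return nn.Sequential(conv, nn.BatchNorm3d(out_channels, eps=1e-3))


shortcut = make_shortcut(64, 128, down_sampling=True)
x = torch.randn(1, 64, 8, 28, 28)   # (batch, channels, time, height, width)
print(shortcut(x).shape)            # torch.Size([1, 128, 4, 14, 14])
```

Note that the shortcut contains no ReLU, so the projection stays linear apart from the batch normalisation.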

Was this design choice on purpose, and if so, could you perhaps tell me why?

Thanks!

jfzhang95 commented 6 years ago

Hi, sorry for the late reply.

You could look here. When the model is r2plus1d, is_decomposed is set to True.

When is_decomposed is set to True, it uses SpatioTemporalConv instead of a plain 3D convolution, which you can check here.
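The switch described above can be sketched roughly as follows. This is a simplified illustration, not the code from either repository: `FactoredConv`, `make_conv` and `intermediate_channels` are hypothetical names, and the intermediate width uses the formula from the R(2+1)D paper, which keeps the factored block at about the same number of weights as the full 3D convolution.

```python
import torch
import torch.nn as nn


def intermediate_channels(n_in, n_out, t, d):
    # Hidden width M = floor(t*d^2*N_in*N_out / (d^2*N_in + t*N_out)).
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


class FactoredConv(nn.Module):
    """A t x d x d convolution split into a 1 x d x d spatial convolution
    and a t x 1 x 1 temporal convolution, with BN and ReLU in between."""

    def __init__(self, n_in, n_out, t=3, d=3):
        super().__init__()
        mid = intermediate_channels(n_in, n_out, t, d)
        self.spatial = nn.Conv3d(n_in, mid, (1, d, d),
                                 padding=(0, d // 2, d // 2), bias=False)
        self.bn = nn.BatchNorm3d(mid)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid, n_out, (t, 1, 1),
                                  padding=(t // 2, 0, 0), bias=False)

    def forward(self, x):
        return self.temporal(self.relu(self.bn(self.spatial(x))))


def make_conv(n_in, n_out, is_decomposed):
    # is_decomposed=True selects the (2+1)D variant, False the full 3D one.
    if is_decomposed:
        return FactoredConv(n_in, n_out)
    return nn.Conv3d(n_in, n_out, kernel_size=3, padding=1, bias=False)


x = torch.randn(2, 64, 4, 8, 8)
print(make_conv(64, 64, True)(x).shape)    # torch.Size([2, 64, 4, 8, 8])
print(make_conv(64, 64, False)(x).shape)   # torch.Size([2, 64, 4, 8, 8])
```

Both variants map the input to the same output shape; the factored one adds an extra nonlinearity between the spatial and temporal steps.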

LeoniekevandenBulk commented 6 years ago

Hi, thanks for your reply.

I understand that a SpatioTemporalConv is needed for the R(2+1)D network, but I don't think the original authors use it in their downsample step specifically, as can be found here. Your downsample step, however, does use a SpatioTemporalConv. Could you explain why?
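To make the difference concrete, the following sketch counts convolution weights for the two shortcut styles when going from 64 to 128 channels. It assumes that the factored shortcut uses a kernel size of 1 and sizes its intermediate layer with the formula from the R(2+1)D paper; neither detail is confirmed in this thread, and the function names are hypothetical.

```python
def intermediate_channels(n_in, n_out, t, d):
    # Hidden width from the R(2+1)D paper (assumed to apply here too).
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def single_conv_weights(n_in, n_out):
    # One 1x1x1 convolution without bias.
    return n_in * n_out


def factored_weights(n_in, n_out):
    # 1x1x1 "spatial" conv into the hidden layer, then 1x1x1 "temporal" conv.
    mid = intermediate_channels(n_in, n_out, t=1, d=1)
    return n_in * mid + mid * n_out, mid


print(single_conv_weights(64, 128))   # 8192
print(factored_weights(64, 128))      # (8064, 42)
```

Under these assumptions the weight counts are close, but the factored shortcut routes the signal through a 42-channel hidden layer with its own batch normalisation and ReLU, whereas the single-convolution shortcut is a purely linear projection.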

JinXiaozhao commented 5 years ago

In your R(2+1)D network code:

    self.conv3 = SpatioTemporalResLayer(64, 128, 3, layer_sizes[1], block_type=block_type, downsample=True)

the docstring says:

    downsample (bool, optional): If True, the first block in layer will implement downsampling. Default: False

Here the output size is 128 and the input size is 64, so why is downsample=True? Thanks! @jfzhang95
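One way to look at this question is through the tensor shapes involved. The sketch below uses plain arithmetic to follow a 64-channel input through a block that outputs 128 channels with stride 2; the stride of 2, the 3x3x3 kernel with padding 1, and the example input size are assumptions chosen for illustration.

```python
def conv_out(size, kernel, stride, padding):
    # Standard output-length formula for a convolution along one axis.
    return (size + 2 * padding - kernel) // stride + 1


def main_path_shape(shape, out_channels, stride):
    # Main branch: 3x3x3 convolution with padding 1 (assumed).
    _, t, h, w = shape
    return (out_channels,) + tuple(conv_out(s, 3, stride, 1) for s in (t, h, w))


def projection_shape(shape, out_channels, stride):
    # Shortcut branch: 1x1x1 convolution with the same stride, no padding.
    _, t, h, w = shape
    return (out_channels,) + tuple(conv_out(s, 1, stride, 0) for s in (t, h, w))


x_shape = (64, 8, 56, 56)                    # (channels, time, height, width)
print(main_path_shape(x_shape, 128, 2))      # (128, 4, 28, 28)
print(x_shape)                               # identity shortcut: (64, 8, 56, 56)
print(projection_shape(x_shape, 128, 2))     # (128, 4, 28, 28)
```

The identity shortcut cannot be added to the main path here, because both the channel count and the temporal and spatial sizes differ, so some projection is needed on the shortcut. This only illustrates the shapes and does not settle which kind of projection the repository should use.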