swathikirans / ego-rnn

Attention is All We Need: Nailing Down Object-centric Attention for Egocentric Activity Recognition - BMVC 2018

Regarding implementation details #1

Closed Nd-sole closed 4 years ago

Nd-sole commented 5 years ago

Hi @swathikirans

I read your paper and also went through your code. A few things are unclear to me while implementing. Could you explain them with reference to your paper:

1) What do you mean by RGB stage1, stage2, Flow, and Two Stream?

I see that in the two stream network you combine both RGB and Flow. Could you also explain it with reference to the network diagram shown in your paper?

2) For stage2, do you use the stage1 model as pre-trained weights?

swathikirans commented 5 years ago

1) The last paragraph of section 3.2 explains the training details of the RGB stream. The RGB stream is trained in two stages: in stage 1, the ConvLSTM and the classifier layers are trained (green colored blocks in fig. 2); in stage 2, the final layer of ResNet, the classifier of ResNet, the ConvLSTM and the classifier layers are trained.
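In code, selecting the trainable parameters per stage might look something like the sketch below. The attribute names (`resNet`, `lstm_cell`, `classifier`) are assumptions about the model wrapper for illustration, not the exact names used in ego-rnn:

```python
# Sketch of per-stage parameter selection; attribute names are assumed.
def set_trainable_params(model, stage):
    # Freeze everything first.
    for param in model.parameters():
        param.requires_grad = False

    # Stage 1: train only the ConvLSTM and the classifier layers
    # (the green blocks in fig. 2).
    modules_to_train = [model.lstm_cell, model.classifier]

    # Stage 2: starting from the stage 1 weights, also fine-tune the
    # final ResNet block (layer4) and the ResNet classifier.
    if stage == 2:
        modules_to_train += [model.resNet.layer4, model.resNet.fc]

    for module in modules_to_train:
        for param in module.parameters():
            param.requires_grad = True
```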

For flow, a ResNet-34 pre-trained on the ImageNet dataset is used. The input convolutional layer of the ResNet is modified to accept a stack of optical flow images.

The RGB network and the flow network are first trained separately. Then they are combined to obtain the two stream network. The figure only shows the RGB stream since it is the contribution of the paper. For the flow and two stream networks, we follow standard approaches.

2) Yes, stage 2 uses the stage 1 model as pre-trained weights.

Nd-sole commented 5 years ago

Thank you for your response.

Does the flow network have the same structure as the RGB network? Can you explain the difference between them (I am sorry if this sounds naive)?

Since the RGB network is the one shown in the paper, I assume the flow network is the same?

swathikirans commented 5 years ago

The flow network follows the exact architecture of ResNet-34, with changes only in the input conv layer and the final classifier layer. There is no attention in the flow network. It accepts a stack of consecutive optical flow images from a video and classifies them into an activity class.

The flow network is defined in this file: https://github.com/swathikirans/ego-rnn/blob/master/flow_resnet.py
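For readers of this thread, the two changes described above boil down to something like the following sketch. This is a minimal illustration, not the actual contents of flow_resnet.py; the function name and `flow_channels` default are assumptions:

```python
import torch.nn as nn
from torchvision.models import resnet34

def make_flow_resnet(num_classes, flow_channels=10):
    # Start from an ImageNet pre-trained ResNet-34.
    model = resnet34(pretrained=True)

    # Change 1: the input conv takes a stack of flow images
    # (e.g. 10 channels for 5 frame pairs) instead of 3-channel RGB.
    model.conv1 = nn.Conv2d(flow_channels, 64, kernel_size=7,
                            stride=2, padding=3, bias=False)

    # Change 2: replace the 1000-way ImageNet classifier with an
    # activity classifier. Everything in between is plain ResNet-34.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```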

Nd-sole commented 5 years ago

OK, thanks a lot. I am sorry for missing this information in your previous answer.

So, what I understood is that the RGB network is the one shown in the figure, which uses ResNet-34 along with the attention network. It is trained in two stages, as explained in the paper and in your answer.

The flow network is only a ResNet-34; it is trained independently from the RGB network and has its own full-fledged classification output.

Then you ensemble the outputs from both networks in the two stream network. But you also train both networks again in the two stream setting, using the pre-trained networks from the above two steps. I am a bit unsure of my understanding of the two-stream network.

As you said in the above answer, combining both networks gives the two stream network. What do you mean by combining? Can you explain how you made the two stream network?

swathikirans commented 5 years ago

For the two stream network, we concatenated the outputs before the classifier layer from the individual streams (for ResNet-34, they are of dimension 512) and added a new FC layer that maps this 2×512 vector to the number of classes. The FC layer of this new network, and all the layers of the individual networks till layer 4, are then trained.

We also tried simple average fusion of the prediction scores from the individual networks, as is common practice. However, we found that the concatenation approach above gave better performance.
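A minimal sketch of this feature-level fusion, assuming each stream exposes its 512-dimensional pre-classifier feature (the class and argument names here are illustrative):

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Concatenate the 512-d pre-classifier features of the two streams
    and classify with a new FC layer."""
    def __init__(self, rgb_model, flow_model, num_classes, feat_dim=512):
        super().__init__()
        self.rgb_model = rgb_model    # pre-trained RGB (attention) stream
        self.flow_model = flow_model  # pre-trained flow stream
        self.fc = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgb_frames, flow_stack):
        # Each stream is assumed to return the activations just before
        # its own classifier layer, i.e. a 512-d feature vector.
        rgb_feat = self.rgb_model(rgb_frames)    # (N, 512)
        flow_feat = self.flow_model(flow_stack)  # (N, 512)
        fused = torch.cat((rgb_feat, flow_feat), dim=1)  # (N, 1024)
        return self.fc(fused)

# The score-level alternative mentioned above would instead average the
# two streams' logits: 0.5 * (rgb_logits + flow_logits).
```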

Nd-sole commented 5 years ago

Hi,

Why did you use nn.Conv2d instead of nn.Conv3d (all the other layers are also 2D)? I assume you are inputting frames from a video. Can you explain how you are doing it?

swathikirans commented 5 years ago

I guess this question is regarding the flow stream. For each RGB frame pair, two corresponding flow images are present, one in the x direction and one in the y direction, each of dimension 1×W×H. We used a stack of optical flow corresponding to 5 frames, so we have 10 flow images as input. The flow images are concatenated across the channel dimension (1 in this case) instead of a new temporal dimension, thereby obtaining an input of 10×W×H. We did not use 3D convolutions since we wanted to make use of the ImageNet pre-trained network. The idea is explained well under "Cross Modality Pre-training" in the following paper: https://arxiv.org/pdf/1608.00859.pdf
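A sketch of both ideas, stacking the flow images along the channel axis and initializing the widened input conv by averaging the pre-trained RGB kernels across channels (the cross-modality pre-training trick from the linked paper). The 224×224 size and variable names are illustrative:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

# 5 frame pairs -> 5 x-flow + 5 y-flow images, each 1 x H x W.
flow_images = [torch.randn(1, 224, 224) for _ in range(10)]

# Concatenate along the channel dimension (dim 0 here), not a new
# temporal dimension, giving a single 10 x H x W input tensor.
flow_stack = torch.cat(flow_images, dim=0).unsqueeze(0)  # (1, 10, 224, 224)

model = resnet34(pretrained=True)

# Cross-modality pre-training: build a 10-channel input conv whose weights
# are the mean of the pre-trained 3-channel kernels, replicated 10 times.
old_conv = model.conv1
new_conv = nn.Conv2d(10, 64, kernel_size=7, stride=2, padding=3, bias=False)
with torch.no_grad():
    mean_weight = old_conv.weight.mean(dim=1, keepdim=True)  # (64, 1, 7, 7)
    new_conv.weight.copy_(mean_weight.repeat(1, 10, 1, 1))   # (64, 10, 7, 7)
model.conv1 = new_conv

# The forward pass then uses ordinary 2D convolutions throughout.
out = model(flow_stack)
```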