srijandas07 / VPN

Pose driven attention mechanism

Questions about attention network #7

Closed TitaniumOne closed 3 years ago

TitaniumOne commented 3 years ago

Thank you for your great work! I have two questions: Q1: According to the code below, I do not understand why As (or z1) and At (or z2) denote spatial and temporal features, respectively. As I see it, spatial features should come from features passed through a convolution layer along the channel axis, while temporal features should come from features passed through a pooling layer. Am I misunderstanding something or missing a detail?

    z1 = Dense(256, activation='tanh', name='z1_layer', trainable=True)(model_gcnn.get_layer('gcnn_out').output)
    z2 = Dense(128, activation='tanh', name='z2_layer', trainable=True)(model_gcnn.get_layer('gcnn_out').output)

    fc_main_spatial = Dense(49, activity_regularizer=attention_reg, kernel_initializer='zeros', bias_initializer='zeros',
                    activation='sigmoid', trainable=True, name='dense_spatial')(z1)
    fc_main_temporal = Dense(2, activity_regularizer=attention_reg, kernel_initializer='zeros',
                            bias_initializer='zeros',
                            activation='softmax', trainable=True, name='dense_temporal')(z2)

Q2: What is the contribution of the 2D CNN layer after the Pose Backbone? How does it help the spatio-temporal coupler? Looking forward to your reply! Thank you!

srijandas07 commented 3 years ago

@TitaniumOne thank you for your interest in our work.

  1. z1 and z2 are the latent features from which the spatial and temporal attention weights are derived. Look at their output dimensions: 49 neurons for the 7x7 spatial dimension of the visual feature map, and 2 (or 8, if input frames = 64) neurons for the temporal dimension of the visual feature map (the I3D output).
  2. The 2D CNN layer is fed with the GCN output features in order to perform temporal modeling on the pose-based features.
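To make the dimension correspondence concrete, here is a minimal numpy sketch (not the repository's code) of how a 49-dim spatial attention vector and a temporal attention vector line up with an I3D mixed_5c feature map; the random values and the 16-frame clip size are assumptions for illustration only:

```python
import numpy as np

# Hypothetical I3D mixed_5c output for a 16-frame clip: (T, H, W, C) = (2, 7, 7, 1024)
feat = np.random.rand(2, 7, 7, 1024).astype(np.float32)

# Attention vectors with the dimensions discussed above (random stand-ins here)
spatial_att = np.random.rand(49).astype(np.float32)   # one weight per 7x7 location
temporal_att = np.random.rand(2).astype(np.float32)   # one weight per time step
temporal_att /= temporal_att.sum()                    # normalize, mimicking the softmax head

# Reshape so the weights broadcast over the matching axes of the feature map:
# 49 -> (1, 7, 7, 1) modulates space, 2 -> (2, 1, 1, 1) modulates time.
att_feat = feat * spatial_att.reshape(1, 7, 7, 1) * temporal_att.reshape(2, 1, 1, 1)
print(att_feat.shape)  # (2, 7, 7, 1024)
```

The point is only that each attention head has exactly one weight per position along the axis it modulates, which is why the dense layers have 49 and 2 (or 8) units.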
TitaniumOne commented 3 years ago

Thank you so much for your kind reply! So with the help of the 2D CNN, the skeleton feature map is reduced from 3D to 2D. Am I right? I think the feature after the 2D CNN, named h*, contains spatial and temporal semantics (information) simultaneously. Why can it be separated, and how is that done? As for the temporal features, I still do not know how to set the number of neurons; what is the relationship or constraint between the number of neurons and the number of input frames? Thanks again!

srijandas07 commented 3 years ago

For modeling temporal information, conv operations followed by pooling ops are performed. The features are finally flattened before being fed to the FC layers. h* is separated into two streams, where each stream learns semantics-aware attention weights (spatial and temporal). Backpropagation drives the features in each stream to learn spatial and temporal semantics, respectively. On how to set the no. of neurons: please refer to the I3D architecture. If your input dimension is 64x224x224x3, the feature dimension at mixed_5c will be 8x7x7x1024. So you encode 8 neurons for time and 7x7 = 49 neurons for space.
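The neuron-count bookkeeping above can be checked with a few lines of arithmetic; the 8x temporal and 32x spatial downsampling factors of I3D up to mixed_5c are taken from the shapes quoted in this thread:

```python
# I3D reduces time by 8x and space by 32x up to mixed_5c
# (factors inferred from 64x224x224 -> 8x7x7 in the reply above).
frames, height, width = 64, 224, 224
t_out = frames // 8            # temporal extent of mixed_5c
h_out, w_out = height // 32, width // 32

temporal_neurons = t_out       # one neuron per time step
spatial_neurons = h_out * w_out  # one neuron per spatial location

print(t_out, h_out, w_out)                 # 8 7 7
print(temporal_neurons, spatial_neurons)   # 8 49
```

So the constraint is simply: the temporal dense layer needs as many units as time steps survive the backbone (8 for 64 input frames, 2 for 16), and the spatial dense layer needs H*W = 49 units.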

srijandas07 commented 3 years ago

I hope all your queries have been answered. So, I am closing this issue.