Closed TitaniumOne closed 3 years ago
@TitaniumOne thank you for your interest in our work.
Thank you so much for your kind reply! So with the help of the 2D CNN, the dimension of the skeleton feature map is reduced from 3D to 2D. Am I right?
I think the feature after the 2D CNN, named `h*`, contains the spatial and temporal semantics (information) simultaneously. Why can it be separated, and how is that done?
As for the temporal feature, I still have no idea how to set the number of neurons. What is the relationship or constraint between the neurons and the input frames?
Thanks again!
For modeling temporal information, conv operations followed by pooling ops are performed. The features are finally flattened before being fed to the FC layers. `h*` is separated into two streams, where each stream learns semantics-aware attention weights (spatial & temporal). Backprop encourages the features in each stream to learn spatial and temporal semantics, respectively. How to set the no. of neurons: please refer to the I3D architecture. If your input dimension is 64x224x224x3, the feature dimension at mixed_5c will be 8x7x7x1024. Now, encode 8 neurons for time and 7x7 = 49 neurons for space.
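A minimal numpy sketch of that neuron split, following the I3D mixed_5c example above. The mean-pooling and softmax used here are my own placeholder choices to illustrate the two streams, not necessarily the exact ops in the paper:

```python
import numpy as np

# Feature map from an I3D-style backbone at mixed_5c:
# input 64x224x224x3 -> features of shape (T, H, W, C) = (8, 7, 7, 1024).
h = np.random.rand(8, 7, 7, 1024)

# Temporal stream: pool over space and channels -> 8 "time" neurons.
temporal = h.mean(axis=(1, 2, 3))          # shape (8,)

# Spatial stream: pool over time and channels -> 7x7 = 49 "space" neurons.
spatial = h.mean(axis=(0, 3)).reshape(-1)  # shape (49,)

def softmax(x):
    # Numerically stable softmax as a stand-in for the attention head.
    e = np.exp(x - x.max())
    return e / e.sum()

# Each stream would feed its own FC/attention layer; softmax here
# just turns the pooled neurons into normalized attention weights.
temporal_attn = softmax(temporal)  # 8 temporal attention weights
spatial_attn = softmax(spatial)    # 49 spatial attention weights

print(temporal_attn.shape, spatial_attn.shape)  # (8,) (49,)
```

So the constraint on the neuron count comes directly from the backbone's output shape: 8 temporal positions and 49 spatial positions for a 64-frame 224x224 input.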
I hope all your queries have been answered. So, I am closing this issue.
Thank you for your great work! I have two questions as follows:

Q1: According to the code below, I am not aware of why `As` (or `z1`) and `At` (or `z2`) denote spatial features and temporal features, respectively. From where I stand, spatial features derive from the features passed through the convolution layer along the channel axis, while temporal features derive from the features passed through the pooling layer. Do I misunderstand it, or am I missing some details?

Q2: What is the contribution of the `2D CNN layer` after the `Pose Backbone`? How does it help the spatial-temporal coupler?

Looking forward to your reply! Thank you!