Introduce 3D pose as the intermediary and propose a Pose and Mesh Co-Evolution network (PMCE) that decouples the human mesh recovery task into two parts: video-based 3D human pose estimation, and mesh vertex regression from the estimated 3D pose and a temporal image feature.
先行研究と比べてどこがすごい?/ What makes this work greater than existing works?
Existing video-based methods generally recover human mesh by estimating complex pose and shape parameters from coupled image features; the high complexity and limited representation ability of these parameters often result in inconsistent pose motion and limited shape patterns.
Propose a two-stream encoder: one stream takes a 2D pose sequence detected from the input images and estimates the mid-frame 3D pose, while the other stream extracts static image features from the frames and aggregates them into a temporal image feature.
Design a co-evolution decoder that performs pose and mesh interactions with an image-guided Adaptive Layer Normalization (AdaLN). (AdaLN guides the interactions by adjusting the statistics of the joint and vertex features based on the temporal image feature, so that the pose and mesh fit the actual human body shape.)
技術や手法のキモはどこ? /What is the heart of this technology or method?
2D pose normalization by the full image
Previous top-down methods detect and crop the human region before processing--->the cropping is effective in reducing background noise and simplifying feature extraction, but discards the location information in the full image, which is essential for predicting the global rotation in the original camera coordinate system--->so normalize the 2D pose with respect to the full image instead of the cropped region.
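The difference can be illustrated with a minimal pure-Python sketch (hypothetical keypoints and crop boxes, not the paper's code): normalizing by the crop box makes a person at the left and at the right of the frame look identical, while normalizing by the full image preserves the global location needed for global rotation.

```python
def normalize_full_image(keypoints, img_w, img_h):
    """Normalize 2D keypoints to [-1, 1] w.r.t. the FULL image,
    preserving the person's global location in the frame."""
    return [(2.0 * x / img_w - 1.0, 2.0 * y / img_h - 1.0) for x, y in keypoints]

def normalize_crop(keypoints, box):
    """Normalize w.r.t. a crop box (x0, y0, w, h): location in the frame is lost."""
    x0, y0, w, h = box
    return [(2.0 * (x - x0) / w - 1.0, 2.0 * (y - y0) / h - 1.0) for x, y in keypoints]

# The same joint, with the person at the left vs. right of a 1000x500 frame:
left  = normalize_full_image([(100.0, 250.0)], 1000, 500)
right = normalize_full_image([(900.0, 250.0)], 1000, 500)
# Crop-normalized coordinates (crop centered on the person) come out identical:
crop_left  = normalize_crop([(100.0, 250.0)], (50.0, 200.0, 100.0, 100.0))
crop_right = normalize_crop([(900.0, 250.0)], (850.0, 200.0, 100.0, 100.0))
```

The full-image coordinates distinguish the two positions; the crop coordinates cannot.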
Architecture
Two-Stream Encoder
(1) normalized 2D pose sequence---> spatial-temporal Transformer (ST-Transformer)--->mid-frame 3D pose
(2) static image features of T frames--->bi-directional GRU--->mid-frame temporal image feature
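The temporal stream in step (2) can be sketched with a minimal numpy bi-directional GRU (sizes are illustrative, the two directions share weights here for brevity, and weights are random stand-ins for learned parameters — this is not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, W, U):
    """One GRU step; W and U stack the update/reset/candidate weights."""
    z = sigmoid(x @ W[0] + h @ U[0])          # update gate
    r = sigmoid(x @ W[1] + h @ U[1])          # reset gate
    h_tilde = np.tanh(x @ W[2] + (r * h) @ U[2])
    return (1 - z) * h + z * h_tilde

def bigru_mid_frame(frames, W, U):
    """Run a GRU forward and backward over T per-frame features and
    concatenate the two hidden states at the middle frame."""
    T, _ = frames.shape
    H = U[0].shape[0]
    h_fwd, h_bwd = np.zeros(H), np.zeros(H)
    fwd, bwd = [], [None] * T
    for t in range(T):
        h_fwd = gru_cell(frames[t], h_fwd, W, U)
        fwd.append(h_fwd)
    for t in reversed(range(T)):
        h_bwd = gru_cell(frames[t], h_bwd, W, U)
        bwd[t] = h_bwd
    mid = T // 2
    return np.concatenate([fwd[mid], bwd[mid]])

T, D, H = 16, 128, 64                     # frames, feature dim, hidden size (illustrative)
W = rng.standard_normal((3, D, H)) * 0.1
U = rng.standard_normal((3, H, H)) * 0.1
static_feats = rng.standard_normal((T, D))          # stand-in for per-frame CNN features
temporal_feat = bigru_mid_frame(static_feats, W, U)  # mid-frame temporal image feature
```

Reading the hidden states at the middle frame gives a feature that has seen both past and future context, matching the "mid-frame temporal feature" above.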
Co-Evolution Decoder
The joint feature serves as the query Q (the vertex feature in the other branch), while the vertex feature (the joint feature in the other branch) serves as the key K and value V.
(1) 3D pose P0, temporal image feature f, and a coarse template mesh M0 (provided by SMPL)---> co-evolution block--->updated pose and coarse mesh--->upsampling--->full-resolution mesh
(2) The 3D pose focuses on skeletal motion and mainly provides pose information, while the image feature contains visual cues such as body shape and surface deformation, which are complementary to the sparse 3D pose.
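The joint-vertex interaction described above is standard cross-attention; a minimal numpy sketch (joint/vertex counts, feature dim, and random weights are all illustrative stand-ins, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(query_feat, kv_feat, Wq, Wk, Wv):
    """query_feat attends to kv_feat via scaled dot-product attention."""
    Q, K, V = query_feat @ Wq, kv_feat @ Wk, kv_feat @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (num_query, num_kv) weights
    return A @ V

J, N, D = 17, 431, 64   # joints, coarse-mesh vertices, feature dim (illustrative)
joint_feat  = rng.standard_normal((J, D))
vertex_feat = rng.standard_normal((N, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

# Pose branch: joints query vertices.  Mesh branch: vertices query joints.
joint_out  = cross_attention(joint_feat, vertex_feat, Wq, Wk, Wv)
vertex_out = cross_attention(vertex_feat, joint_feat, Wq, Wk, Wv)
```

Each branch keeps its own token count but mixes in information from the other representation, which is the co-evolution idea.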
Adaptive layer normalization
(1) Each feature is normalized by AdaLN conditioned on the image feature f, so the shape information contained in the image feature can be injected into the joint and vertex features while preserving their spatial structure.
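A minimal numpy sketch of the AdaLN idea, assuming the usual adaptive-normalization formulation (the scale gamma and shift beta are regressed from the image feature f by hypothetical linear maps `W_gamma`, `W_beta`; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def ada_layer_norm(tokens, img_feat, W_gamma, W_beta, eps=1e-5):
    """Normalize each token (joint/vertex feature) over its channels,
    then scale and shift with gamma, beta predicted from the image feature f."""
    mu = tokens.mean(axis=-1, keepdims=True)
    var = tokens.var(axis=-1, keepdims=True)
    normed = (tokens - mu) / np.sqrt(var + eps)
    gamma = img_feat @ W_gamma   # (C,) scale carrying shape information from f
    beta = img_feat @ W_beta     # (C,) shift
    return gamma * normed + beta  # same gamma/beta per token: spatial structure kept

D, C = 256, 64                       # image-feature dim, token channel dim (illustrative)
f = rng.standard_normal(D)           # temporal image feature
W_gamma = rng.standard_normal((D, C)) * 0.01
W_beta  = rng.standard_normal((D, C)) * 0.01
joint_tokens = rng.standard_normal((17, C))
out = ada_layer_norm(joint_tokens, f, W_gamma, W_beta)
```

Because gamma and beta depend only on f and are shared across tokens, the image feature modulates the statistics of every joint/vertex feature without altering the per-token layout.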
どうやって有効だと検証した? /How is this work validated?
議論はある?/ Any discussion for this work?
読んでいてわからなかったところは?/ What don't you understand for this paper?
2D pose normalization by the full image
Adaptive layer normalization
公開コードやデータセットは? /Public codes or datasets?
code: https://github.com/kasvii/PMCE