Mesh Graphormer
Transformer-based approaches are effective at modeling non-local interactions among 3D mesh vertices and body joints, whereas graph convolutional neural networks (GCNNs) are good at exploiting neighborhood vertex interactions based on a prespecified mesh topology. The paper studies how to combine graph convolutions and self-attention in a transformer to model both local and global interactions.
(In short: mix graph convolutions and self-attention within a transformer so that the model captures interactions between both adjacent and distant mesh vertices and body joints in human pose and mesh estimation.)
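A minimal NumPy sketch of the idea, assuming a single attention head, a toy 4-vertex path graph, and random weights: self-attention mixes all vertices (global), then a graph convolution mixes only neighbors given by the adjacency matrix (local). The function names, the residual connections, and the attention-then-convolution ordering are illustrative assumptions, not the paper's exact block.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: every vertex attends to every vertex."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def normalized_adjacency(A):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with self-loops."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_conv(X, A_norm, W):
    """One graph convolution: each vertex aggregates only its neighbors."""
    return np.maximum(A_norm @ X @ W, 0.0)

def graphormer_like_block(X, A_norm, params):
    """Hypothetical block: global attention, then local graph convolution."""
    X = X + self_attention(X, params["Wq"], params["Wk"], params["Wv"])
    X = X + graph_conv(X, A_norm, params["Wg"])
    return X

rng = np.random.default_rng(0)
n_vertices, dim = 4, 8
# Toy mesh topology (assumption): a path 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_norm = normalized_adjacency(A)
X = rng.normal(size=(n_vertices, dim))
params = {k: rng.normal(scale=0.1, size=(dim, dim))
          for k in ("Wq", "Wk", "Wv", "Wg")}
out = graphormer_like_block(X, A_norm, params)
print(out.shape)
```

In this sketch, changing the features of vertex 3 leaves the graph-convolution output at vertex 0 untouched (they are not neighbors), while the attention output at vertex 0 does change; that contrast is the local/global split the notes describe.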
THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers
Regress an intermediate 3D representation in the form of surface landmarks (markers) and regularize it during training using a statistical body model. Preserve the spatial structure of high-level image features by avoiding pooling operations, relying instead on self-attention to enrich the representation.
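To illustrate what "avoiding pooling" preserves, here is a small sketch contrasting global average pooling with keeping one token per spatial location. The helper names and the tiny 2x3x4 feature map are assumptions for illustration; this is not THUNDR's actual code.

```python
import numpy as np

def pooled_descriptor(feat):
    """Global average pooling: (C, H, W) -> (C,). The spatial layout is lost."""
    return feat.mean(axis=(1, 2))

def spatial_tokens(feat):
    """Keep every location as a token: (C, H, W) -> (H*W, C), row-major order."""
    C, H, W = feat.shape
    return feat.reshape(C, H * W).T

# Toy feature map (assumption): 2 channels on a 3x4 grid.
feat = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
pooled = pooled_descriptor(feat)
tokens = spatial_tokens(feat)
print(pooled.shape, tokens.shape)
```

The token sequence can be passed to self-attention directly, so each output can still be traced back to a grid location (token index `i` corresponds to row `i // W`, column `i % W`).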
One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer: resolution issues
Performs a feature-level upsample-crop scheme to extract high-resolution part-specific features and adopts keypoint-guided deformable attention to estimate the hands and face precisely (deformable attention is borrowed from Deformable DETR).
Proposes a differentiable feature-level upsampling-crop strategy, inspired by the recent ViTDet, to improve hand and face regression: reshape the feature tokens Tf into a feature map and upsample it into multiple higher-resolution feature maps via deconvolution layers.
Leverages 2D keypoint positions as prior knowledge to obtain better component tokens Tc than random initialization.
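The three steps above (reshape tokens to a map, upsample and crop a part region, initialize component tokens from keypoint locations) can be sketched as follows. This is a simplified illustration: nearest-neighbor upsampling stands in for the learned deconvolution layers, the crop box and keypoints are made-up values, and the deformable attention itself is not shown.

```python
import numpy as np

def tokens_to_map(tokens, h, w):
    """Reshape feature tokens Tf of shape (h*w, C) into a map (C, h, w)."""
    return tokens.T.reshape(-1, h, w)

def upsample_nearest(fmap, scale):
    """Stand-in for deconvolution: repeat each cell `scale` times per axis."""
    return fmap.repeat(scale, axis=1).repeat(scale, axis=2)

def crop(fmap, box):
    """Crop a part region (e.g. a hand); box = (x0, y0, x1, y1) in pixels."""
    x0, y0, x1, y1 = box
    return fmap[:, y0:y1, x0:x1]

def keypoint_tokens(fmap, keypoints):
    """Initialize component tokens Tc by sampling the map at (x, y) keypoints."""
    xs, ys = keypoints[:, 0], keypoints[:, 1]
    return fmap[:, ys, xs].T  # (K, C)

# Toy inputs (assumptions): 12 tokens with 2 channels on a 3x4 grid.
tokens = np.arange(24, dtype=float).reshape(12, 2)
fmap = tokens_to_map(tokens, h=3, w=4)          # (2, 3, 4)
hi_res = upsample_nearest(fmap, scale=2)        # (2, 6, 8)
hand = crop(hi_res, box=(2, 2, 6, 6))           # (2, 4, 4)
kps = np.array([[0, 0], [3, 3]])                # keypoints in crop coordinates
Tc = keypoint_tokens(hand, kps)                 # (2 keypoints, 2 channels)
print(hi_res.shape, hand.shape, Tc.shape)
```

Starting each component token from the feature at its keypoint gives the decoder a location-specific prior, in contrast to a randomly initialized query that has to learn where to look.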
MotionBERT
Designs a Dual-stream Spatio-temporal Transformer (DSTformer) as the motion encoder to capture long-range relationships among skeleton keypoints. Its spatial and temporal multi-head self-attention (MHSA) blocks capture intra-frame and inter-frame body joint interactions, respectively.
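A rough sketch of the spatial/temporal split, assuming a motion tensor of shape (frames, joints, channels), single-head attention, and random weights. Spatial attention lets joints attend to each other within a frame; temporal attention lets one joint attend to itself across frames. The two stream orderings are fused here by a plain average, and both streams share weights; both choices are simplifications made for brevity, not the paper's design.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Single-head attention over the second-to-last axis of X (..., N, C)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def spatial_block(X, p):
    """X: (T, J, C). Joints attend to joints within the same frame."""
    return X + attention(X, *p)

def temporal_block(X, p):
    """X: (T, J, C). Each joint attends to itself across frames."""
    Xt = np.swapaxes(X, 0, 1)  # (J, T, C)
    return X + np.swapaxes(attention(Xt, *p), 0, 1)

def dual_stream(X, p_s, p_t):
    """Two orderings (spatial->temporal, temporal->spatial), averaged."""
    st = temporal_block(spatial_block(X, p_s), p_t)
    ts = spatial_block(temporal_block(X, p_t), p_s)
    return 0.5 * (st + ts)

rng = np.random.default_rng(0)
T, J, C = 3, 5, 4  # toy sizes (assumption)
X = rng.normal(size=(T, J, C))
p_s = tuple(rng.normal(scale=0.5, size=(C, C)) for _ in range(3))
p_t = tuple(rng.normal(scale=0.5, size=(C, C)) for _ in range(3))
out = dual_stream(X, p_s, p_t)
print(out.shape)
```

Perturbing frame 1 leaves the spatial block's output for frame 0 unchanged but alters the temporal block's output for frame 0, which matches the intra-frame versus inter-frame roles described above.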