Hi, thanks for your interest in our work, I hope this answers your questions: your custom `beta` is not required, as it would change the shape of the avatar. So you'll need to combine the `beta` from the avatar model with your custom `poses`, `root_orient`, and `trans`, use the SMPL model to compute the bone transformations, and then save them in the same format as in the dataset preprocessing script.
Thanks for the question and answer! I also have a few questions when testing on customized videos of unseen persons. (1) I am wondering how to get the initial point cloud of the 3D canonical space from a video. It seems the 3D points in the canonical space are vertices defined on the 3D SMPL model. I mean, how do you calculate the canonical 3D points from a video using a sequence of estimated SMPL parameters? (2) Why do we need the canonical space to render frames with novel views or novel poses? Previous work, like AnimatableNeRF, HumanNeRF, Neural Actor, etc., seems to follow an 'observation space -> canonical space -> observation space' pipeline (another form of "encoder-decoder" framework, maybe?). Why not just map the point cloud of a target pose to the target RGB values directly? I guess maybe it is because NeRF or 3DGS is designed for a static person/scene? Are there any other reasons for or advantages of following this pipeline? Thanks!
(1) Given the estimated SMPL parameters of the subject in a video, we only take the shape of the subject and compute the SMPL mesh under the canonical star pose. Then we sample points on the mesh surface as the initialization of the 3D Gaussians.

(2) The previous NeRF-based methods you mentioned and our 3DGS-based method all follow the paradigm of learning the human model in the canonical space and using skinning to establish the correspondence between canonical space and observation space. It is difficult to learn directly in observation space, as the position of the same body part varies a lot across different poses. By utilizing the kinematic prior of the human body, defining a uniform canonical space allows the model (implicit field / 3DGS) to learn in the same coordinate frame, which makes it easy to integrate information from different poses and generalizes better to novel pose animation. NeRF-based methods require point sampling in observation space for volumetric rendering, so they need backward skinning to map from observation space to canonical space for the implicit field query; in contrast, as an explicit representation, 3D Gaussians can be optimized directly in the canonical space, so only forward skinning is required. In conclusion, learning in canonical space is a design choice for dynamic persons/scenes.
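To make (1) concrete, here is a rough sketch (not the actual code from this repo) of building the canonical-pose SMPL mesh from the estimated shape and sampling surface points to initialize the Gaussian centers, assuming `smplx` and `trimesh`; the canonical pose angles and file names below are placeholders:

```python
# Rough sketch of (1): SMPL mesh in the canonical pose from the estimated betas,
# then surface samples as initial 3D Gaussian means. Paths and the exact
# canonical pose values are assumptions -- see the repo's code for the real ones.
import numpy as np
import torch
import smplx
import trimesh

smpl = smplx.create('body_models', model_type='smpl', gender='neutral')

betas = torch.from_numpy(np.load('estimated_betas.npy')).float().reshape(1, -1)  # hypothetical file

# Canonical "star" pose: zero pose except slightly spread hips; the exact angles
# used by the method are defined in its own code -- these values are placeholders.
body_pose = torch.zeros(1, 69)
body_pose[0, 2] = 0.3    # left hip, rotated outward
body_pose[0, 5] = -0.3   # right hip, rotated outward

with torch.no_grad():
    out = smpl(betas=betas, body_pose=body_pose,
               global_orient=torch.zeros(1, 3), transl=torch.zeros(1, 3))

mesh = trimesh.Trimesh(vertices=out.vertices[0].numpy(),
                       faces=smpl.faces, process=False)

# Uniformly sample points on the mesh surface as the initial Gaussian centers.
points, face_idx = trimesh.sample.sample_surface(mesh, count=50_000)
np.save('canonical_init_points.npy', points)
```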
I hope it answers your questions!
Hi @mikeqzy, thank you for your incredible work! I am interested in testing your model on custom pose sequences that include only `betas`, `poses`, `root_orient`, and `trans` attributes. I have a couple of questions:

Thanks!