mikeqzy / 3dgs-avatar-release

3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting
MIT License

Test on custom poses #23

Closed: awfuact closed this issue 1 day ago

awfuact commented 1 week ago

Hi @mikeqzy, thank you for your incredible work! I am interested in testing your model on custom pose sequences that include only betas, poses, root_orient, and trans attributes. I have a couple of questions:

  1. Could you please provide the data preprocessing script that is used for generating the out-of-distribution pose examples?
  2. Is it possible to specify the camera extrinsic parameters for rendering within your model framework?
  3. I want to confirm my understanding of the canonicalized SMPL vertices. Are these vertices shared across the whole sequence, given that they are computed only from the first frame of the sequence?

Thanks!

mikeqzy commented 4 days ago

Hi, thanks for your interest in our work! I hope this answers your questions:

  1. Unfortunately, I lost the script on my previous workstation. To animate the avatar, you should not take the beta from your custom sequence, as it would change the shape of the avatar. Instead, combine the beta from the avatar model with your custom poses, root_orient, and trans, use the SMPL model to compute the bone transformations, and then save them in the same format as in the dataset preprocessing script (see the sketch after this list).
  2. Yes. The camera intrinsics and extrinsics are loaded here in the dataloader. You can easily replace them with your own camera parameters.
  3. If you are referring to the canonicalized SMPL vertices computed here, then yes, they are shared across the whole sequence. The data path of the first frame is only used to read the SMPL vertices with shape displacement. We then forward-skin them to the canonical space to compute the bounding box for coordinate normalization and sample points on the mesh as the initialization of the canonical 3D Gaussians.

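Here is a minimal sketch of the idea in point 1, assuming the `smplx` package and PyTorch. The file names, array shapes, and the keys saved at the end are hypothetical and must be matched to whatever the dataset preprocessing script actually writes (the original script is lost, as noted above):

```python
# Sketch: combine the avatar's beta with a custom pose sequence and compute
# per-frame bone transformations with SMPL. File names, shapes, and saved keys
# are illustrative only.
import numpy as np
import torch
import smplx
from smplx.lbs import blend_shapes, batch_rodrigues, batch_rigid_transform, vertices2joints

# Shape from the trained avatar, motion from the custom sequence.
betas = torch.from_numpy(np.load('avatar_betas.npy')).float().reshape(1, -1)  # (1, 10)
poses = torch.from_numpy(np.load('poses.npy')).float()                        # (T, 69) body pose, axis-angle
root_orient = torch.from_numpy(np.load('root_orient.npy')).float()            # (T, 3)
trans = np.load('trans.npy')                                                   # (T, 3)

body = smplx.create('path/to/smpl_models', model_type='smpl', gender='neutral')

# Joints of the shaped (zero-pose) template define the kinematic chain.
v_shaped = body.v_template + blend_shapes(betas, body.shapedirs)   # (1, V, 3)
joints = vertices2joints(body.J_regressor, v_shaped)               # (1, 24, 3)

bone_transforms = []
for t in range(poses.shape[0]):
    full_pose = torch.cat([root_orient[t], poses[t]]).reshape(-1, 3)   # (24, 3)
    rot_mats = batch_rodrigues(full_pose).unsqueeze(0)                 # (1, 24, 3, 3)
    _, rel_transforms = batch_rigid_transform(rot_mats, joints, body.parents)
    bone_transforms.append(rel_transforms[0].numpy())                  # (24, 4, 4)

# Global translation (`trans`) is kept separate here; how it is folded into the
# transforms must follow the convention of the dataset preprocessing script.
np.savez('custom_sequence.npz',
         betas=betas.numpy(),
         poses=poses.numpy(),
         root_orient=root_orient.numpy(),
         trans=trans,
         bone_transforms=np.stack(bone_transforms))
```
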
AndrewChiyz commented 2 days ago

Thanks for the question and answer! I also have a few questions about testing on customized videos of unseen persons.

(1) I am wondering how to get the initial point cloud of the 3D canonical space from a video. It seems the 3D points in the canonical space are vertices defined on the 3D SMPL model. In other words, how do I calculate the canonical 3D points from a video using a sequence of estimated SMPL parameters?

(2) Why do we need the canonical space to render frames with novel views or novel poses? Previous work, like AnimatableNeRF, HumanNeRF, Neural Actor, etc., seems to follow an 'observation space -> canonical space -> observation space' pipeline (another form of "encoder-decoder" framework, maybe?). Why not just map the point cloud of a target pose to the target RGB values directly? I guess maybe it is because NeRF or 3DGS are designed for static persons/scenes? Are there any other reasons for, or advantages of, following this pipeline? Thanks!

mikeqzy commented 2 days ago

(1) Given the estimated SMPL parameters of the subject in a video, we only take the shape of the subject and compute the SMPL mesh under the canonical star pose. Then we sample points on the mesh surface as the initialization of the 3D Gaussians.

(2) The previous NeRF-based methods you mentioned and our 3DGS-based method all follow the same paradigm: learn the human model in the canonical space and use skinning to establish the correspondence between canonical space and observation space. It is difficult to learn directly in observation space, as the position of the same body part varies a lot across different poses. By exploiting the kinematic prior of the human body, defining a uniform canonical space lets the model (implicit field / 3DGS) learn in a single coordinate frame, which makes it easier to integrate information from different poses and generalizes better to novel pose animation. NeRF-based methods require point sampling in observation space for volumetric rendering, so they need backward skinning to map from observation space to canonical space for implicit field queries; as an explicit representation, 3D Gaussians can be optimized directly in the canonical space, so only forward skinning is required. In conclusion, learning in canonical space is a design choice for dynamic persons/scenes.

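For (1), here is a minimal sketch, assuming the `smplx` and `trimesh` packages; the canonical star-pose hip angles and the number of sampled points are illustrative values, not necessarily those used in this repo:

```python
# Sketch: build the canonical-pose SMPL mesh from the subject's shape and
# sample surface points to initialize the 3D Gaussians. Values are illustrative.
import numpy as np
import torch
import smplx
import trimesh

betas = torch.zeros(1, 10)          # subject shape (beta) estimated from the video
body = smplx.create('path/to/smpl_models', model_type='smpl', gender='neutral')

# Canonical "star" pose: zero pose except for spreading the legs at the hips.
body_pose = torch.zeros(1, 69)
body_pose[:, 2] = np.deg2rad(30)    # left hip, rotation about z (illustrative angle)
body_pose[:, 5] = -np.deg2rad(30)   # right hip

out = body(betas=betas, body_pose=body_pose,
           global_orient=torch.zeros(1, 3), transl=torch.zeros(1, 3))
verts = out.vertices[0].detach().cpu().numpy()

# Sample points on the canonical mesh surface as initial Gaussian centers.
mesh = trimesh.Trimesh(verts, body.faces, process=False)
points, _ = trimesh.sample.sample_surface(mesh, 50_000)
np.save('canonical_init_points.npy', points)
```

As described in the earlier answer, the same canonicalized vertices also give the bounding box used for coordinate normalization.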

I hope it answers your questions!