soshishimada / DecafDatasetScript


How do you transform the hand and face meshes into the same coordinate system? #1

Open · haonanhe opened this issue 6 months ago

haonanhe commented 6 months ago

Thanks for your excellent work. I have a question about how you obtain the initial hand and head meshes and how you transform them into the same coordinate system. Specifically:

  1. Which models do you use to obtain the hand mesh (MeshTransformer or something else) and the face mesh (DECA or something else)?
  2. How do you transform them into the same coordinate system? Since they are obtained by different models, they could be in different coordinate systems.
  3. How do you get the camera parameters? Are they the camera parameters predicted by DECA, or are they calibrated in some way?
  4. How many views do you use for training? You mentioned in your paper that you collected 15 views for each subject, while the released dataset only includes 8 views. Is this dataset enough to train the DefConNet mentioned in your paper?

I would appreciate it if you answered my questions.

soshishimada commented 6 months ago

1. Which models do you use to get the hand mesh => We solve a multiview-based fitting optimization with gradient descent for the dataset. The FLAME face model and the MANO hand model are used to represent the meshes.
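
(A minimal sketch of what such a multiview fitting loop can look like, assuming PyTorch, precalibrated 3x4 projection matrices, and per-view 2D keypoint detections; `model_layer` stands in for a differentiable FLAME/MANO layer and is a hypothetical placeholder, not the actual Decaf fitting code.)

```python
# Hypothetical sketch of multi-view parametric model fitting with gradient descent.
# `model_layer` is a placeholder for a differentiable FLAME or MANO layer that maps
# parameters to 3D keypoints in the shared world frame.
import torch

def project(points_3d, P):
    """Project Nx3 world-frame points with a 3x4 projection matrix P."""
    homo = torch.cat([points_3d, torch.ones_like(points_3d[:, :1])], dim=1)  # Nx4
    uvw = homo @ P.T                                                         # Nx3
    return uvw[:, :2] / uvw[:, 2:3]

def fit_multiview(model_layer, init_params, proj_mats, keypoints_2d, iters=500, lr=1e-2):
    """
    model_layer(params) -> Nx3 keypoints in the world frame (assumed interface).
    proj_mats:    list of 3x4 projection matrices (intrinsics @ extrinsics), one per view.
    keypoints_2d: list of Nx2 detections, one per view.
    """
    params = init_params.clone().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        kpts_3d = model_layer(params)
        # Sum the 2D reprojection error over all calibrated views.
        loss = sum(
            ((project(kpts_3d, P) - kp2d) ** 2).mean()
            for P, kp2d in zip(proj_mats, keypoints_2d)
        )
        loss.backward()
        opt.step()
    return params.detach()
```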

  2. How do you transform them into the same coordinate system? => For the dataset, the face and hand are represented in the same world reference frame. The world frame is determined by placing a checkerboard in the recording studio during our camera calibration process. For our Decaf method, both the face and the hand are in the input camera frame, i.e., with zero translation and zero rotation.
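
(For illustration only: once the world-to-camera extrinsics `R`, `t` are known from calibration, moving world-frame vertices into a camera frame is a single rigid transform. This is a generic sketch, not the dataset code.)

```python
import numpy as np

def world_to_camera(verts_world, R, t):
    """Transform Nx3 world-frame vertices into the camera frame: X_cam = R @ X_world + t.
    Assumes R (3x3) and t (3,) are the world-to-camera extrinsics from calibration."""
    return verts_world @ R.T + t

# After this transform the mesh lives in the camera frame, i.e. the camera itself has
# zero rotation and zero translation, matching the convention described above.
```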

  3. How do you get the camera parameters? => For the dataset recording in our studio, we calibrate the cameras using a checkerboard.
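
(A generic way to do this kind of checkerboard calibration is OpenCV's `calibrateCamera`; the board size and image paths below are placeholders, not the actual recording-studio setup.)

```python
# Generic checkerboard calibration with OpenCV (not the authors' actual script).
import glob
import cv2
import numpy as np

PATTERN = (9, 6)       # inner corners of the checkerboard (placeholder)
SQUARE_SIZE = 0.025    # square edge length in meters (placeholder)

# 3D coordinates of the checkerboard corners in the board's own frame.
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_SIZE

obj_points, img_points = [], []
for path in glob.glob("calib_images/*.png"):   # placeholder path
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# Intrinsics (K, dist) plus per-image extrinsics (rvecs, tvecs) w.r.t. the board.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```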

  4. How many views do you use for training? => For the dataset recording, 15 views were used. For Decaf training, we used the provided 8 near-frontal views.

haonanhe commented 6 months ago

Thank you for your explanation. It's really helpful! I have another question. Does DefConNet learn the mapping from the input cropped images to the outputs (contact labels and deformations) without conditioning on the estimated vertices of the input images? Do you have any plans to release the code and checkpoints of your model? That would be really helpful.