snap-research / arielai_youtube_3d_hands

A dataset for 3D hand reconstruction in the wild.

Questions Regarding MANO fitting #4

Closed pablovela5620 closed 4 years ago

pablovela5620 commented 4 years ago

Hi, I have some questions about the iterative fitting used to create the dataset. After reading the paper, my understanding is that the fitting is split into two separate stages: the first optimizes the camera parameters and hand orientation, and the second optimizes the remaining parameters (pose and shape). I have the following questions:

  1. In Section 3 you explain that you optimize for the pose, shape, camera translation, and camera scaling. For the camera parameters specifically, you explain that you initialize them similarly to SMPLify-X. For clarity, does this mean that you are estimating the full extrinsic parameters (R, t), or just the camera translation (t)? In SMPLify-X the camera translation is initialized under the assumption that the person is standing straight, and similar triangles are used to estimate the depth. Is the equivalent done in your case, but with just the palm joints (non-MCP joints and wrist)?

  2. Is the camera translation equivalent to the mesh translation from the camera? That is, is T_delta how you translate the mesh away from the camera, and s how you scale to world coordinates? If so, why treat it as a camera translation rather than a mesh translation? Does the difference even matter?

  3. How do you initialize the camera's intrinsic parameters? Are the camera center and focal length assumed to be known values? If so, how can this give a good estimate for unconstrained in-the-wild images? Or are you using a weak-perspective camera model?
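
For reference, the similar-triangles depth initialization mentioned in question 1 can be sketched as below. This is only an illustration of the general idea, not the authors' code and not the SMPLify-X implementation: the function name `init_depth_similar_triangles`, the choice of reference segments, and the example numbers are all assumptions, and it assumes the reference segments lie roughly parallel to the image plane.

```python
import numpy as np


def init_depth_similar_triangles(joints_3d, joints_2d, pairs, focal_length):
    """Hypothetical helper: initial depth from z = f * (3D length) / (2D length).

    joints_3d    : (J, 3) model joints in model units (e.g. metres)
    joints_2d    : (J, 2) detected keypoints in pixels
    pairs        : list of (i, j) joint index pairs used as reference segments
    focal_length : focal length in pixels
    """
    joints_3d = np.asarray(joints_3d, dtype=float)
    joints_2d = np.asarray(joints_2d, dtype=float)
    i, j = np.asarray(pairs).T

    # Segment lengths in model space and in the image.
    len_3d = np.linalg.norm(joints_3d[i] - joints_3d[j], axis=1)
    len_2d = np.linalg.norm(joints_2d[i] - joints_2d[j], axis=1)

    # Ignore degenerate image segments to avoid dividing by zero.
    valid = len_2d > 1e-6
    if not np.any(valid):
        raise ValueError("all reference segments have zero 2D length")

    return focal_length * len_3d[valid].mean() / len_2d[valid].mean()


# Toy example: a 0.1-unit segment that spans 100 px under f = 5000 px.
z0 = init_depth_similar_triangles(
    joints_3d=[[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]],
    joints_2d=[[0.0, 0.0], [100.0, 0.0]],
    pairs=[(0, 1)],
    focal_length=5000.0,
)
```

For a hand, the reference segments would presumably connect the wrist and palm joints, since those stay close to rigid as the fingers articulate.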

Great work on this and I appreciate the help!

dkulon commented 4 years ago

Hi @pablovela5620,

Thank you for your interest in our paper.

Camera parameters are also optimised in the second stage, jointly with the remaining parameters.

  1. Scale is optimised only for the datasets with 3D hand pose annotations (evaluation benchmarks / additional training data). Initially, we optimise MANO global orientation and camera translation for 2D datasets based on the palm joints.

  2. They are equivalent. This particular implementation is just a design choice.

  3. The focal length is fixed to 5000. The center is set to the mean of the palm landmarks.
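
To make answers 2 and 3 concrete, here is a minimal sketch of a perspective projection with the focal length fixed to 5000 and the principal point set to the mean of the palm landmarks. It is an assumption-laden illustration, not the authors' implementation: the function names, the identity camera rotation, and the palm indices (wrist plus four MCP joints in an OpenPose-style 21-keypoint ordering) are hypothetical.

```python
import numpy as np

FOCAL_LENGTH = 5000.0  # fixed value quoted in the reply above


def principal_point_from_palm(keypoints_2d, palm_idx):
    """Camera center as the mean of the 2D palm landmarks (pixels)."""
    return np.asarray(keypoints_2d, dtype=float)[list(palm_idx)].mean(axis=0)


def project(points_3d, translation, center, focal_length=FOCAL_LENGTH):
    """Pinhole projection with identity rotation: x = f * (X + t)_xy / (X + t)_z + c."""
    pts = np.asarray(points_3d, dtype=float) + np.asarray(translation, dtype=float)
    return focal_length * pts[:, :2] / pts[:, 2:3] + np.asarray(center, dtype=float)


# Hypothetical inputs: 21 dummy 2D keypoints and one 3D point.
keypoints_2d = np.arange(42, dtype=float).reshape(21, 2)
palm_idx = (0, 5, 9, 13, 17)  # assumed indices of wrist + MCP joints
center = principal_point_from_palm(keypoints_2d, palm_idx)

verts = np.array([[0.01, 0.02, 0.0]])
t = np.array([0.0, 0.0, 5.0])

# "Camera translation": the camera carries t and the mesh stays at the origin.
uv_camera = project(verts, t, center)
# "Mesh translation": the mesh is moved by t and the camera has zero translation.
uv_mesh = project(verts + t, np.zeros(3), center)
```

Both calls yield the same pixel coordinates, which is the sense in which the two parameterisations are equivalent when the camera rotation is the identity.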

pablovela5620 commented 4 years ago

Great, thank you so much for answering my questions!