Hello, thank you for your interest in our work.
In img2pose, we assume fixed camera intrinsics, as described in the paper; we use the head's 6DoF pose to transform 3D points corresponding to the head, so I do not have a function ready for your specific use case.
t is added to the points after they are transformed by R. Take a look at the transform_points function, which might help you.
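A minimal sketch of that operation (my own illustration, not the repo's exact transform_points implementation; it assumes row-vector points):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def transform_points_sketch(points, pose):
    """Apply a 6DoF pose [rx, ry, rz, tx, ty, tz] to an (N, 3) point array."""
    R = Rotation.from_rotvec(pose[:3]).as_matrix()  # 3x3 rotation from the rotvec
    t = pose[3:]                                    # translation vector
    # Rotate first, then translate: p' = R p + t
    return points @ R.T + t
```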
As I am still struggling with this issue, I am asking again in the hopes someone can help me out.
I want to use the 6DoF pose to get a camera pose. The following sketch shows what I want to achieve:
I know I can obtain the camera rotation by inverting the rotation matrix of the pose, but I don't understand this model well enough to turn the translation of the pose into a translation of the camera.
Here's my current attempt in code with img2_pose being the pose provided by your model:
```python
import numpy as np
from scipy.spatial.transform import Rotation

rotation = np.zeros((4, 4))
# Invert the head rotation to get the camera rotation.
r = Rotation.from_rotvec(img2_pose[0:3]).as_matrix().T
face_pose = img2_pose[3:]
# Attempt to express the translation in the camera frame.
camera_pose = r.dot(face_pose)
rotation[0:3, 0:3] = r
rotation[0:3, 3] = camera_pose
rotation[3, :] = [0, 0, 0, 1]
```
By rendering an average face over my input pictures, I can test whether my approach works. It shows that the rotation is correct, but the translation is off.
I know this is not directly related to this project, but I feel I can't solve this issue because I do not understand the coordinate systems etc. of this paper well enough. If this is not the appropriate place to ask this question, feel free to close the issue again.
The units in the translation vector (tvec) are arbitrary units of the 3D face model used as a reference to annotate the images; they do not represent pixels or other human-interpretable units. Their reference point is the center of the image, and they are consistent across images; the camera intrinsics are what change with the image dimensions.
The tvec is the same as the output of SolvePnP from OpenCV, and you can read more about it here https://docs.opencv.org/master/d9/d0c/group__calib3d.html#ga549c2075fac14829ff4a58bc931c033d
To get the camera position, you can follow the posts below; they explain how it can be done.
https://stackoverflow.com/questions/18637494/camera-position-in-world-coordinate-from-cvsolvepnp
https://answers.opencv.org/question/64315/solvepnp-object-to-camera-pose/
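In code, the conversion those posts describe looks roughly like this (a sketch, not a function from this repo; camera_from_pose is a name made up for illustration):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def camera_from_pose(pose):
    """Build a 4x4 camera-to-world matrix from a 6DoF solvePnP-style pose."""
    R = Rotation.from_rotvec(pose[:3]).as_matrix()  # world-to-camera rotation
    t = pose[3:]                                    # world-to-camera translation
    cam_to_world = np.eye(4)
    cam_to_world[:3, :3] = R.T        # camera orientation is the inverse rotation
    cam_to_world[:3, 3] = -R.T @ t    # camera position: C = -R^T t
    return cam_to_world
```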
Hope this helps.
Is the rotvec also a result of SolvePnP?
I think I know how to solve my issue in theory, but I'm struggling with the transformations from one coordinate frame to another. This is of course my own issue to solve, and I don't expect you to help me with it; I just want to be certain I understand the output of your model correctly, so that I can remove any uncertainties.
Am I correct in assuming that the output of your model is the pose and position of the face in the object frame, and that it is in OpenCV format, not OpenGL?
What I need is a camera-to-world transform (in OpenGL format) corresponding to the pose, which I think is what you use when you render the 3D face over the image. If that is the case, shouldn't I be able to simply take the matrix you use for rendering and convert it to the OpenGL format?
By following the steps outlined in the links you gave me and converting the result to OpenGL format, I arrive at a correct rotation matrix, but the translation is always off. This might be due to a bug in my code or to me misunderstanding the output of your model; I just want to rule out the latter.
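For concreteness, this is the convention conversion I am applying after building the camera-to-world matrix (assuming the usual axis flip between OpenCV's y-down/z-forward and OpenGL's y-up/z-backward camera frames):

```python
import numpy as np

# Flip the camera's y and z axes to go from OpenCV to OpenGL convention.
CV_TO_GL = np.diag([1.0, -1.0, -1.0, 1.0])

def opencv_to_opengl(cam_to_world):
    # The camera position (last column) is unchanged; only the axes flip.
    return cam_to_world @ CV_TO_GL
```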
P.s.: Thank you for helping me out, your answers have already improved my understanding of the problem by a lot!
I'm adding some of my renderings that illustrate the problem. This is the result when I follow the steps outlined in the links you provided:
Yes, the rotvec is also a result of SolvePnP. The entire output of img2pose is in the same format as SolvePnP, which is in OpenCV format.
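So the first three values can be treated exactly like solvePnP's rvec; for example, these two conversions give the same rotation matrix (assuming img2_pose holds the 6DoF output):

```python
import numpy as np
import cv2
from scipy.spatial.transform import Rotation

rotvec = np.asarray(img2_pose[:3])  # axis-angle vector, same convention as solvePnP's rvec

R_scipy = Rotation.from_rotvec(rotvec).as_matrix()
R_cv, _ = cv2.Rodrigues(rotvec)     # OpenCV's Rodrigues yields the same 3x3 matrix
```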
From the example you sent, it looks like tx is off, but ty and tz seem correct (or at least not as off as tx). Is the prediction correct before the conversion? I mean, if you render the originally estimated pose, how does it look? If you haven't checked this yet, you can do it by using this notebook.
Closing this issue due to inactivity.
I want to use img2pose to extract the camera pose (for the purpose of using it as the input for NeRF). On page 3 of the paper it is stated that this can be obtained from the 6DoF pose h by "standard means," but I'm struggling to figure out how this is done. I'm especially struggling with how to determine the t of the [R|t] matrix; I did manage to extract the R.
In short: How can one obtain a camera pose from the output 6DoF pose of this model?
Edit: My specific use case is an input video of a single talking head; I would like to get a camera pose determined by the head pose for each frame; i.e. interpret head movement as camera movement instead.