lioryariv / volsdf


Some questions about rend_util.py #12

Closed DavidXu-JJ closed 1 year ago

DavidXu-JJ commented 1 year ago

Hi, thank you for your great work. I have been trying to follow it recently and ran into some problems that I hope can be answered in this issue.

  1. First question: In the function `load_K_Rt_from_P` at line 48 of rend_util.py: https://github.com/lioryariv/volsdf/blob/a974c883eb70af666d8b4374e771d76930c806f3/code/utils/rend_util.py#L48-L50 This code really confuses me and I'm not able to come up with an explanation for it. I read the following code at line 78 of rend_util.py: https://github.com/lioryariv/volsdf/blob/a974c883eb70af666d8b4374e771d76930c806f3/code/utils/rend_util.py#L73-L78 It seems that you use `pose` as a cameraToWorld matrix. I did an experiment beforehand; the following code is from Stack Overflow:
    
    import cv2
    import numpy as np

    k = np.array([[631,   0, 384],
                  [  0, 631, 288],
                  [  0,   0,   1]])
    r = np.array([[-0.30164902,  0.68282439, -0.66540117],
                  [-0.63417301,  0.37743435,  0.67480953],
                  [ 0.71192167,  0.6255351 ,  0.3191761 ]])
    t = np.array([ 3.75082481, -1.18089565,  1.06138781])

    C = np.eye(4)
    C[:3, :3] = k @ r
    C[:3, 3] = k @ r @ t

    out = cv2.decomposeProjectionMatrix(C[:3, :])

If I convert `r` and `t` into homogeneous coordinates and take `R@T`, which is the `worldToCamera` matrix, I get:

    T = np.eye(4)
    T[:3, 3] = t
    R = np.eye(4)
    R[:3, :3] = r

    R @ T
    array([[-0.30164902,  0.68282439, -0.66540117, -2.64402567],
           [-0.63417301,  0.37743435,  0.67480953, -2.10814783],
           [ 0.71192167,  0.6255351 ,  0.3191761 ,  2.27037141],
           [ 0.        ,  0.        ,  0.        ,  1.        ]])

Then if I take the inverse of `R@T`, which I think is the `cameraToWorld` matrix, I get:

    np.linalg.inv(R @ T)
    array([[-0.30164902, -0.63417301,  0.71192166, -3.75082481],
           [ 0.6828244 ,  0.37743435,  0.6255351 ,  1.18089565],
           [-0.66540117,  0.67480953,  0.3191761 , -1.06138781],
           [ 0.        ,  0.        ,  0.        ,  1.        ]])


This result suggests that, to get the `cameraToWorld` matrix, we should concatenate `R^(-1)` and `-T`, instead of `R^(-1)` and `T` as is done at line 31 in rend_util.py:
https://github.com/lioryariv/volsdf/blob/a974c883eb70af666d8b4374e771d76930c806f3/code/utils/rend_util.py#L48-L50
I don't understand why `R^(-1)` and `T` are used here.
  2. Second question: In the function `lift` at line 96 of rend_util.py: https://github.com/lioryariv/volsdf/blob/a974c883eb70af666d8b4374e771d76930c806f3/code/utils/rend_util.py#L96-L109 I don't understand why `x_lift` takes `y` and `fy` into account. It seems that `sk` should be 0, but when I checked at runtime I got:
    intrinsics
    tensor([[[ 2.8923e+03, -2.1742e-04,  8.2320e+02,  0.0000e+00],
         [ 0.0000e+00,  2.8832e+03,  6.1907e+02,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00]]],
       device='cuda:0')

    It seems that sk is not 0. So the transformation becomes:

$$ \begin{bmatrix} x'\\y'\\z \end{bmatrix}= \begin{bmatrix} f_x&sk&c_x&0\\ 0&f_y&c_y&0\\ 0&0&1&0 \end{bmatrix} \begin{bmatrix} x\_lift\\y\_lift\\z\\1 \end{bmatrix} $$

Here `[x_lift, y_lift, z, 1]` is the point in camera coordinates. I find that:

$$ x'=f_x \cdot x\_lift + sk \cdot y\_lift + c_x \cdot z $$

The actual result for `x_lift` should therefore be:

$$ x\_lift = \cfrac{x'-c_x \cdot z}{f_x} - \cfrac{sk \cdot y\_lift}{f_x} $$

But in rend_util.py, `x_lift` is computed as:

$$ x\_lift = \cfrac{(x'-c_x)\cdot z}{f_x} - \cfrac{sk \cdot y\_lift}{f_x} $$

So when z=1, the code is correct. Would it be better to simply change it to the following (a small numerical check of the z=1 case is sketched right after this question)?

    x_lift = (x / z - cx.unsqueeze(-1)
              + cy.unsqueeze(-1) * sk.unsqueeze(-1) / fy.unsqueeze(-1)
              - sk.unsqueeze(-1) * y / fy.unsqueeze(-1)) / fx.unsqueeze(-1) * z

(a `/ z` is added to `x`)
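
For concreteness, here is the small check I have in mind for the z=1 case. This is a standalone sketch of my own: the intrinsic values are made up (only loosely based on the tensor printed above), and the formulas are a scalar rewrite of the linked `lift` code as I read it.

    import numpy as np

    # Made-up intrinsics with a small skew term.
    fx, fy, cx, cy, sk = 2892.3, 2883.2, 823.2, 619.07, -2.1742e-4
    K = np.array([[fx, sk, cx],
                  [0., fy, cy],
                  [0., 0., 1.]])

    u, v, z = 512.3, 300.7, 1.0   # pixel coordinates and a depth of 1

    # Scalar rewrite of the x_lift / y_lift formulas in rend_util.py:
    y_lift = (v - cy) / fy * z
    x_lift = (u - cx + cy * sk / fy - sk * v / fy) / fx * z

    # Reference: back-project the homogeneous pixel [u, v, 1] with K^-1.
    ref = np.linalg.inv(K) @ np.array([u, v, 1.0])

    print(np.allclose([x_lift, y_lift, z], ref))   # prints True for z = 1

With z fixed to 1, the existing formula and the plain `K^(-1)` back-projection agree.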

The first question matters more to me than the second. Would you please explain the logic of the pose matrix to me?

I hope this issue will help other people as well.

I have tried my best to express my questions as clearly as possible. If anything is unclear or wrong on my part, please let me know.

DavidXu-JJ commented 1 year ago

The answer to the confusing Problem 1 is figured out: https://github.com/lioryariv/volsdf/blob/a974c883eb70af666d8b4374e771d76930c806f3/code/utils/rend_util.py#L78-L79 Here at line 79, the camera location is set to the T vector: https://github.com/lioryariv/volsdf/blob/a974c883eb70af666d8b4374e771d76930c806f3/code/utils/rend_util.py#L63 However, the actual camera location is at the -T vector. What matters in this function is the relative position between the pixel location and the camera location, so the cameraToWorld matrix doesn't need to take -T as its translation part.

I still hold my opinion on Problem 2, but since it's not a crucial part, I'm closing this issue. Finally, I'm sorry for the repeated opening and closing of this issue (I'm not very familiar with how GitHub issues work).
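
For anyone who wants to check this numerically, here is a self-contained sketch built on the experiment from my first post. The variable names are my own; it relies on `cv2.decomposeProjectionMatrix` returning the camera center as a homogeneous 4x1 vector in its translation output, which is what I observe here.

    import cv2
    import numpy as np

    # Same k, r, t, C as the experiment in my first post.
    k = np.array([[631.,   0., 384.],
                  [  0., 631., 288.],
                  [  0.,   0.,   1.]])
    r = np.array([[-0.30164902,  0.68282439, -0.66540117],
                  [-0.63417301,  0.37743435,  0.67480953],
                  [ 0.71192167,  0.6255351 ,  0.3191761 ]])
    t = np.array([3.75082481, -1.18089565, 1.06138781])

    C = np.eye(4)
    C[:3, :3] = k @ r
    C[:3, 3] = k @ r @ t

    out = cv2.decomposeProjectionMatrix(C[:3, :])
    K_dec, R_dec, t_dec = out[0], out[1], out[2]

    # t_dec is a 4x1 homogeneous vector; dividing by its last entry gives
    # the camera center in world coordinates, which for this construction
    # (x = K R (X + t)) sits at -t.
    cam_center = (t_dec[:3] / t_dec[3]).ravel()
    print(np.allclose(cam_center, -t, atol=1e-6))                 # True

    # Building the pose the way load_K_Rt_from_P does (transposed rotation
    # plus the decomposed translation) then matches np.linalg.inv(R @ T):
    pose = np.eye(4)
    pose[:3, :3] = R_dec.T
    pose[:3, 3] = cam_center

    R4, T4 = np.eye(4), np.eye(4)
    R4[:3, :3], T4[:3, 3] = r, t
    print(np.allclose(pose, np.linalg.inv(R4 @ T4), atol=1e-6))   # True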

raynehe commented 1 year ago

@DavidXu-JJ Hi! Sorry to bother you. I encountered a similar problem related to the DTU dataset's coordinate system convention, and I'm wondering if you know about it.

My dataset follows NeRF's coordinate system convention, that is, the OpenGL convention (x-axis to the right, y-axis upward, and z-axis backward along the camera's focal axis).

My issue is that if I apply my dataset to VolSDF directly, the computed ray_dir is incorrect. I think the problem is in the rotation matrix; DTU/BlendedMVS might follow a different convention. But I couldn't find anything about the coordinate system convention of the DTU dataset. Do you know about this?

Thank you very much!

DavidXu-JJ commented 1 year ago

@raynehe If I'm not mistaken, I remember that most datasets follow the OpenCV coordinate convention. Maybe you can try simply flipping the y and z axes, as in the sketch below. I'm sorry if my suggestion doesn't help or is wrong.
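
For reference, a minimal sketch of what that axis flip usually looks like for a 4x4 cameraToWorld matrix (my own sketch, assuming the OpenGL convention described above; this is not code from the VolSDF repo):

    import numpy as np

    # Flip the camera's y and z axes: OpenGL (x right, y up, z backward)
    # -> OpenCV (x right, y down, z forward).
    FLIP_YZ = np.diag([1.0, -1.0, -1.0, 1.0])

    def opengl_to_opencv(c2w_gl: np.ndarray) -> np.ndarray:
        """Convert a 4x4 cameraToWorld pose from OpenGL to OpenCV convention."""
        # Right-multiplying changes the camera-frame axes while leaving
        # the camera position (the translation column) untouched.
        return c2w_gl @ FLIP_YZ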