Question on the cameras and the projection matrices

MatteoFusconi commented 4 months ago

In my use case, I need to reproject the points from the camera pixels to the world reference frame, given also the depth taken from the rasterization of the gaussian.

Until now I was using the FoV variables and the extrinsics, and performing the calculations of the Pinhole camera model, but I obtained wrong results.

Is it something to do with the projection matrix taken from getProjectionMatrix? Could someone give me some clarifications on what the projectionmatrix and fullprojectionmatrix are, and why are they 4x4?

Any help would be much appreciated

PanagiotisP commented 4 months ago

Camera models are always a pain to get right. I haven't bothered about the intrinsic at all, so I cannot help you with that. Regarding the world to pixel space though I think I can help. First of all, you have the world_view_transform that simply transforms points from world to view/camera space. Then you have the projection_matrix that transforms points from view/camera to NDC space. The full_proj_transform is just the combination of the two. It transforms a point from world to NDC space.

Regarding the getProjectionMatrix, the code uses the OpenGL projection matrix. A very nice blog about that can be found here. However, there are two differences with the matrix shown on that blog. The first is that the used sign is the positive one (so entries [2,2] and [3,2] have a positive sign). Also, instead of using a cube from [-1, 1] in all coordinates for the NDC space, the z coordinate spans just [0,1]. So at entry [2,2] instead of (f + n) / (f - n), you have f / (f - n). As for the 4x4 shape, it is as such to handle homogenous coordinates.

I also want to point out that OpenGL uses column-major matrices, so they might be the transposed version of what you would expect (this is covered in the blog too). Hope I helped

MatteoFusconi commented 4 months ago

Thank you for the answer, now it is way clearer.

It is still not clear to me what is the point of projecting in the NDC space. Does the rasterizer need to have the gaussians in NDC in order to work? Because the points can still be projected from the image space to the world space without passing through the NDC

Thanks a lot for your availability

PanagiotisP commented 4 months ago

NDC is nothing more than a 3D representation of the distorted (after the perspective transform) space. Definitely, you could skip it, but in general, I find it useful because it is still a 3D representation (you have a sense of depth), in which you have the comforts of orthographic projection (rays are parallel to each other, ray direction is just [0, 0, +-1], hit and occlusion tests are trivial etc).

lihao2333 commented 4 months ago

Camera models are always a pain to get right. I haven't bothered about the intrinsic at all, so I cannot help you with that. Regarding the world to pixel space though I think I can help. First of all, you have the world_view_transform that simply transforms points from world to view/camera space. Then you have the projection_matrix that transforms points from view/camera to NDC space. The full_proj_transform is just the combination of the two. It transforms a point from world to NDC space.

Regarding the getProjectionMatrix, the code uses the OpenGL projection matrix. A very nice blog about that can be found here. However, there are two differences with the matrix shown on that blog. The first is that the used sign is the positive one (so entries [2,2] and [3,2] have a positive sign). Also, instead of using a cube from [-1, 1] in all coordinates for the NDC space, the z coordinate spans just [0,1]. So at entry [2,2] instead of (f + n) / (f - n), you have f / (f - n). As for the 4x4 shape, it is as such to handle homogenous coordinates.

I also want to point out that OpenGL uses column-major matrices, so they might be the transposed version of what you would expect (this is covered in the blog too). Hope I helped

I think you missed the third difference that your camera z axis is opposite with opengl camera.
Otherwise, the projection matrix should be this, I think.

PanagiotisP commented 4 months ago

The sign I mentioned affects just the z axis. The code is pretty clear on how the sign is used. To make it clear, the projection matrix that post mentions is

\begin{bmatrix}
\dfrac{2n}{r-l} & 0 & \dfrac{r+l}{r-l} & 0\\
0 & \dfrac{2n}{t-b} & \dfrac{t+b}{t-b} & 0\\
0 & 0 & -\dfrac{f+n}{f-n} & -\dfrac{2fn}{f-n}\\
0 & 0 & -1 & 0\\
\end{bmatrix}

while this code uses:

\begin{bmatrix}
\dfrac{2n}{r-l} & 0 & \dfrac{r+l}{r-l} & 0\\
0 & \frac{2n}{t-b} & \dfrac{t+b}{t-b} & 0\\
0 & 0 & \dfrac{f}{f-n} & -\dfrac{fn}{f-n}\\
0 & 0 & 1 & 0\\
\end{bmatrix}

lihao2333 commented 4 months ago

If we put $z=-n$ in your matrix then we get $\frac{\frac{f}{f-n}(-n) - \frac{fn}{f-n}}{-n} = \frac{2f}{f-n} \neq 0$.
So I think your projection is not meat to map z from (-n, -f) into (0, 1).
Instead, you may map (n, f) into (0, 1).
In other world, your camera z axis is opposite to opengl camera z axis.

lihao2333 commented 4 months ago

I find a way to understand your projection matrix.
Opengl projection matrix is response for mapping frustum into NDC cube and let $w_n=-z_c$, that is:

x: [l, r] -> [-1, 1] 
y: [b, t] -> [-1, 1]
z: [-n, -f] -> [-1, 1].

And your projection matrix is response for mapping the frustum which is symmetric about the origin into NDC cube and only map z to [0, 1] and let $w_n=z_c$, that is:

x: [-r, -l] -> [-1, 1] 
y: [-t, -b] -> [-1, 1]
z: [n, f] -> [0, 1].

Here is my derivation:

While $x_c$ means x coord in camera coord and $x_n$ means x coord in homogenous NDC coord and $x_n'$ means x coord in NDC coord.

I tried may ways to derive your projection matrix and only this way of understanding works. Please help me if I have miss understanding.

PanagiotisP commented 4 months ago

If we put z=−n in your matrix then we get ff−n(−n)−fnf−n−n=2ff−n≠0. So I think your projection is not meat to map z from (-n, -f) into (0, 1). Instead, you may map (n, f) into (0, 1). In other world, your camera z axis is opposite to opengl camera z axis.

Ok, I see where the mixup is. You are right, the camera space here, unlike OpenGL, has a positive z-axis, so the boundaries are [n, f]. I'm sorry, I didn't even know that OpenGL had a negative z-axis even for the camera space. I thought it was only the NDC/clip space.

zzg-zzg commented 3 months ago

I find a way to understand your projection matrix. Opengl projection matrix is response for mapping frustum into NDC cube and let wn=−zc, that is:
x: [l, r] -> [-1, 1] 
y: [b, t] -> [-1, 1]
z: [-n, -f] -> [-1, 1]. 
And your projection matrix is response for mapping the frustum which is symmetric about the origin into NDC cube and only map z to [0, 1] and let wn=zc, that is:
x: [-r, -l] -> [-1, 1] 
y: [-t, -b] -> [-1, 1]
z: [n, f] -> [0, 1]. 
Here is my derivation:

While xc means x coord in camera coord and xn means x coord in homogenous NDC coord and xn′ means x coord in NDC coord.

I tried may ways to derive your projection matrix and only this way of understanding works. Please help me if I have miss understanding.

I have tried many ways to understand this function, and your answer is the most convincing explanation I've seen so far.

JianxiangL commented 2 months ago

Hi bro, I’ve been struggling with projection calculations for a while. Could you help me with some code? I’d appreciate any assistance you can offer.

        R = np.array(tar_ext[:3, :3], np.float32).reshape(3, 3).transpose(1, 0)
        T = np.array(tar_ext[:3, 3], np.float32)
        for i in range(cfg.mvsgs.cas_config.num):
            h, w = H*cfg.mvsgs.cas_config.render_scale[i], W*cfg.mvsgs.cas_config.render_scale[i]
            tar_ixt_ = tar_ixt.copy()
            tar_ixt_[:2,:] *= cfg.mvsgs.cas_config.render_scale[i]
            FovX = data_utils.focal2fov(tar_ixt_[0, 0], w)
            FovY = data_utils.focal2fov(tar_ixt_[1, 1], h)
            projection_matrix = data_utils.getProjectionMatrix(znear=self.znear, zfar=self.zfar, K=tar_ixt_, h=h, w=w).transpose(0, 1)
            world_view_transform = torch.tensor(data_utils.getWorld2View2(R, T, np.array(self.trans), self.scale)).transpose(0, 1)
            full_proj_transform = (world_view_transform.unsqueeze(0).bmm(projection_matrix.unsqueeze(0))).squeeze(0)
            camera_center = world_view_transform.inverse()[3, :3]
            camera_center = world_view_transform.inverse()[3, :3]
            novel_view_data = {
                'FovX':  torch.FloatTensor([FovX]),
                'FovY':  torch.FloatTensor([FovY]),
                'width': w,
                'height': h,
                'world_view_transform': world_view_transform,
                'full_proj_transform': full_proj_transform,
                'camera_center': camera_center
            }
            ret[f'novel_view{i}'] = novel_view_data

As far as I know, the camera's extrinsic parameters represent the camera's pose relative to the world coordinate system. In this code:

full_proj_transform = (world_view_transform.unsqueeze(0).bmm(projection_matrix.unsqueeze(0))).squeeze(0)

world_view_transform represents the camera's pose relative to the world coordinate system.Since this transformation is intended to project from the world coordinate system to the screen coordinate system, I think the order of matrix multiplication might be incorrect. It should be:

full_proj_transform = (projection_matrix.unsqueeze(0).bmm(world_view_transform.unsqueeze(0))).squeeze(0)

Could you help me with this?

Pamimi commented 2 months ago

Hi @JianxiangL, may you find something helpful in #506, I have tested this camera class, and it is right.

graphdeco-inria / gaussian-splatting

Question on the cameras and the projection matrices #826