szymanowiczs / splatter-image

Official implementation of "Splatter Image: Ultra-Fast Single-View 3D Reconstruction", CVPR 2024
https://szymanowiczs.github.io/splatter-image
BSD 3-Clause "New" or "Revised" License

question regarding the computation of inferred depth #18

Closed Torment123 closed 6 months ago

Torment123 commented 7 months ago

Hi, thank you for the great work. I'm looking at the code, and there's one place I'm a bit confused about: line 710 of splatter-image/scene/gaussian_predictor.py reads:

pos = ray_dirs_xy * depth + offset

But based on my understanding, shouldn't it be:

pos = ray_dirs_xy * depth + coordinate_of_camera_center + offset ?

since we need to add the camera center's coordinate to get the position of the point, and then adjust it with the offset. Please let me know if I got it wrong, thanks.

johnren-code commented 7 months ago

Hi, from what I understand, the Gaussian parameters predicted by gaussian_predictor are independent of the position of the observing camera; the camera position is only used later, when rendering the predicted Gaussians into an image.

Torment123 commented 7 months ago

Hi, thanks for the reply. But since z_near and z_far are relative to the camera center, and the depth d of the Gaussians is expressed in terms of z_near and z_far (depth = self.depth_act(depth_network) * (self.cfg.data.zfar - self.cfg.data.znear) + self.cfg.data.znear), I think the depth is relative to the camera center as well. So when computing the absolute coordinates of the Gaussians in camera space, the coordinate of the camera center needs to be added, just like in NeRF, where the ray origin is added when computing the coordinates of the sampled points along a rendered ray.
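As a side note, here is a minimal sketch of how the quoted line maps a raw network output into the [znear, zfar] range. It assumes depth_act is a sigmoid-style activation and uses made-up znear/zfar values, so it only illustrates the formula, not the repo's exact configuration:

```python
import torch

# Illustration only: a sigmoid-style activation squashes the raw prediction to (0, 1),
# which is then rescaled to the (hypothetical) [znear, zfar] range.
znear, zfar = 0.8, 1.8                       # hypothetical near/far planes
depth_act = torch.sigmoid

depth_network = torch.randn(4, 1, 128, 128)  # raw per-pixel depth prediction
depth = depth_act(depth_network) * (zfar - znear) + znear

# Every predicted depth now lies between znear and zfar, measured from the camera.
assert depth.min() >= znear and depth.max() <= zfar
```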

johnren-code commented 7 months ago

Oh, I see your confusion. Please review the following code at lines 782-788; it might solve your problem.

# Pos prediction is in camera space - compute the positions in the world space
        pos = self.flatten_vector(pos)
        pos = torch.cat([pos, 
                         torch.ones((pos.shape[0], pos.shape[1], 1), device="cuda", dtype=torch.float32)
                         ], dim=2)
        pos = torch.bmm(pos, source_cameras_view_to_world)
        pos = pos[:, :, :3] / (pos[:, :, 3:] + 1e-10)
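
A minimal, self-contained sketch of what this block does (the shapes and the cam-to-world matrix below are made up; the torch.bmm(pos, ...) call suggests a row-vector convention, where the camera centre sits in the last row of the matrix rather than the last column):

```python
import torch

# Hypothetical shapes: (B, N, 3) camera-space points and a (B, 4, 4) cam-to-world
# matrix laid out for row-vector multiplication, i.e. p_world = p_cam_homogeneous @ M.
B, N = 2, 5
pos_cam = torch.rand(B, N, 3)                             # Gaussian centres in camera space

view_to_world = torch.eye(4).repeat(B, 1, 1)
view_to_world[:, 3, :3] = torch.tensor([1.0, 2.0, 3.0])   # made-up camera centre in world space

pos_h = torch.cat([pos_cam, torch.ones(B, N, 1)], dim=2)  # lift to homogeneous coordinates
pos_world = torch.bmm(pos_h, view_to_world)               # batched row-vector transform
pos_world = pos_world[:, :, :3] / (pos_world[:, :, 3:] + 1e-10)

# The camera-space origin maps exactly to the camera centre, which is why the centre
# never has to be added explicitly while working in camera space.
origin_h = torch.tensor([[[0.0, 0.0, 0.0, 1.0]]]).repeat(B, 1, 1)
print(torch.bmm(origin_h, view_to_world)[:, :, :3])       # [[1., 2., 3.]] per batch element
```
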
Torment123 commented 7 months ago

Thanks, I think I got it.

Torment123 commented 7 months ago

I have another question. I see that when calculating the position of the Gaussians (pos = ray_dirs_xy * depth + offset), the direction vectors (ray_dirs_xy) aren't normalized to unit length (see line 544), whereas in the NeRF code the direction vectors are normalized. What's the reason for this difference? In my opinion the unnormalized treatment is correct: the z coordinate of the direction vector stays at 1, so multiplying by the target depth value p gives a 3D point with the correct z value (p x 1); in the normalized case, multiplying by p lands on a point whose z coordinate is less than p.
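A tiny numerical check of this argument, with a made-up pixel direction (purely illustrative, not code from the repo):

```python
import torch

p = 2.0                                      # target depth for this pixel
ray_dir = torch.tensor([0.3, -0.2, 1.0])     # unnormalized direction, z component kept at 1

print((ray_dir * p)[2])                      # 2.0  -> z equals the predicted depth

ray_dir_unit = ray_dir / ray_dir.norm()      # NeRF-style unit-length direction
print((ray_dir_unit * p)[2])                 # ~1.88 -> z falls short of the depth
```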

johnren-code commented 7 months ago

Hi, I haven't looked at the original NeRF code, but I have looked at the neural-rendering parts of works related to NeRF, where the ray direction is indeed normalized; there, however, the ray direction is obtained by connecting the pixel's coordinates in the world coordinate system with the camera position.

In the splatter-image source code, the depth predictions are made in the camera coordinate system, and znear and zfar are also relative to the camera coordinate system. As you say, the z coordinate of the ray direction is always 1, which may be to make the depth information easier to handle. pos = ray_dirs_xy * depth + offset gives pos, the position of the Gaussian center in the camera coordinate system; it is then multiplied by a cam2world matrix to get the coordinates of the Gaussian center in the world coordinate system.

I think these are just two different implementations: NeRF predicts directly in the world coordinate system, whereas splatter image predicts in the camera coordinate system first and then transforms to the world coordinate system. (ps: I am also a beginner in 3D neural reconstruction, so my opinion may not be correct, but I hope it is useful for you :) )

Torment123 commented 6 months ago

Thanks for your detailed reply. I just found a discussion of this problem that I think explains it well: https://github.com/yenchenlin/nerf-pytorch/issues/76

johnren-code commented 6 months ago

Hi, do you know if splatter image can be trained with higher-resolution images, like 256 or 512? If so, do I need to make changes to the network architecture?

Torment123 commented 6 months ago

Hi, you can change the image resolution through the config file; the code should be directly runnable. But I'm not sure whether it can reach the same performance as 128x128.

szymanowiczs commented 6 months ago

Hi @Torment123, @johnren-code it seems like you clarified most of your concerns, but here's a summary in case it's useful:

pos = ray_dirs_xy * depth + offset

and

pos = ray_dirs_xy * depth + coordinate_of_camera_center + offset

are actually equivalent: the prediction is made in camera space, where the camera centre sits at the origin, so coordinate_of_camera_center is zero. These are then transformed to world space with

# Pos prediction is in camera space - compute the positions in the world space
        pos = self.flatten_vector(pos)
        pos = torch.cat([pos, 
                         torch.ones((pos.shape[0], pos.shape[1], 1), device="cuda", dtype=torch.float32)
                         ], dim=2)
        pos = torch.bmm(pos, source_cameras_view_to_world)
        pos = pos[:, :, :3] / (pos[:, :, 3:] + 1e-10)

The ray directions are not normalised because the network predicts depth, not distance along the ray (they are slightly different: depth is the distance along the camera's optical axis, i.e. the z coordinate in camera space, whereas distance is measured along the ray itself). Depth is normally easier for neural networks to predict. NeRF uses ray directions to sample points along rays geometrically, and its network never actually sees or predicts depth, only the xyz coordinates obtained as xyz = ray_dir * distance_along_ray. Hopefully this clarifies it a bit.
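To make the distinction concrete, here is a small sketch with made-up intrinsics (fx, fy, cx, cy are hypothetical, and the offset term is dropped). With the z component of every ray direction fixed to 1, the z coordinate of each predicted point equals the predicted depth, and the distance along the ray is that depth scaled by the direction's length:

```python
import torch

H = W = 4
fx = fy = 2.0                                 # hypothetical focal lengths
cx, cy = (W - 1) / 2, (H - 1) / 2             # hypothetical principal point

# Unnormalised per-pixel ray directions with the z component fixed to 1.
ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                        torch.arange(W, dtype=torch.float32), indexing="ij")
ray_dirs = torch.stack([(xs - cx) / fx, (ys - cy) / fy, torch.ones_like(xs)], dim=-1)

depth = torch.full((H, W, 1), 1.5)            # pretend per-pixel depth prediction
pos = ray_dirs * depth                        # (the real model also adds an offset)

print(torch.allclose(pos[..., 2:], depth))    # True: the z coordinate is exactly the depth

# Distance along the ray = depth * ||ray_dir||, so it matches the depth only on the optical axis.
dist_along_ray = depth.squeeze(-1) * ray_dirs.norm(dim=-1)
```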

Training at higher resolution should work directly. There might be a hard-coded 128 resolution here and there in the code, so look out for that, but it should be easy to change if you wish.

johnren-code commented 6 months ago

Hello, sorry to bother you again. Do you know why the parameters znear, zfar and the scaling of the Gaussian parameters are set the way they are for the cars and chairs in the synthetic dataset? And how are znear and zfar obtained for the synthetic dataset? I'm currently trying to retrain splatter image on my own generated synthetic dataset, but I'm not quite sure how these parameters should be set, and whether their settings have a large impact on the experimental results.

PhilipsDeng commented 1 month ago

Hi, I'm having the same question here too. Do you already have an answer?