ken2576 / vision-nerf

Official PyTorch Implementation of paper "Vision Transformer for NeRF-Based View Synthesis from a Single Input Image", WACV 2023.

Seems to still need the pose information of the single input image #5

Closed SeaBird-Go closed 1 year ago

SeaBird-Go commented 1 year ago

Hi, thanks for sharing this work.

As you mentioned in the paper, Vision NeRF can synthesize novel views conditioned on a single unposed input image. However, from the code in render_ray.py, it seems the pose information of the source image is still required.

Could you point out whether I've misunderstood something?

caiyongqi commented 1 year ago

I also noticed it. In Section 4.3: "Note that our method does not require any camera pose as input, which is often difficult to obtain from real images."

ken2576 commented 1 year ago

Hi,

You only need a dummy input such as an identity matrix, torch.eye(4). It is required because we need to calculate the relative pose between the source and the target. Please try using an identity matrix as your input src_c2w_mats and let me know if you have any other issues.
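
For illustration, a minimal sketch of what this dummy input might look like (variable names and shapes here are assumptions, not the repo's exact API):

```python
import torch

batch_size = 1

# Dummy source camera-to-world pose: the identity matrix suggested above.
src_c2w_mats = torch.eye(4).unsqueeze(0).repeat(batch_size, 1, 1)  # [B, 4, 4]

# The target pose must then be expressed relative to the source camera, e.g.
# tgt_c2w_rel = inverse(src_c2w_gt) @ tgt_c2w_gt if ground-truth poses exist,
# or whatever relative transform you want to render from.
tgt_c2w_rel = torch.eye(4).unsqueeze(0).repeat(batch_size, 1, 1)   # [B, 4, 4]
```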

SeaBird-Go commented 1 year ago

@ken2576 Thanks for your clarification.

So you mean that if I set src_c2w_mats to a fixed identity matrix, the results will be almost the same? I'm trying this experiment to see whether there is any performance degradation.

ken2576 commented 1 year ago

Yes, we didn't set up any canonical coordinates, so it shouldn't matter as long as the relative pose is the same. In fact, we calculate it here: https://github.com/ken2576/vision-nerf/blob/main/models/projection.py#L84
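
As a toy illustration of why only the relative pose matters (a standalone sketch, not the repo's code): if you apply the same rigid transform to the scene points and to the source camera, the points expressed in the source camera frame are unchanged.

```python
import torch

def to_src_camera(points_world, src_c2w):
    """Transform homogeneous world points [N, 4] into the source camera frame."""
    w2c = torch.inverse(src_c2w)          # world -> source camera
    return (w2c @ points_world.T).T

# Move the whole scene (points and camera pose) by an arbitrary rigid transform G.
G = torch.eye(4)
G[:3, 3] = torch.tensor([1.0, -2.0, 0.5])

src_c2w = torch.eye(4)
points = torch.tensor([[0.3, 0.1, 2.0, 1.0]])

before = to_src_camera(points, src_c2w)
after = to_src_camera((G @ points.T).T, G @ src_c2w)

print(torch.allclose(before, after))  # True: only the relative geometry matters
```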

SeaBird-Go commented 1 year ago

Sorry, I'm confused about this. What does "as long as the relative pose is the same" mean? In the training stage, you randomly sample a source RGB image and a target RGB image, so the relative poses may be different.

I understand that both pixelNeRF and your method model the NeRF in the input-view coordinate system. So in this line, https://github.com/ken2576/vision-nerf/blob/main/models/projection.py#L84, you transform the sampled 3D points from the scene coordinate system into the source-view coordinate system using the camera pose of the source view, and then interpolate features from the source image feature maps.

In this case, if we set src_c2w_mats to a fixed identity matrix, can we still obtain the correct transformations?

ken2576 commented 1 year ago

The high-level idea is that you have some target 3D points, which you can transform into the source camera coordinate system and use the xyz in camera coordinates as input. Now, if you don't have a GT source camera pose (but you know the relative pose between the source and the target), then you can set src_c2w_mats to the identity.

And what happens now is:

  1. Unproject samples to 3D points using the target pose (relative to the source pose)
  2. Transform them to the source camera coordinate system (the identity, because the points are already expressed relative to the source camera)
  3. Use xyz as input to the MLP

So yes, the transformation will still be correct.
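
A minimal standalone sketch of these three steps with an identity source pose (names and shapes are illustrative, not the repo's exact code):

```python
import torch

src_c2w = torch.eye(4)                      # source pose: identity (step 2)
tgt_c2w_rel = torch.eye(4)                  # target pose *relative to the source*
tgt_c2w_rel[:3, 3] = torch.tensor([0.0, 0.0, 0.5])

# 1. Unproject a sample along a target ray into 3D. Because the target pose is
#    already expressed relative to the source camera, the point lands directly
#    in the source camera's frame of reference.
ray_o = tgt_c2w_rel[:3, 3]                                         # ray origin
ray_d = tgt_c2w_rel[:3, :3] @ torch.tensor([0.0, 0.0, -1.0])       # a view direction
depth = 2.0
point_world = ray_o + depth * ray_d

# 2. Transform into the source camera frame (a no-op here, since src_c2w is identity).
point_h = torch.cat([point_world, torch.ones(1)])
point_src = (torch.inverse(src_c2w) @ point_h)[:3]

# 3. Feed xyz (together with image features, positional encoding, etc.) to the MLP.
print(point_src)
```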