kxhit / EscherNet

[CVPR2024 Oral] EscherNet: A Generative Model for Scalable View Synthesis
https://kxhit.github.io/EscherNet

Question about intrinsics for 3D reconstruction #5

Open AlbertoRemus opened 8 months ago

AlbertoRemus commented 8 months ago

Hi! Really nice work.

I'm using EscherNet 6DoF, and in my dataset I would need to use different intrinsics for different images. I guess this is not an issue for the NeuS renderer.

My question, however, is about both the intrinsics and the translation range used by Objaverse:

https://github.com/kxhit/EscherNet/blob/569240f81d729aae413167a898c2af21dc969dd2/6DoF/dataset.py#L82-L85
https://github.com/kxhit/EscherNet/blob/569240f81d729aae413167a898c2af21dc969dd2/3drecon/renderer/renderer.py#L493

Do I need to use those specific Objaverse intrinsics or is there a workaround?

All the best,

Alberto

kxhit commented 8 months ago

Hi, thanks for your interest! EscherNet is trained and intended to be used with images sharing the same intrinsics. For example, we trained on Objaverse and it still works when tested on the NeRF dataset (whose intrinsics differ from Objaverse's).

For your case, I think you need to align your different intrinsics to a single shared intrinsic by scaling your images, then center-crop them into square images. Before running NeuS reconstruction, check that the output of EscherNet looks reasonable.
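For instance, here is a minimal sketch of such an alignment (not from the EscherNet codebase; it assumes a 3x3 pinhole intrinsic `K` with fx ≈ fy, no skew, and uses PIL): it rescales the image to a shared target focal length, then center-crops a square around the principal point, updating `K` to match.

```python
import numpy as np
from PIL import Image

def align_intrinsics(img, K, f_target, out_size):
    """Rescale `img` so its focal length becomes `f_target` (pixels),
    then center-crop an `out_size` x `out_size` square around the
    principal point, updating the 3x3 intrinsic `K` accordingly."""
    K = K.astype(np.float64)                    # copy and ensure float
    s = f_target / K[0, 0]                      # scale factor taking fx -> f_target
    w, h = img.size
    img = img.resize((round(w * s), round(h * s)), Image.LANCZOS)
    K[:2, :] *= s                               # fx, fy, cx, cy all scale with the image

    left = round(K[0, 2] - out_size / 2)        # crop box centered on the principal point
    top = round(K[1, 2] - out_size / 2)
    img = img.crop((left, top, left + out_size, top + out_size))
    K[0, 2] -= left                             # principal point shifts with the crop
    K[1, 2] -= top
    return img, K
```

After this, all images share the same `K`, which can then be matched against what the pretrained model expects.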

Please let me know if it works, and I'm happy to follow up!

Best, Xin

AlbertoRemus commented 8 months ago

Hi, many thanks for your reply!

Let me add some more details:

In my dataset I deal with rendered images that have different intrinsics. Also, the camera translation does not fall within the range of the Objaverse dataset (which seems to be [1.5, 2.2]).

For the output poses and the subsequent 3D reconstruction from generated images, I solved the problem by using the ground-truth pose values of the GSO "school_bus1" object (which follow the training dataset's criteria).

However, the issue remains for the input poses. I tried to scale the translation t in [R, t] so that it falls within the training range (without modifying the image). For example, one image had a translation norm of about 7, which I divided by a factor of 5. However, while this seems to work reasonably for a single input image, performance deteriorates as the number of input images increases, apparently in contrast with your results.
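For reference, a minimal sketch of the rescaling just described, assuming 4x4 camera-to-world poses with the object at the world origin:

```python
import numpy as np

def rescale_pose_radius(c2w, scale):
    """Divide the camera-to-world translation by `scale`, pulling the camera
    toward the (assumed) object at the world origin while keeping the
    viewing direction unchanged. `scale` is chosen so that the resulting
    radii land in the training range [1.5, 2.2]."""
    out = np.array(c2w, copy=True)
    out[:3, 3] /= scale
    return out
```

This leaves the pixels untouched, so the object is implicitly re-imagined at a different metric size, which is the concern discussed next.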

I guess this is due to the scaling method, which could break the object's geometry. I would also prefer to keep the images as they are (accepting that the object is effectively imagined bigger or smaller as I decrease or increase the initial translation), because otherwise the objects could become too small for EscherNet generation.

May I kindly ask whether you have any hint (perhaps on how to act on the translation extrinsics) for bringing the translations into the expected range without breaking the geometry, while still benefiting from multiple input images?

I also attach some images showing the performance drop with respect to the number of input images.

[input_1view / 0_1view: input and generated views with 1 input image]

[input_5view / 0_5view: input and generated views with 5 input images]

I also attach the input views we have: render_mvs_25_car.zip. The translations here were obtained by dividing the original ones by 5, so as to fall in the [1.5, 2.2] range for the sake of testing, even though the geometry might break.

And the output poses we used: render_sync_36_single_schoolbus1.zip.

The car object and poses have been canonically aligned to the school bus.

AlbertoRemus commented 8 months ago

As an update, our problem can be summarized as follows:

[image: summary of the problem]

Hope it helps.

AlbertoRemus commented 8 months ago

[Screenshot from 2024-03-09 16-44-04: excerpt from the paper]

Could our problem be related to this part of the paper?

kxhit commented 8 months ago

Hi @AlbertoRemus, thanks for sharing all those details and updates!

The current 6DoF CaPE is trained with Objaverse 3DoF data (every camera points at the same 3D point at all times), so the test images are required to look at the same 3D point as well. We are trying to work on more general training, but we currently lack GPU resources.
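One quick way to check this condition (a sketch under the assumption of 4x4 camera-to-world poses with the camera looking along its local -z axis, the OpenGL convention) is to intersect all optical axes in a least-squares sense and inspect the residual:

```python
import numpy as np

def common_lookat_point(c2w_poses):
    """Least-squares intersection of all camera optical axes.
    Returns the intersection point and the RMS distance of the axes
    from it; a small residual indicates the cameras share a single
    look-at target (i.e. the data is effectively 3DoF)."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    axes = []
    for c2w in c2w_poses:
        o = c2w[:3, 3]                      # camera center
        d = -c2w[:3, 2]                     # viewing direction (-z column)
        d /= np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)      # projector onto the plane normal to d
        A += P
        b += P @ o
        axes.append((o, d))
    p = np.linalg.solve(A, b)               # minimizes the sum of squared axis distances
    res = [np.linalg.norm((p - o) - ((p - o) @ d) * d) for o, d in axes]
    return p, float(np.sqrt(np.mean(np.square(res))))
```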

If your data is 3DoF (all cameras pointing at the same 3D point), you could align the focal lengths by scaling the translation along the radius direction. In this case, resizing the image is equivalent to changing the focal length, or equivalently, to scaling the radius.
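Concretely, a sketch of this equivalence under a pinhole approximation (apparent object size ≈ f/d), again assuming the look-at point is the world origin:

```python
import numpy as np

def align_focal_by_radius(c2w, f, f_target):
    """Relabel an image taken at focal length `f` as if taken with the
    shared focal length `f_target`, compensating by scaling the camera
    distance: apparent size ~ f / d, so the image stays consistent if
    the radius d is scaled by f_target / f. Assumes the look-at point is
    the world origin, so the translation column is the radius vector."""
    out = np.array(c2w, copy=True)
    out[:3, 3] *= f_target / f
    return out
```

The resize-based alternative is the same trade-off in image space: scaling the image by f_target / f instead of touching the pose.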

I just had a quick look at your provided data. The poses of 018 and 021 have similar distances, but their images seem to vary a lot. If the intrinsics are already aligned, maybe you should resize the images so the object appears at a similar size, and scale the radius accordingly at the same time. A quick sanity check would be to reconstruct the scene with NeRF and verify that the conversion is correct.

I'm quite interested in your application scenario and in how different those cameras' intrinsics are; we usually assume the images are taken by similar cameras. Would you mind sharing the intrinsics?

Thank you!

AlbertoRemus commented 8 months ago

Hi @kxhit, many thanks for your feedback!

The intrinsic parameters of 018 vs. 021 indeed vary a lot. You can find the full list here: intrinsics.zip.

Resizing the images could be a bit problematic: given the translations we have in the extrinsics, the scaling factor would be on the order of 5 to 7, which would probably make the images too small.

The application is to employ this in more in-the-wild scenarios.

kxhit commented 8 months ago

Hi @AlbertoRemus

Are those intrinsics for the original images or for the cropped images you provided here?

Would you mind sharing the raw data (original images, intrinsics, poses) with me (here or by email)? I would like to look into it; I think it is very valuable for more general real-world usage.

Thank you very much!

AlbertoRemus commented 8 months ago

@kxhit many thanks for your words! Yes, those are for the cropped ones. I'll send you everything via email, and we'll wrap up here if we find good solutions!

Alberto