liuff19 / ReconX

ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model
https://liuff19.github.io/ReconX/
MIT License

Some questions about pipeline #2

Closed: lebron12332 closed this issue 2 months ago

lebron12332 commented 2 months ago

Really wonderful work! After reading the paper, I have the following questions:

  1. The input is unposed sparse images, and you use DUSt3R, which can recover a point cloud and camera parameters for the input images. Are the estimated camera parameters used anywhere in the pipeline?
  2. The model produces a set of consistent frames, and these frames are used to optimize the 3DGS. How are the camera parameters for the generated frames obtained?
  3. You mention in the paper that video interpolation is used to generate new frames. Since video interpolation takes two images as input, how do you extend this to more than two input images? The paper states: "Our framework ReconX is agnostic to the number of input views. Specifically, given N views as input, we sample a plausible camera trajectory to render image pairs using our video diffusion models and finally optimize the 3D scene from all generated frames." But I still don't understand this part. Could you explain it?

Thank you in advance, and I look forward to your reply.

liuff19 commented 2 months ago

Thank you for your thoughtful questions and for your interest in our work. Here is a point-by-point response.

  1. Unlike existing camera-control-based methods, we do not use the estimated camera parameters during the training of the video diffusion model.
  2. For the generated consistent frames, we also use DUSt3R to estimate the camera parameters (see the first sketch below).
  3. Regarding multi-image input, we handle this by generating intermediate frames in a pair-wise manner between images, and we then use all of these generated frames to optimize the 3D scene (see the second sketch below). If you have more questions, feel free to email me, and we can schedule a time for further discussion.
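
For concreteness, here is a minimal sketch of the pose-estimation step in point 2, following the usage shown in DUSt3R's published demo; the checkpoint name is the public release and the frame paths are placeholders:

```python
import torch
from dust3r.inference import inference
from dust3r.model import AsymmetricCroCo3DStereo
from dust3r.utils.image import load_images
from dust3r.image_pairs import make_pairs
from dust3r.cloud_opt import global_aligner, GlobalAlignerMode

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AsymmetricCroCo3DStereo.from_pretrained(
    "naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt"  # public DUSt3R checkpoint
).to(device)

# Frames generated by the video diffusion model (placeholder paths).
frames = load_images(["frame_000.png", "frame_001.png", "frame_002.png"], size=512)
pairs = make_pairs(frames, scene_graph="complete", prefilter=None, symmetrize=True)
output = inference(pairs, model, device, batch_size=1)

# Globally align all pairwise predictions into a single coordinate frame.
scene = global_aligner(output, device=device, mode=GlobalAlignerMode.PointCloudOptimizer)
scene.compute_global_alignment(init="mst", niter=300, schedule="cosine", lr=0.01)

poses = scene.get_im_poses()  # per-frame 4x4 camera-to-world poses
focals = scene.get_focals()   # per-frame focal lengths
```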
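And a minimal sketch of the pair-wise strategy in point 3; `sample_camera_trajectory`, `generate_intermediate_frames`, `estimate_poses_with_dust3r`, and `optimize_3dgs` are hypothetical stand-ins for the corresponding stages, not functions from the ReconX codebase:

```python
def reconstruct(input_views):
    # Order the N unposed input views along a plausible camera trajectory
    # (the paper only states that such a trajectory is sampled).
    ordered = sample_camera_trajectory(input_views)
    all_frames = list(ordered)

    # Run the video diffusion model between each consecutive pair of views
    # to generate consistent intermediate frames.
    for view_a, view_b in zip(ordered, ordered[1:]):
        all_frames += generate_intermediate_frames(view_a, view_b)

    # As in point 2, camera parameters for all frames (input and generated)
    # are estimated with DUSt3R rather than taken from the diffusion model.
    poses, intrinsics = estimate_poses_with_dust3r(all_frames)

    # Finally, optimize a single 3DGS scene on every frame.
    return optimize_3dgs(all_frames, poses, intrinsics)
```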

rwn17 commented 2 months ago

@liuff19 A follow-up question: how do you align the estimated poses from DUSt3R with the GT poses provided by the dataset?