Closed: lebron12332 closed this issue 2 months ago
Really wonderful work! After reading the paper, I have the following questions:
- The input is sparse unposed images, and you use DUSt3R, which recovers a point cloud and camera parameters for the input images. Are the estimated camera parameters used anywhere in the pipeline?
- The model produces a set of consistent frames, and these frames are used to optimize the 3DGS. How are the camera parameters for the generated frames obtained?
- You mention in the paper that you use video interpolation to generate new frames. Since the video interpolation takes two images as input, how do you extend it to more than two input images? I noticed the paper mentions:
> Our framework ReconX is agnostic to the number of input views. Specifically, given N views as input, we sample a plausible camera trajectory to render image pairs using our video diffusion models and finally optimize the 3D scene from all generated frames.
But I still don't fully understand it; could you explain? My current (possibly wrong) reading is sketched below. Thank you in advance, and I look forward to your reply.
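For concreteness, here is a minimal sketch of how I currently read that passage, assuming the N-view case reduces to running the two-image model on consecutive view pairs; `interpolate_pair` is a hypothetical stand-in for the video diffusion model, not an API from the ReconX code:

```python
from typing import Callable, List, Sequence

def frames_from_n_views(
    views: Sequence[str],
    interpolate_pair: Callable[[str, str], List[str]],
) -> List[str]:
    """Pairwise reduction of the N-view case: assume the views are
    already ordered along a plausible camera trajectory, run the
    two-image video model on each consecutive pair, and pool all
    generated frames for the final 3DGS optimization."""
    frames: List[str] = []
    for a, b in zip(views, views[1:]):
        frames.extend(interpolate_pair(a, b))  # frames between views a and b
    return frames

# Toy check with a fake interpolator that just labels in-between frames.
fake = lambda a, b: [f"{a}->{b}:{i}" for i in range(3)]
print(frames_from_n_views(["v0", "v1", "v2"], fake))
```

Is this pairwise reduction roughly what happens, or is the trajectory sampled jointly over all N views?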
Thank you for your thoughtful questions and for your interest in our work. Our point-by-point response follows.
@liuff19 A follow-up question: how do you align the poses estimated by DUSt3R with the GT poses provided by the dataset?
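In case it clarifies what I mean by "align": a standard way to compare an estimated trajectory against GT is a similarity (Sim(3)) alignment of the camera centers via the Umeyama method, which also resolves DUSt3R's global scale ambiguity. A minimal sketch (my own, not from the ReconX codebase):

```python
import numpy as np

def umeyama_alignment(src: np.ndarray, dst: np.ndarray):
    """Estimate the Sim(3) transform (scale s, rotation R, translation t)
    that best maps src onto dst; both arrays have shape (N, 3)."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)  # cross-covariance of centered sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # flip one axis to avoid a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (src_c ** 2).sum(axis=1).mean()
    t = mu_dst - s * R @ mu_src
    return s, R, t

# Hypothetical usage: align DUSt3R camera centers to the dataset's GT
# centers, then report the mean residual translation error (ATE).
rng = np.random.default_rng(0)
est_centers = rng.random((5, 3))  # stand-in for DUSt3R camera centers
gt_centers = rng.random((5, 3))   # stand-in for GT camera centers
s, R, t = umeyama_alignment(est_centers, gt_centers)
aligned = (s * (R @ est_centers.T)).T + t
print("mean ATE:", np.linalg.norm(aligned - gt_centers, axis=1).mean())
```

Is something like this applied for evaluation, or is the alignment handled differently in ReconX?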