hwjiang1510 / LEAP

[ICLR 2024] Code for LEAP: Liberate Sparse-view 3D Modeling from Camera Poses

question about camera pose? #4

Closed: yuedajiong closed this issue 1 month ago

yuedajiong commented 10 months ago

Question 1: during training, you need known camera pose(s) for the related input image(s), right? That is: poses are NEEDED during training, but NOT needed during inference. Right?

Question 2: regarding the input data, if the original pose/R&t for Alice is front-facing but the pose for Bob is back-facing, for the same category of object (e.g., here, humans), how do you keep a 'consistent representation'?

hwjiang1510 commented 10 months ago

For problem 1: Yes. But different from prior works, we didn't learn camera pose estimation and didn't incorporate any pose representations within the model. The camera poses are only used for rendering during training.

For problem 2: The definition of poses should be consistent during training for rendering.
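
To make the answer to problem 1 concrete, here is a minimal sketch of this training setup (hypothetical function and variable names, not the actual LEAP code): the unposed images are mapped to a neural volume, and the ground-truth poses appear only when rendering that volume to compute the loss.

```python
import torch
import torch.nn.functional as F

def training_step(predictor, renderer, images, gt_poses, intrinsics, gt_views):
    """images: (B, V, 3, H, W) unposed input views; gt_views: target images for the loss."""
    # Stage 1: map the unposed images to a neural volume -- no pose is given to the model.
    neural_volume = predictor(images)

    # Stage 2: render the predicted volume from the ground-truth camera poses.
    # This is the only place poses are used, and only at training time.
    rendered = renderer(neural_volume, gt_poses, intrinsics)
    return F.mse_loss(rendered, gt_views)

@torch.no_grad()
def inference(predictor, images):
    """At test time the reconstruction needs no pose information at all."""
    return predictor(images)
```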

yuedajiong commented 10 months ago

@hwjiang1510 Thanks.

  1. Yes, I read your paper and code: no pose estimation (as opposed to robust-loc and COLMAP), and no pose representation in the model. (Already very impressive, and very close to perfect.) However, although the pose is only used in the stage-2 renderer and not in the stage-1 predictor, it is still used before the loss, so it is still needed as supervision. (Inference is perfect in this respect: it needs no pose information at all.) So we can say: as long as a renderer is used in the training process, the renderer generally requires poses, so no matter in which stage the pose is used during training, the pose is indispensable, unless the algorithm does not require a renderer (in self-supervision, one is generally required), or the renderer does not need the pose corresponding to each image, or does not require a pose at all.

  2. This is not a real issue or question, just advice seeking. I am working on a unified stereo vision design.
    The question for me is: how to design a pose representation? It should be something like real_pose = f_linear_mapping(unified_pose). That is, for human pose representation, dataset A may put azimuth 0 at the front side and move clockwise, while dataset B may put azimuth 0 at the back side and move counterclockwise. In our model pipeline, we could add a mapping layer that converts the different representations into a unified pose (see the sketch after this list).
    Why do I have this question? Because for human-like objects there is a clear front/back definition, but for some other objects, such as a cup or a mouse/computer, it is difficult to define the starting point for azimuth and elevation. Yet we need this consistency for generalizability, at least within the same category, preferably implicitly, without explicit category-label management.
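
A rough sketch of the mapping-layer idea in point 2 (entirely hypothetical, not part of LEAP): each dataset gets a fixed transform that maps its own pose convention into a shared canonical frame, so that azimuth 0 means the same thing everywhere before the poses reach the renderer.

```python
import numpy as np

def rot_z(deg: float) -> np.ndarray:
    """Rotation about the up (z) axis by `deg` degrees."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

# Hypothetical fixed offset per dataset, mapping that dataset's convention into a
# shared canonical frame (here: azimuth 0 = front side of the object).
DATASET_OFFSET = {
    "dataset_a": rot_z(0.0),     # dataset A already uses azimuth 0 = front
    "dataset_b": rot_z(180.0),   # dataset B uses azimuth 0 = back, so rotate by 180 degrees
}

def to_canonical(R_dataset: np.ndarray, dataset: str) -> np.ndarray:
    """canonical_R = offset @ dataset_R, i.e. the fixed linear map real_pose = f(unified_pose)."""
    return DATASET_OFFSET[dataset] @ R_dataset

# e.g. R_canonical = to_canonical(R_from_dataset_b, "dataset_b")
```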

thanks.

hwjiang1510 commented 10 months ago
  1. Yes, the pose will always be necessary during training and your understanding is correct. But for inference, it is not necessary.

  2. To me, your description of the two human pose datasets sounds more like a difference in pose coordinate-frame design. This question may be related to the "canonical pose": generally, for category-level pose estimation, you define one pose as canonical. Alternatively, you can predict relative poses given two images. The following papers may be relevant:
     [1] Zhang, Jason Y., et al. "RelPose: Predicting Probabilistic Relative Rotation for Single Objects in the Wild." ECCV 2022.
     [2] Jiang, Hanwen, et al. "Few-View Object Reconstruction with Unknown Categories and Camera Poses." arXiv:2212.04492 (2022).
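
For the relative-pose alternative mentioned in point 2, here is a small sketch of the underlying computation (standard rigid-transform algebra, hypothetical function name): the transform between two cameras does not depend on how azimuth 0 is defined, which sidesteps the canonical-frame question.

```python
import numpy as np

def relative_pose(R1, t1, R2, t2):
    """Given world-to-camera extrinsics (R_i, t_i), i.e. x_cam_i = R_i @ x_world + t_i,
    return (R_rel, t_rel) such that x_cam2 = R_rel @ x_cam1 + t_rel."""
    R_rel = R2 @ R1.T        # rotation from the camera-1 frame to the camera-2 frame
    t_rel = t2 - R_rel @ t1  # translation between the two camera frames
    return R_rel, t_rel
```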

yuedajiong commented 10 months ago

@hwjiang1510 Thanks. Yes, relative pose is another solution.

DavidYan2001 commented 9 months ago

Hi dear researchers, regarding the discussion around question #1: does that mean this method aims at mapping sparse-view images without poses to a field (and the field is the actual output), while previous methods mapped images together with pose information to a field? Or, from another perspective, does the process of mapping sparse-view images without poses to a field consist of obtaining a neural volume with learned priors, which is then used to decode a radiance field?

hwjiang1510 commented 9 months ago


Your understanding is correct for both questions.
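
Putting the two readings together, a minimal sketch of the pose-free pipeline as discussed in this thread (hypothetical module names, not the actual LEAP interface): unposed sparse views are mapped to a neural volume using learned priors, the volume is decoded into a radiance field, and a pose is only needed for the novel view one chooses to render from that field.

```python
import torch

@torch.no_grad()
def reconstruct_and_render(encoder, decoder, renderer, images, query_pose, intrinsics):
    """images: (V, 3, H, W) unposed sparse views of one object."""
    # Map the unposed images to a neural volume (the field is the actual output).
    neural_volume = encoder(images.unsqueeze(0))

    # Decode the volume into radiance-field parameters (density + color features).
    radiance_field = decoder(neural_volume)

    # A camera pose is only needed to render a chosen *novel* view from the field;
    # it describes the query view, not the unposed inputs.
    return renderer(radiance_field, query_pose, intrinsics)
```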