tedyhabtegebrial / SoftOcclusionMSI

Code accompanying the CVPR 2022 paper "SOMSI: Spherical Novel View Synthesis with Soft Occlusion Multi-Sphere Images"

generalizability #2

Closed · Eric-chuan closed this 2 years ago

Eric-chuan commented 2 years ago

Nice work! Can this MPI-based approach be generalized to other scenes?

tedyhabtegebrial commented 2 years ago

Thanks Eric! At the moment the model is designed to be trained per scene.

Eric-chuan commented 2 years ago

Understood. Do you have any suggestions for generalizing to other scenes? By the way, will a revised version of the MODS baseline, which takes a triplet as input, be open-sourced?

tedyhabtegebrial commented 2 years ago

Hi Eric,

Regarding baselines: yes, we will release them. First the S-NeRF baseline will be released, then the MODS baseline. Hopefully both will be out this week.

About generalization across scenes: the issue lies in the current model design. The architecture maps each ray direction in the reference camera onto an MSI-like representation of transparency and color along that ray. In order to train such a model we need thousands of back-prop steps on every given scene.
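
For readers less familiar with MSI rendering, here is a minimal sketch of the per-ray compositing this describes. The function name and shapes are illustrative, not taken from the SOMSI code:

```python
import torch

def composite_msi_ray(colors, alphas):
    """Front-to-back alpha compositing along one ray.

    colors: (L, 3) per-sphere RGB; alphas: (L,) per-sphere opacity,
    ordered from the nearest sphere to the farthest.
    (Hypothetical helper; the actual SOMSI code may differ.)
    """
    # Transmittance: fraction of light surviving the spheres in front.
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)
    weights = trans * alphas          # per-layer contribution
    return (weights[:, None] * colors).sum(dim=0)

# Toy usage: 8 sphere layers along one ray direction.
rgb = composite_msi_ray(torch.rand(8, 3), torch.rand(8))
```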

What you need for generalization is a model that can perform amortized inference of the scene representation from input images + camera poses. This requires two things: 1) a model that can infer scene structure + appearance from the input views. Our 1x1 convolutional model cannot do this, since it is impossible to infer scene geometry with such a limited receptive field. 2) A scene representation that makes it easy to infer scene structure; for example, Sphere Sweep Volumes make depth estimation easier to tackle. Therefore, one possible way to solve this would be to build Sphere Sweep Volumes and pass them to a CNN with a large enough receptive field (a standard CNN as in MODS, or a spherical CNN, though I haven't seen any works using the latter). See the sketch below.
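
A rough sketch of what building such a Sphere Sweep Volume could look like, assuming an equirectangular camera model. The interface and projection conventions here are illustrative assumptions, not the MODS or SOMSI implementation:

```python
import torch
import torch.nn.functional as F

def sphere_sweep_volume(src_img, ref_to_src, ref_dirs, radii):
    """Warp a source view onto concentric spheres around the reference camera.

    src_img:    (3, H, W) equirectangular source image
    ref_to_src: (4, 4) reference-to-source rigid transform
    ref_dirs:   (H, W, 3) unit ray directions of the reference camera
    radii:      iterable of candidate sphere radii
    Returns an SSV of shape (len(radii)*3, H, W).
    (Hypothetical interface; details depend on the camera model.)
    """
    planes = []
    for r in radii:
        pts = ref_dirs * r                                 # points on the sphere
        pts_h = torch.cat([pts, torch.ones_like(pts[..., :1])], dim=-1)
        pts_src = (pts_h @ ref_to_src.T)[..., :3]          # into the source frame
        # Equirectangular lookup: 3D point -> (longitude, latitude) -> pixel.
        lon = torch.atan2(pts_src[..., 0], pts_src[..., 2])
        lat = torch.asin(F.normalize(pts_src, dim=-1)[..., 1])
        grid = torch.stack([lon / torch.pi, 2 * lat / torch.pi], dim=-1)
        planes.append(F.grid_sample(src_img[None], grid[None],
                                    align_corners=True)[0])
    # Stack per-radius warps; feed this to a CNN with a large receptive field.
    return torch.cat(planes, dim=0)

# Toy usage: identity pose, 4 candidate radii.
H, W = 32, 64
dirs = F.normalize(torch.randn(H, W, 3), dim=-1)
ssv = sphere_sweep_volume(torch.rand(3, H, W), torch.eye(4),
                          dirs, [1.0, 2.0, 4.0, 8.0])
```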

Another direction would be to feed images as a collection of patches to a transformer, which should predict a light-field representation of the scene (see [1], which was done for perspective images but should be possible to extend to spherical images). Perhaps one can also try to improve the input scene representation from raw patches to something that makes inferring the scene geometry easier, for example by using geometric inductive biases from SSVs, epipolar geometry, etc.

[1] https://openaccess.thecvf.com/content/CVPR2022/papers/Sajjadi_Scene_Representation_Transformer_Geometry-Free_Novel_View_Synthesis_Through_Set-Latent_Scene_CVPR_2022_paper.pdf
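
A toy PyTorch sketch of the patch-transformer idea from [1]: input views are encoded into a set-latent scene representation, and each query ray is decoded directly to a color (a light-field prediction). All names and sizes are illustrative assumptions; this is far smaller than the actual SRT model:

```python
import torch
import torch.nn as nn

class TinySRT(nn.Module):
    """Minimal SRT-style sketch (hypothetical, not the model from [1])."""
    def __init__(self, patch=16, dim=128):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.ray_embed = nn.Linear(6, dim)   # (origin, direction) -> query token
        self.cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.to_rgb = nn.Linear(dim, 3)

    def forward(self, images, rays):
        # images: (B, V, 3, H, W) input views; rays: (B, R, 6) query rays
        B, V, C, H, W = images.shape
        tokens = self.embed(images.flatten(0, 1))   # patchify every view
        tokens = tokens.flatten(2).transpose(1, 2)  # (B*V, patches, dim)
        tokens = tokens.reshape(B, -1, tokens.shape[-1])
        latents = self.encoder(tokens)              # set-latent scene repr.
        q = self.ray_embed(rays)                    # one token per query ray
        out, _ = self.cross(q, latents, latents)    # rays attend to the scene
        return self.to_rgb(out)                     # (B, R, 3) colors

# Toy usage: 2 input views, 4 query rays.
model = TinySRT()
rgb = model(torch.rand(1, 2, 3, 64, 64), torch.rand(1, 4, 6))
```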

Eric-chuan commented 2 years ago

Many thanks for your patient answers and great insights. The second direction you mentioned is one I hadn't considered, and I will look into it.

Eric-chuan commented 2 years ago

One more question: does the lack of generalizability mean that dynamic content (within the same scene) is also difficult to handle?

tedyhabtegebrial commented 2 years ago

There are already lots of NeRF models for dynamic scenes. Dynamic scenes are challenging, but for reasons different from generalization. By lack of generalization I meant that a model can only be used to render a single scene (dynamic or static).

Eric-chuan commented 2 years ago

Got it, thanks again.