What a nice work! Can this MPI-based approach be generalized to other scenes?
Thanks Eric! At the moment the model is designed to be trained per scene.
Understood. Do you have any suggestions for generalizing to other scenes? By the way, will a revised version of the MODS baseline, which takes a triplet as input, be open-sourced?
Hi Eric,
Regarding baselines: yes, we will release them. The S-NeRF baseline will be released first, then the MODS baseline. Hopefully, both will be out this week.
About generalization across scenes: the issue lies in the current model design. The current architecture maps each ray direction in the reference camera onto an MSI-like representation of transparency and color along that ray direction. To train such a model we need thousands of back-prop steps on every given scene.
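To make that concrete, here is a minimal PyTorch sketch of what such a per-ray model looks like; the names, shapes, and layer sizes are my own illustration, not the actual code from this repo:

```python
import torch
import torch.nn as nn

class PerRayMSIHead(nn.Module):
    """Sketch: a 1x1-conv network mapping per-ray features (e.g. encoded
    ray directions) to (alpha, RGB) on D concentric spheres.
    Illustrative only, not the repo's implementation."""
    def __init__(self, in_ch=64, hidden=128, num_spheres=32):
        super().__init__()
        self.num_spheres = num_spheres
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            # 4 values (alpha + RGB) per sphere, predicted independently
            # for every ray: the receptive field is a single pixel.
            nn.Conv2d(hidden, 4 * num_spheres, kernel_size=1),
        )

    def forward(self, ray_feats):              # (B, in_ch, H, W)
        out = self.net(ray_feats)              # (B, 4*D, H, W)
        B, _, H, W = out.shape
        out = out.view(B, self.num_spheres, 4, H, W)
        alpha = torch.sigmoid(out[:, :, :1])   # (B, D, 1, H, W)
        rgb = torch.sigmoid(out[:, :, 1:])     # (B, D, 3, H, W)
        return alpha, rgb
```

Because every layer is 1x1, each ray is handled in isolation; the geometry along a ray is only discovered by optimizing the weights against one specific scene, so the weights effectively are the scene representation.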
What you need for generalization is a model that can perform amortized inference of the scene representation from input images + camera poses. This requires two things: 1) a model that can infer the scene structure + appearance from the input views. Our 1x1 convolutional model cannot do this, as it would be impossible to infer the scene geometry with such a limited receptive field. 2) A scene representation that makes it easy to infer scene structure; for example, Sphere Sweep Volumes make depth estimation easier to tackle. Therefore, one possible way to solve this would be to build Sphere Sweep Volumes and pass them to a CNN with a large enough receptive field (a standard CNN as in MODS, or a spherical CNN, although I haven't seen any works using the latter).
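A rough sketch of that pipeline, assuming the warp grids that resample each source view onto D concentric spheres are precomputed (computing them depends on the camera model and is omitted here); everything below is illustrative, not code from this repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sphere_sweep_volume(src_imgs, warp_grids):
    # src_imgs: list of V source images, each (3, H, W)
    # warp_grids: list of V grids, each (D, H, W, 2) -- hypothetical
    # precomputed resampling that projects a source view onto each of
    # D spheres around the reference camera
    vols = []
    for img, grid in zip(src_imgs, warp_grids):
        warped = F.grid_sample(img.expand(grid.shape[0], -1, -1, -1),
                               grid, align_corners=True)  # (D, 3, H, W)
        vols.append(warped.flatten(0, 1))                 # (D*3, H, W)
    return torch.cat(vols, dim=0).unsqueeze(0)            # (1, V*D*3, H, W)

class SSVEncoder(nn.Module):
    """CNN with an actual receptive field (3x3 convs), so geometry can be
    inferred from photoconsistency across the sweep volume in a single
    forward pass, unlike the per-ray 1x1 model."""
    def __init__(self, num_views, num_spheres, hidden=64):
        super().__init__()
        self.num_spheres = num_spheres
        self.net = nn.Sequential(
            nn.Conv2d(num_views * num_spheres * 3, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 4 * num_spheres, 3, padding=1),
        )

    def forward(self, ssv):                    # (B, V*D*3, H, W)
        out = self.net(ssv)
        B, _, H, W = out.shape
        out = out.view(B, self.num_spheres, 4, H, W)
        return torch.sigmoid(out[:, :, :1]), torch.sigmoid(out[:, :, 1:])
```

With the volume as input, inference is amortized: a new scene needs only a forward pass rather than thousands of per-scene optimization steps.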
Another direction would be to feed images as a collection of patches to a transformer, which should predict a light-field representation of the scene (see [1], which was done for perspective images but should be possible to extend to spherical images). Perhaps one can also try to improve the input scene representation from raw patches to something that makes inferring the scene geometry easier, for example by using geometric inductive biases from SSVs, epipolar geometry, etc.
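For completeness, a loose sketch of this direction; the token sizes, the pose encoding (flattened 3x4 extrinsics), and the Plücker-style 6D ray parameterization are my assumptions, not anything from [1] or this repo:

```python
import torch
import torch.nn as nn

class PatchLightFieldNet(nn.Module):
    """Sketch: encode input views as patch tokens (camera pose folded into
    the embedding) and decode query rays directly to colors, in the spirit
    of light-field networks. All sizes are illustrative."""
    def __init__(self, patch_dim=3 * 16 * 16, d_model=256, n_layers=4):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)
        self.pose_embed = nn.Linear(12, d_model)   # flattened 3x4 pose
        enc = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, n_layers)
        self.ray_embed = nn.Linear(6, d_model)     # 6D ray coordinates
        dec = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, 2)
        self.to_rgb = nn.Linear(d_model, 3)

    def forward(self, patches, poses, query_rays):
        # patches: (B, N, patch_dim), poses: (B, N, 12), rays: (B, R, 6)
        tokens = self.patch_embed(patches) + self.pose_embed(poses)
        memory = self.encoder(tokens)              # scene representation
        queries = self.ray_embed(query_rays)
        return torch.sigmoid(self.to_rgb(self.decoder(queries, memory)))
```

Here the scene representation is the set of encoded tokens, so rendering a query ray is again a single forward pass with no per-scene training.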
Many thanks for your patient answers and great insights. The second direction you mentioned is one I hadn't considered, and I will look into it.
One more question. Does the lack of generalizability mean that dynamic scenes (motion within the same scene) are also difficult to handle?
There are already lots of NeRF models for dynamic scenes. Dynamic scenes are challenging, but for reasons different from generalization. By lack of generalization I meant that a model can only be used to render a single scene (dynamic or static).
Got it, thanks again.