We introduce GAUDI, a generative model capable of capturing the distribution of complex and realistic 3D scenes that can be rendered immersively from a moving camera. We tackle this challenging problem with a scalable yet powerful approach, where we first optimize a latent representation that disentangles radiance fields and camera poses. This latent representation is then used to learn a generative model that enables both unconditional and conditional generation of 3D scenes. Our model generalizes previous works that focus on single objects by removing the assumption that the camera pose distribution can be shared across samples. We show that GAUDI obtains state-of-the-art performance in the unconditional generative setting across multiple datasets and allows for conditional generation of 3D scenes given conditioning variables like sparse image observations or text that describes the scene.
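The two-stage recipe in the abstract (first optimize a per-sample latent that disentangles the radiance field from the camera path, then fit a generative model over those latents) can be sketched roughly as an auto-decoder with a split latent table. This is a minimal illustration under assumed names and shapes, not the paper's actual architecture:

```python
import numpy as np

# Hedged sketch of stage 1 (auto-decoding): one optimizable latent per
# trajectory, split into a scene half (conditions the radiance field)
# and a pose half (conditions the camera-path decoder). In practice these
# latents are optimized jointly with the decoders via SGD; stage 2 then
# fits a generative model over the optimized latents.
rng = np.random.default_rng(0)
num_trajectories, latent_dim = 10, 64
latents = rng.normal(size=(num_trajectories, latent_dim))

def get_disentangled_latent(idx: int):
    """Return the (scene, pose) halves of trajectory idx's latent."""
    z = latents[idx]
    z_scene, z_pose = z[: latent_dim // 2], z[latent_dim // 2 :]
    return z_scene, z_pose

z_scene, z_pose = get_disentangled_latent(3)
```

Keeping the two halves separate is what lets the model drop the assumption of a shared camera-pose distribution: each sample carries its own pose latent.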
🔑 Key idea:
3D scene rendering from a moving camera. Notice that the trajectory data $X := \lbrace x_i \rbrace_{i=0}^{n}$ is a variable-length sequence of RGB images, depth maps, and 6-DoF camera poses.
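A single trajectory element $x_i$ can be pictured as a small container; the shapes and field names below are illustrative assumptions, not the paper's data format:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Frame:
    """One trajectory element x_i (shapes are illustrative)."""
    rgb: np.ndarray    # (H, W, 3) image, float values in [0, 1]
    depth: np.ndarray  # (H, W) depth map
    pose: np.ndarray   # (4, 4) SE(3) camera pose (6 DoF)

def make_trajectory(n: int, h: int = 8, w: int = 8) -> list:
    """Variable-length trajectory X = {x_0, ..., x_n} with dummy data."""
    rng = np.random.default_rng(0)
    return [
        Frame(
            rgb=rng.random((h, w, 3)),
            depth=rng.random((h, w)),
            pose=np.eye(4),
        )
        for _ in range(n + 1)
    ]

X = make_trajectory(5)  # n + 1 = 6 frames
```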
They paid homage to Antoni Gaudí with their method's name. — "The creation continues incessantly through the media of humans."
💪 Strength:
Interestingly, they account for the entropy of camera poses when perturbing the pose latent in the radiance field decoder (Sec. 3.1).
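As a generic illustration of what "entropy of camera poses" could mean in practice (this is not GAUDI's formulation, just one common way to quantify it): fit a Gaussian to the camera positions and take its differential entropy, $\tfrac{1}{2}\log\big((2\pi e)^d \det \Sigma\big)$.

```python
import numpy as np

def gaussian_entropy(positions: np.ndarray) -> float:
    """Differential entropy of a Gaussian fit to (N, d) camera positions."""
    d = positions.shape[1]
    # Regularized sample covariance to keep the log-determinant finite.
    sigma = np.cov(positions, rowvar=False) + 1e-6 * np.eye(d)
    _, logdet = np.linalg.slogdet(sigma)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

rng = np.random.default_rng(0)
spread = gaussian_entropy(rng.normal(scale=1.0, size=(100, 3)))
tight = gaussian_entropy(rng.normal(scale=0.1, size=(100, 3)))
# A widely spread camera path has higher entropy than a tightly clustered one.
```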
😵 Weakness:
It requires ground-truth poses and depth images.
I couldn't find an ablation study on the use of depth images, even though many related works do not use depth information at all.
Apart from the entropy of camera poses and the temporal position $s$ in the camera pose decoder, why didn't they exploit the inductive bias of trajectory data? Are those two signals enough for modeling?
🤔 Confidence:
Medium
✏️ Memo:
I attended the NeurIPS 2022 Expo Talk by Apple about this work.
GAUDI: A Neural Architect for Immersive 3D Scene Generation
Bautista et al., NeurIPS 2022