
Official implementation of the ECCV 2024 paper "SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding".
https://scene-verse.github.io

# SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang

 

[Paper] [arXiv] [Project Page] [Video] [Data] [Model]

 

*SceneVerse teaser figure*

We propose SceneVerse, the first million-scale 3D vision-language dataset, comprising 68K 3D indoor scenes and 2.5M vision-language pairs. We demonstrate the scaling effect by (i) achieving state-of-the-art performance on all existing 3D visual grounding benchmarks and (ii) showcasing zero-shot transfer capabilities with our GPS (Grounded Pre-training for Scenes) model.
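For intuition about the GPS objectives referenced in the ablation tables below (Refer, Obj-lvl, and Scene-lvl losses), the object-level alignment can be pictured as a symmetric InfoNCE loss between 3D object features and phrase features. The sketch below is our own schematic illustration, not the repository's implementation; the function name, tensor shapes, and temperature value are all assumptions.

```python
import torch
import torch.nn.functional as F

def object_level_alignment_loss(obj_feats: torch.Tensor,
                                txt_feats: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between matched 3D object and phrase features.

    obj_feats: (N, D) pooled features of N objects in a batch.
    txt_feats: (N, D) features of the N corresponding referring phrases.
    Matched pairs lie on the diagonal of the similarity matrix.
    """
    obj = F.normalize(obj_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = obj @ txt.t() / temperature  # (N, N) scaled cosine similarities
    targets = torch.arange(obj.size(0), device=obj.device)
    # Average the object-to-text and text-to-object cross-entropies.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```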

## News

## Data

See DATA.md for detailed instructions on data download, processing, and visualization. The data inventory is listed below; a hypothetical loading sketch follows the table:

| Dataset | Object Caption | Scene Caption | Ref-Annotation | Ref-Pairwise<br>`rel2` | Ref-MultiObject<br>`relm` | Ref-Star<br>`star` | Ref-Chain (Optional)<br>`chain` |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ScanNet | ✅ | ✅ | ScanRefer, Nr3D | ✅ | ✅ | ✅ | ✅ |
| MultiScan | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| ARKitScenes | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| HM3D | ✅ | ✅ | `template` | ✅ | ✅ | ✅ | ✅ |
| 3RScan | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Structured3D | `template` | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| ProcTHOR | `template` | ❌ | ❌ | `template` | `template` | `template` | `template` |
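To give a rough sense of how the released annotations might be consumed, here is a minimal loading sketch. The directory layout, file names, and record keys below are placeholders of our own; DATA.md defines the real paths and schema.

```python
import json
from pathlib import Path

# Placeholder layout -- consult DATA.md for the actual paths and schema.
ROOT = Path("SceneVerse")

def load_annotations(dataset: str, ann_type: str, split: str = "train") -> list:
    """Load one language-annotation file, e.g. pairwise referrals for ScanNet."""
    ann_file = ROOT / dataset / "annotations" / f"{ann_type}_{split}.json"
    with ann_file.open() as f:
        # Assumed record format: {"scan_id": ..., "target_id": ..., "utterance": ...}
        return json.load(f)

pairs = load_annotations("ScanNet", "ref_rel2")  # `rel2` = Ref-Pairwise above
print(len(pairs), pairs[0]["utterance"])
```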

## Training and Inference

See TRAIN.md for the inventory of available checkpoints and detailed instructions on training and testing with pre-trained checkpoints. The checkpoint inventory is listed below; a checkpoint-inspection sketch follows the table:

| Setting | Description | Corresponding Experiment | Checkpoint based on experiment setting |
|:---|:---|:---|:---|
| `pre-trained` | GPS model pre-trained on SceneVerse | 3D-VL grounding (Tab. 2) | Model |
| `scratch` | GPS model trained on datasets from scratch | 3D-VL grounding (Tab. 2)<br>SceneVerse-val (Tab. 3) | ScanRefer, Sr3D, Nr3D, SceneVerse-val |
| `fine-tuned` | GPS model fine-tuned on datasets with grounding heads | 3D-VL grounding (Tab. 2) | ScanRefer, Sr3D, Nr3D |
| `zero-shot` | GPS model trained on SceneVerse without data from ScanNet and MultiScan | Zero-shot Transfer (Tab. 3) | Model |
| `zero-shot text` | GPS model | Zero-shot Transfer (Tab. 3) | ScanNet, SceneVerse-val |
| `text-ablation` | Ablations on the type of language used during pre-training | Ablation on Text (Tab. 7) | Template only, Template+LLM |
| `scene-ablation` | Ablations on the use of synthetic scenes during pre-training | Ablation on Scene (Tab. 8) | Real only, S3D only, ProcTHOR only |
| `model-ablation` | Ablations on the use of losses during pre-training | Ablation on Model Design (Tab. 9) | Refer only, Refer+Obj-lvl, w/o Scene-lvl |
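Assuming the released checkpoints are ordinary PyTorch `.pth` files (the actual loading is wired through the configs described in TRAIN.md), inspecting one looks roughly like the following; the file path is a placeholder.

```python
import torch

# Placeholder path -- download a checkpoint from the table above first.
ckpt = torch.load("checkpoints/gps_pretrained.pth", map_location="cpu")

# Some checkpoints wrap the weights under a "state_dict" (or similar) key.
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state_dict)} tensors; first key: {next(iter(state_dict))}")
```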

## BibTeX

```bibtex
@inproceedings{jia2024sceneverse,
  title={SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding},
  author={Jia, Baoxiong and Chen, Yixin and Yu, Huangyue and Wang, Yan and Niu, Xuesong and Liu, Tengyu and Li, Qing and Huang, Siyuan},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}
```

## Acknowledgements

We thank the authors of ScanRefer, ScanNet, 3RScan, ReferIt3D, Structured3D, HM3D, ProcTHOR, ARKitScenes, and MultiScan for open-sourcing their awesome datasets. We also heavily adapted code from ScanQA, SQA3D, and 3D-VisTA for training and inference.