linukc / SUN3D_DATASETS

3D scene understanding datasets

SceneVerse #24

Open linukc opened 1 month ago


We propose SceneVerse, the first million-scale 3D vision-language dataset with 68K 3D indoor scenes and 2.5M vision-language pairs. SceneVerse contains 3D scenes curated from diverse existing datasets of both real and synthetic environments. Harnessing the power of 3D scene graphs and LLMs, we introduce an automated pipeline to generate comprehensive and high-quality language for both object-level and scene-level descriptions. We additionally incorporate the most extensive human-annotated object referrals to date, providing new training sources and benchmarks in this field.

Paper · Project · Code

@article{jia2024sceneverse,
  title={SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding},
  author={Jia, Baoxiong and Chen, Yixin and Yu, Huangyue and Wang, Yan and Niu, Xuesong and Liu, Tengyu and Li, Qing and Huang, Siyuan},
  journal={arXiv preprint arXiv:2401.09340},
  year={2024}
}
linukc commented 1 month ago

Provided Language Types

We list the available data in the current version of SceneVerse in the table below:

| Dataset | Object Caption | Scene Caption | Ref-Annotation | Ref-Pairwise (`rel2`) | Ref-MultiObject (`relm`) | Ref-Star (`star`) | Ref-Chain (`chain`, optional) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ScanNet | | | ScanRefer, Nr3D | | | | |
| MultiScan | | | | | | | |
| ARKitScenes | | | | | | | |
| HM3D | template | | | | | | |
| 3RScan | | | | | | | |
| Structured3D | template | | | | | | |
| ProcTHOR | template | ❌ | ❌ | template | template | template | |

For the generated object referrals, we provide both the direct template-based generations (`template`) and the LLM-refined versions (`gpt`). Please refer to our supplementary material for descriptions of the selected pairwise / multi-object / star types. We also provide the chain type, whose language uses object A to refer to B, and then B to refer to the target object C. As we found that the chain type can sometimes lead to unnatural descriptions, we did not discuss it in the main paper. Feel free to inspect it and use it in your projects.
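To illustrate how the referral types above might be handled in practice, here is a minimal sketch of filtering annotations by referral type and generation source. Note that this is **not** the official SceneVerse schema: the field names (`ref_type`, `source`, `utterance`), the type tags (`rel2`, `relm`, `star`, `chain`), and the example utterances are assumptions for illustration only; check the released data files for the actual format.

```python
# Hypothetical annotation records mimicking the referral types in the table.
# Field names and values are assumed, not the official SceneVerse schema.
annotations = [
    {"ref_type": "rel2",  "source": "template", "utterance": "the chair next to the table"},
    {"ref_type": "relm",  "source": "gpt",      "utterance": "the lamp between the bed and the desk"},
    {"ref_type": "star",  "source": "template", "utterance": "the pillow on the sofa that faces the TV"},
    {"ref_type": "chain", "source": "template", "utterance": "the book on the shelf beside the window"},
]

def filter_refs(records, ref_type=None, source=None):
    """Keep records matching the requested referral type and generation source."""
    return [
        r for r in records
        if (ref_type is None or r["ref_type"] == ref_type)
        and (source is None or r["source"] == source)
    ]

# Drop the optional chain type, as suggested for the main benchmarks.
no_chain = [r for r in annotations if r["ref_type"] != "chain"]
print(len(no_chain))  # 3
```

A filter like this also makes it easy to compare template-based generations against their `gpt`-refined counterparts, e.g. `filter_refs(annotations, source="gpt")`.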