linukc opened 1 month ago
We list the available data in the current version of SceneVerse in the table below:
| Dataset | Object Caption | Scene Caption | Ref-Annotation | Ref-Pairwise (`rel2`) | Ref-MultiObject (`relm`) | Ref-Star (`star`) | Ref-Chain (`chain`, optional) |
|---|---|---|---|---|---|---|---|
| ScanNet | ✅ | ✅ | ScanRefer, Nr3D | ✅ | ✅ | ✅ | ✅ |
| MultiScan | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| ARKitScenes | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| HM3D | `template` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| 3RScan | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| Structured3D | `template` | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ |
| ProcTHOR | `template` | ❌ | ❌ | `template` | `template` | `template` | ❌ |
For the generated object referrals, we provide both the direct template-based generations (`template`) and the LLM-refined versions (`gpt`).

Please refer to our supplementary material for descriptions of the selected pair-wise (`rel2`), multi-object (`relm`), and star (`star`) types. We also provide the chain (`chain`) type, which contains language that uses object A to refer to object B, and then B to refer to the target object C. As we found the chain type can sometimes lead to unnatural descriptions, we did not discuss it in the main paper. Feel free to inspect it and use it in your projects.
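As a minimal sketch of how one might work with these referral types, the snippet below filters generated referrals by their type tag. The entry layout and the `"type"`/`"utterance"` field names are assumptions for illustration, not SceneVerse's actual schema; the type names mirror the table's suffixes (`rel2`, `relm`, `star`, `chain`).

```python
# Hypothetical sketch: SceneVerse's real file layout and field names may differ.
# We assume each generated referral is a dict tagged with a "type" key that
# matches the table's suffixes: "rel2" (pair-wise), "relm" (multi-object),
# "star", and "chain".

def select_referrals(entries, types=("rel2", "relm", "star")):
    """Keep referrals whose type is in `types`.

    "chain" is excluded by default, since chain descriptions can
    sometimes read unnaturally.
    """
    return [e for e in entries if e.get("type") in types]

# Toy data invented for illustration (not real SceneVerse annotations):
entries = [
    {"type": "rel2", "utterance": "the chair next to the table"},
    {"type": "chain", "utterance": "the lamp behind the chair next to the table"},
]
kept = select_referrals(entries)
```

Running this on the toy data keeps only the `rel2` entry; passing `types=("rel2", "chain")` would keep both.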
We propose SceneVerse, the first million-scale 3D vision-language dataset with 68K 3D indoor scenes and 2.5M vision-language pairs. SceneVerse contains 3D scenes curated from diverse existing datasets of both real and synthetic environments. Harnessing the power of 3D scene graphs and LLMs, we introduce an automated pipeline to generate comprehensive and high-quality language for both object-level and scene-level descriptions. We additionally incorporate the most extensive human-annotated object referrals to date, providing new training sources and benchmarks in this field.
Paper Project Code