MMScan is a multi-modal 3D scene dataset with hierarchical grounded language annotations, covering holistic aspects at both the object and region levels. It encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions, as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks. The annotation pipeline leverages powerful VLMs via carefully designed prompts to initialize the annotations efficiently, and further involves human correction in the loop to ensure the annotations are natural, correct, and comprehensive.
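As a rough illustration of how hierarchical grounded captions could be represented, the sketch below models object- and region-level annotation records in Python. All field names (`scan_id`, `level`, `target_id`, `caption`, `aspects`) are hypothetical and chosen for clarity only; they do not reflect MMScan's actual file format or API.

```python
# Minimal sketch of hierarchical grounded language annotations.
# All field names here are hypothetical; consult the MMScan release
# for the actual schema.
from dataclasses import dataclass
from typing import List


@dataclass
class GroundedCaption:
    scan_id: str        # which 3D scan the annotation belongs to
    level: str          # "object" or "region"
    target_id: int      # index of the annotated object / region in the scan
    caption: str        # natural-language description
    aspects: List[str]  # holistic aspects covered, e.g. shape, material, placement


def captions_at_level(captions: List[GroundedCaption], level: str) -> List[GroundedCaption]:
    """Filter annotations by hierarchy level ("object" or "region")."""
    return [c for c in captions if c.level == level]


if __name__ == "__main__":
    samples = [
        GroundedCaption("scene0000_00", "object", 12,
                        "A wooden chair with a curved backrest next to the desk.",
                        ["shape", "material", "placement"]),
        GroundedCaption("scene0000_00", "region", 3,
                        "A study corner containing a desk, a chair, and a lamp.",
                        ["function", "layout"]),
    ]
    print(len(captions_at_level(samples, "object")), "object-level captions")
```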
@inproceedings{mmscan,
  title={MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations},
  author={Lyu, Ruiyuan and Wang, Tai and Lin, Jingli and Yang, Shuai and Mao, Xiaohan and Chen, Yilun and Xu, Runsen and Huang, Haifeng and Zhu, Chenming and Lin, Dahua and Pang, Jiangmiao},
  year={2024},
  booktitle={arXiv},
}