We propose two novel datasets, STRefer and LifeRefer, which focus on large-scale human-centric daily-life scenarios and come with abundant 3D object and natural-language annotations.

For STRefer, we uniformly sampled 662 scenes, spanning 65 minutes in total, from the original STCrowd dataset and annotated 5,458 natural-language descriptions for 3,581 subjects. A scene here is one frame of synchronized LiDAR point cloud and image; the content of each scene differs from the others because the capture location or time changes. We split the data into training and test sets at a 4:1 ratio without data leakage.

LifeRefer contains 25,380 natural-language descriptions for 11,864 subjects across 3,172 scenes, spanning 103 minutes in total. Similarly, we split it into 14,650 training samples and 10,730 test samples without data leakage.
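To make the leakage-free, scene-level split concrete, below is a minimal sketch in Python. It assumes each annotation is a dict carrying a `scene_id` field (the field name and data layout are illustrative, not the datasets' actual format); all descriptions from one scene land on the same side of the split, so no scene appears in both training and test sets.

```python
import random
from collections import defaultdict

def split_by_scene(annotations, train_ratio=0.8, seed=0):
    """Split annotations 4:1 at the scene level so that no scene
    contributes samples to both the training and test sets.

    `annotations` is assumed to be a list of dicts, each with a
    'scene_id' key -- a hypothetical layout for illustration only.
    """
    # Group annotations by scene so a scene is never split in two.
    by_scene = defaultdict(list)
    for ann in annotations:
        by_scene[ann["scene_id"]].append(ann)

    # Shuffle scene IDs deterministically, then cut at the ratio.
    scene_ids = sorted(by_scene)
    random.Random(seed).shuffle(scene_ids)
    n_train = int(len(scene_ids) * train_ratio)
    train_scenes = set(scene_ids[:n_train])

    # Gather annotations per side; each scene's samples stay together.
    train = [a for s in scene_ids if s in train_scenes for a in by_scene[s]]
    test = [a for s in scene_ids if s not in train_scenes for a in by_scene[s]]
    return train, test
```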
@article{lin2023wildrefer,
title={WildRefer: 3D Object Localization in Large-Scale Dynamic Scenes with Multi-Modal Visual Data and Natural Language},
author={Lin, Zhenxiang and Peng, Xidong and Cong, Peishan and Hou, Yuenan and Zhu, Xinge and Yang, Sibei and Ma, Yuexin},
journal={arXiv preprint arXiv:2304.05645},
year={2023}
}