scene-verse / SceneVerse

Official implementation of ECCV24 paper "SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding"
https://scene-verse.github.io
MIT License

Details of point cloud alignment and how to bring in custom datasets to the eval pipeline #14

Closed by zubair-irshad 1 month ago

zubair-irshad commented 1 month ago

Hi authors,

Great work! Can you please share more details about the point cloud alignment mentioned in this passage of your paper:

To ensure cohesion across various sources, we conduct preprocessing steps such as room segmentation, point subsampling, axis alignment, normalization, and semantic label alignment.

In essence, can you share a code snippet showing how to do this alignment for a new dataset? Does it also work in a completely different setting, i.e., can we align the point clouds in the same way for a robotics dataset or a table-top setting for zero-shot evaluation? Thanks in advance for your help!

yixchen commented 1 month ago

Hi, we have released instructions in preprocess/README.md and an example script for data preprocessing and alignment in preprocess/rscan.py. Note that zero-shot transfer to robotics or table-top settings may require fine-tuning, since the vision encoder and alignment are optimized for indoor scene domains. Effective generalization to those settings may not be straightforward.
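For readers who want a rough picture of the steps quoted from the paper (point subsampling, axis alignment, normalization), here is a minimal NumPy sketch. It is not the code in preprocess/rscan.py. The function names, the `max_points` default, the assumption that each scene provides a 4x4 axis-alignment matrix, and the normalization convention (x/y centered, floor at z = 0) are all illustrative assumptions.

```python
import numpy as np


def subsample(points, max_points, seed=0):
    """Randomly keep at most `max_points` rows (hypothetical helper)."""
    if len(points) <= max_points:
        return points
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=max_points, replace=False)
    return points[idx]


def axis_align(points, transform):
    """Apply a 4x4 homogeneous transform to the xyz columns only.

    Assumes `points` is (N, 3+C): xyz followed by optional features
    such as RGB, which are passed through unchanged.
    """
    xyz = points[:, :3]
    homo = np.concatenate([xyz, np.ones((len(xyz), 1))], axis=1)
    aligned = (homo @ transform.T)[:, :3]
    return np.concatenate([aligned, points[:, 3:]], axis=1)


def normalize(points):
    """Center x/y on the bounding-box middle and move the lowest point to z = 0.

    This convention is an assumption; other pipelines may center on the mean
    or rescale to a unit sphere instead.
    """
    out = points.copy()
    xyz = out[:, :3]
    lo, hi = xyz.min(axis=0), xyz.max(axis=0)
    offset = np.array([(lo[0] + hi[0]) / 2, (lo[1] + hi[1]) / 2, lo[2]])
    out[:, :3] = xyz - offset
    return out


def preprocess(points, transform, max_points=100_000):
    """Chain the three steps; `max_points` is an arbitrary illustrative value."""
    points = subsample(points, max_points)
    points = axis_align(points, transform)
    return normalize(points)
```

For a new dataset, `transform` would be whatever rigid transform brings the scene into a gravity-aligned frame with z up; if the data is already aligned, the identity matrix `np.eye(4)` works.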

zubair-irshad commented 1 month ago

Thanks so much. Closing this issue!

zubair-irshad commented 1 month ago

One quick question: can you please elaborate on what you mean by closed-vocabulary training? My understanding is that your pretraining is a general alignment between 3D and language, so is this step strictly necessary for pretraining?

This is optional for SceneVerse training but may be required for closed-vocab training

Buzz-Beater commented 1 month ago

Hi, closed-vocabulary training means using a model with a class prediction/classification head that is supervised by a fixed set of semantic labels (e.g., ScanNet 607). Since SceneVerse directly aligns 3D and language, there is no need for such a head, so you do not necessarily need to map your class labels to the 607 classes (you can use the original labels directly for alignment). However, if you want to use a classification head or models trained on ScanNet, then you probably want to map your classes to the 607.
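To make the distinction concrete, here is a minimal sketch of what mapping custom labels to a fixed vocabulary could look like. The four-class vocabulary, the synonym table, and the ignore index are toy assumptions for illustration, not the actual ScanNet 607 label list or any mapping shipped with the repository.

```python
# Toy stand-in for a fixed label set; the real list would have 607 entries.
FIXED_VOCAB = ["chair", "table", "sofa", "cabinet"]
CLASS_TO_ID = {name: i for i, name in enumerate(FIXED_VOCAB)}

# Hypothetical hand-written mapping from a custom dataset's labels
# to names in the fixed vocabulary.
CUSTOM_TO_FIXED = {"armchair": "chair", "desk": "table", "couch": "sofa"}

# Labels with no counterpart are marked so a classification loss can skip them.
IGNORE_ID = -100


def to_closed_vocab_ids(labels):
    """Map raw label strings to class ids for a classification head."""
    ids = []
    for label in labels:
        name = CUSTOM_TO_FIXED.get(label, label)
        ids.append(CLASS_TO_ID.get(name, IGNORE_ID))
    return ids


print(to_closed_vocab_ids(["armchair", "table", "mug"]))  # [0, 1, -100]
```

With language alignment alone, the raw strings ("armchair", "mug") can be used as text directly, which is why this mapping step is optional in that case.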

zubair-irshad commented 1 month ago

Thank you, your answer aligns with my understanding of your pretraining.