stefan-ainetter / SCANnotate

Repository for WACV23 paper "Automatically Annotating Indoor Images with CAD Models via RGB-D Scans"
GNU General Public License v3.0

ScanNet Preprocessing Scene #3

Closed claragarciamoll closed 1 year ago

claragarciamoll commented 1 year ago

Hello,

Thanks for answering me and uploading the code so promptly.

Currently, I am working on detecting instances and replacing them with CAD models. So far, I have detected the instances with the Mask3D algorithm on our own point clouds, which were scanned and reconstructed with BundleFusion, similar to the ScanNet dataset.

Thus, my question is: how can I create the ScanNet information (the 'extracted' and 'preprocessed' folders) for our own point clouds, replicating the same structure?

Many thanks for considering my request.

stefan-ainetter commented 1 year ago

Hi,

here are some details about the preprocessing:

The 'extracted' folder contains the captured RGB-D data and the corresponding camera poses and intrinsics. You should have these data available for your example, right?

Regarding the data in the 'preprocessed' folder:

The provided '.ply' files are only for visualization purposes. The important part for CAD retrieval is the '.pkl' file, which contains information about the target objects in the scene. The data in the '.pkl' file are saved as a ScanNetScene object. I suggest running the provided example and looking at how the preprocessed ScanNet data are structured. You then have to re-create the same data structure using your own data.
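For a quick look at what is stored, something like the following can be used (an untested sketch; the file name is a placeholder and the actual attributes are defined by the ScanNetScene class in this repository):

```python
import pickle

# The ScanNetScene class from this repository must be importable,
# otherwise pickle cannot reconstruct the object.
with open('scene0003_00.pkl', 'rb') as f:  # placeholder file name
    scene = pickle.load(f)

# Inspect which attributes the object carries before re-creating the
# same structure for your own data.
print(type(scene))
print(vars(scene).keys())
```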

Basically, you have to perform the following steps for preprocessing your data:

1) You already have the 3D semantic instance segmentation from Mask3D. You can use it to calculate a 3D bounding box for each target object point cloud, e.g. using Trimesh to extract a 3D bounding box, see here.
2) You can then calculate center, scale and basis vectors for the 3D boxes, which are later used to transform the normalized ShapeNet CAD models into the position of the target object (see the sketch after this list).
3) You have to choose which camera views should be used for each object, and save this data in the view_params dict. We did this by reprojecting the target object into the 2D images and selecting the frames where the target object is visible. But you can also do this by manually selecting suitable frames.
4) Save all data as a ScanNetScene object, same as in the provided example.
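As a rough sketch of steps 1) and 2), assuming each instance point cloud is available as an (N, 3) NumPy array (the function name and the box parametrization below are illustrative; check the provided example to see exactly how the values are expected to be stored):

```python
import numpy as np
import trimesh

def oriented_bbox_from_points(points):
    """Compute center, basis vectors and scale of an oriented 3D box
    for one target object point cloud (points: (N, 3) float array)."""
    # to_origin moves the points into an origin-centered, axis-aligned frame;
    # extents are the box side lengths in that frame.
    to_origin, extents = trimesh.bounds.oriented_bounds(points)
    box_to_world = np.linalg.inv(to_origin)
    center = box_to_world[:3, 3]   # box center in world coordinates
    basis = box_to_world[:3, :3]   # columns are the box basis vectors
    scale = extents                # side lengths along the box axes
    return center, basis, scale
```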

NOTE that all preprocessed data have to be transformed into the PyTorch3D coordinate system, similar to the '.ply' files in the 'preprocessed' folder.
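The exact transform depends on your capture setup. As an illustration only, if your world frame is Z-up and the target frame is Y-up, a rotation about the X axis could look like the sketch below; verify the convention against the provided '.ply' files before relying on it.

```python
import numpy as np

# Assumption: the scene is Z-up and the target frame is Y-up.
# This rotation maps (x, y, z) -> (x, z, -y); check against the
# '.ply' files shipped with the example before using it.
R_ZUP_TO_YUP = np.array([[1.0, 0.0, 0.0],
                         [0.0, 0.0, 1.0],
                         [0.0, -1.0, 0.0]])

def to_yup_frame(points):
    """Rotate an (N, 3) array of points into the assumed Y-up frame."""
    return points @ R_ZUP_TO_YUP.T
```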

Hope this information is helpful; otherwise, let me know if you need more details.

Also, if you are able to share an example of your data, I might be able to give a more detailed explanation for your specific use case.

claragarciamoll commented 1 year ago

Thanks for this helpful information. Most of my doubts are resolved. In our case, we are working on virtualizing an indoor scene without user intervention. Concretely, our scenes are aligned to the ScanNet coordinate system. The following images show an example of our scenes:

scene_view_2

scene_view_1

Next, I show the coordinate system used; height is represented by the z axis:

scene_scannet_CS

So far, we have the instance information, as I mentioned previously (I attach an example).

inference_view_2 inference_view_1

Thus, I can obtain the bounding box information for each instance (example below).

scene_bbox

We also have layout information, and now we want to obtain the mapping to CAD models and the alignment between each instance and its CAD model. As I mentioned, most of the required information (the pickle variables) is now clear, but we still need clarification about the view_params dict. As I said at the beginning, we don't want user intervention, so selecting the 2D images where the target object appears must be done automatically. Do you have any method to detect whether the target object is visible in a 2D image? If not, how do you think this issue could be solved? Could it be handled with a detector or a classifier? Related to that problem, other questions came up:

Finally, here is a WeTransfer link where you can download our example scene (reconstruction: room_0005_00_detection.ply, instance information: room.ply, and the .sens file): here

Thanks for your help.

stefan-ainetter commented 1 year ago

Regarding your questions:

1.) It is not required to use all 2D images. Using more images increases the chance of finding the most suitable CAD model; the maximum number is mainly limited by GPU memory. In general, something between 6 and 20 frames should lead to sufficient results.

2.) Ideally, the camera poses of the selected images cover all 3D points of the target object. However, as stated in the paper, we did the following in our experiments (a rough sketch of this selection is included after point 3 below):

"We select a number NT of frames from all frames of the RGB-D scan, by selecting the frames where the target object is in the field of view according to its 3D bounding box and then regularly sample these frames."

3.) Different hardware should not be an issue.
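To illustrate the frame selection from 2.), here is a hedged sketch (not the code used for the paper): it assumes pinhole intrinsics K (3x3) and a list of world-to-camera extrinsics, and the visibility threshold and number of frames are arbitrary choices.

```python
import numpy as np

def bbox_visible(corners_world, K, world_to_cam, img_w, img_h, min_corners=4):
    """Return True if enough of the 8 box corners project inside the image."""
    # Transform corners into the camera frame.
    corners_h = np.concatenate([corners_world, np.ones((8, 1))], axis=1)
    cam = (world_to_cam @ corners_h.T).T[:, :3]
    in_front = cam[:, 2] > 0
    # Project with pinhole intrinsics and check image bounds.
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < img_w) & \
             (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    return int(np.sum(in_front & inside)) >= min_corners

def select_frames(corners_world, poses, K, img_w, img_h, num_frames=10):
    """Keep frames where the box is visible, then regularly sample num_frames."""
    visible = [i for i, w2c in enumerate(poses)
               if bbox_visible(corners_world, K, w2c, img_w, img_h)]
    if len(visible) <= num_frames:
        return visible
    idx = np.linspace(0, len(visible) - 1, num_frames).round().astype(int)
    return [visible[i] for i in idx]
```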

About your data: thank you for sharing. I briefly looked at it, and I believe it should be possible to preprocess it. If you need some help with preprocessing your data (e.g. for automatically selecting suitable 2D images), you can contact me via email (stefan.ainetter@icg.tugraz.at) and I can share some code with you that might be helpful.

florianlanger commented 1 year ago

@claragarciamoll do you mind sharing the code you used for preprocessing the scenes? It would be super useful for me. Thank you!

RongkunYang commented 1 month ago

Excuse me, may I ask how to obtain the oriented bounding box in the ScanNetv2 dataset?