mbanani / unsupervisedRR

[CVPR 2021 - Oral] UnsupervisedR&R: Unsupervised Point Cloud Registration via Differentiable Rendering
https://mbanani.github.io/unsupervisedrr
MIT License

The consumption of the ScanNet RGB-D video dataset #2

Closed (lzhnb closed this issue 3 years ago)

lzhnb commented 3 years ago

Thanks for your great work. Would you mind sharing the final storage footprint of the processed ScanNet RGB-D video dataset? I downloaded the .sens file of scene0000_00 and extracted it with the reader function provided by ScanNet, and that single scene takes over 5 GB. With more than 1000 training scenes, this leads to a very expensive storage cost. Given my hardware constraints, I need to know the final storage footprint of the processed dataset.

Also, the preprocessing seems quite slow. How long does the whole processing take?

Does this project use all ScanNet scenes as training examples, or only the 3 scenes listed in datasets.md in docs?

Thanks!

lzhnb commented 3 years ago

Furthermore, which SensReader code did you use? ScanNet provides two SensReader implementations, C++ and Python, and the Python one is very slow. I also found that the RGB outputs of the two differ: the difference is not visible to the eye, but the images are not equal in value.
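
For reference, this is the kind of value-level check I mean (the paths and filenames are placeholders):

# Sketch of the value-level comparison between the two readers' outputs.
import numpy as np
from PIL import Image

a = np.asarray(Image.open("cpp_reader/frame-000000.color.jpg"), dtype=np.int16)
b = np.asarray(Image.open("python_reader/frame-000000.color.jpg"), dtype=np.int16)
print("max abs diff:", np.abs(a - b).max(), "mean abs diff:", np.abs(a - b).mean())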

I have also raised this in ScanNet issue #93.

mbanani commented 3 years ago

Hi @lzhnb ,

Regarding the scenes, I extracted the full dataset; the 3 scenes provided in datasets.md are only there to explain the directory layout. I don't remember exactly how long the extraction took, but I remember having to parallelize it on a cluster to get it done quickly (and even then, it took several hours if I recall correctly).

As for space, ScanNet is quite large (~2 TB). You can get around this by resizing to 256x256 and sampling every 20th frame, which should reduce it by a factor of around 300.
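
If storage is the main constraint, something like the following rough, hypothetical helper (not part of this repo) can do that shrinking for one already-extracted scene. The filename pattern is an assumption (zero-padded names so that sorting matches frame order), and depth maps should be resized with nearest-neighbor interpolation instead of bilinear:

# Hypothetical helper: keep every `stride`-th color frame and resize it.
import os
from PIL import Image

def shrink_scene(src_dir, dst_dir, stride=20, size=(256, 256)):
    os.makedirs(dst_dir, exist_ok=True)
    # assumes zero-padded frame names so lexicographic sort matches frame order
    frames = sorted(f for f in os.listdir(src_dir) if f.endswith(".jpg"))
    for name in frames[::stride]:
        img = Image.open(os.path.join(src_dir, name))
        img.resize(size, Image.BILINEAR).save(os.path.join(dst_dir, name))

For intuition, going from 1296x968 to 256x256 cuts the pixel count by about 19x and keeping one frame in 20 adds another 20x, so the reduction is on that order; the exact saving depends on JPEG compression.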

Finally, as for the SensReader, I used the Python version. I am unaware of the discrepancy between the two readers; my guess is that it shouldn't matter much. If they're both writing JPEGs, could it be different compression levels? I'll note that response time on the ScanNet repo is often very slow (with a lot of posts never responded to), so keep that in mind.

Let me know if you have any other questions.

lzhnb commented 3 years ago

Thanks, @mbanani ,

I read the code and have a few more questions:

  1. When you say sampling every 20th frame, does that correspond to _C.DATASET.view_spacing = 20 in the config? If I subsample the frames as you suggested, should view_spacing then be set to 1?
  2. Does the resizing to 256x256 correspond to _C.DATASET.img_dim = 128? Did you run experiments with other image sizes?
  3. Finally, why do you use intrinsic_color as the K fed into grid_to_pointcloud to generate the point cloud, rather than intrinsic_depth? The RGB and depth images in ScanNet have different sizes: the RGB is (968, 1296) and the depth is (480, 640). In __getitem__, you first resize and crop the RGB to a square and reflect that transformation in K; then you crop the depth to a square and interpolate it to the image size. Do these two operations actually align with each other?

lzhnb commented 3 years ago

To confirm my assumption that K can be used to convert the depth map into a point cloud after the transformation in __getitem__: in your __getitem__, you adjust intrinsic_color according to the transformation applied to the RGB image (resize and crop). I applied the same operation to intrinsic_depth:

# -- Transform K to handle image resize and crop
K[0, 2] -= crop_offset  # handle cropped width
K[:2, :] *= self.image_dim / smaller_dim  # handle resizing

where crop_offset and smaller_dim are obtained from the depth image's parameters. If K (intrinsic_color) can be used to convert the depth map into a point cloud, the two intrinsic matrices should be equal. After the __getitem__ operation, the K from intrinsic_color is:

array([[154.14519299,   0.        ,  64.14478955,   0.        ],
       [  0.        , 154.14717845,  64.10115901,   0.        ],
       [  0.        ,   0.        ,   1.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   1.        ]])

And the K of intrinsic_depth is:

array([[153.510856  ,   0.        ,  64.07653813,   0.        ],
       [  0.        , 154.147168  ,  64.03376453,   0.        ],
       [  0.        ,   0.        ,   1.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   1.        ]])

They are slightly different, which is caused by the difference in aspect ratio (depth: 480/640 = 0.75, RGB: 968/1296 ≈ 0.747).
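
For completeness, here is a self-contained sketch of the comparison with made-up fx/fy/cx/cy values of roughly ScanNet-like magnitude (only the image sizes are the real defaults; this is not the repo code):

# Apply the same square-crop and resize update to a color-style and a
# depth-style intrinsic matrix, then compare them. Intrinsic values are made up.
import numpy as np

def adjust_intrinsics(K, height, width, image_dim=128):
    # Mirror the __getitem__ update: crop the width to a square, then rescale.
    K = K.astype(float)
    smaller_dim = min(height, width)
    crop_offset = (max(height, width) - smaller_dim) / 2
    K[0, 2] -= crop_offset                # handle cropped width
    K[:2, :] *= image_dim / smaller_dim   # handle resizing
    return K

K_color = np.array([[1165.0, 0.0, 648.0], [0.0, 1165.0, 484.0], [0.0, 0.0, 1.0]])
K_depth = np.array([[577.0, 0.0, 320.0], [0.0, 577.0, 240.0], [0.0, 0.0, 1.0]])

diff = np.abs(adjust_intrinsics(K_color, 968, 1296) - adjust_intrinsics(K_depth, 480, 640))
print(diff.max())

With these hypothetical values the largest entry-wise gap is well under one pixel at 128x128.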

Have you looked into the impact of such a subtle difference? Thanks!

mbanani commented 3 years ago

  1. View spacing only determines the spacing between the two frames of a sampled pair. Beyond that, you have the choice of sampling pairs starting from every frame ((0, 20), (1, 21), ...) or using strided sampling ((0, 20), (20, 40), ...); this is discussed here in the code. I was suggesting that you save only every 20th frame, which reduces storage. If you then use strided sampling with a view_spacing of 20, you will only be saving the frames that you would actually have considered (see the sketch after this list).

  2. Yes, you can go down to 128x128 if you want. My experiments in this paper were only done on 128x128.

  3. As you noted, the intrinsics are slightly different but very close to each other. I didn't inspect how much of a problem this causes. It would likely matter more for high-resolution rendering, but given that the difference is sub-pixel (less than 1 pixel at the resolution used in the experiments), I don't think it causes an issue here.
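
To make point 1 concrete, here is a small, hypothetical sketch of the two strategies (it is not the repo's dataset code):

# Hypothetical sketch of the two pair-sampling strategies for one scene.
def dense_pairs(num_frames, view_spacing=20):
    # every frame starts a pair: (0, 20), (1, 21), (2, 22), ...
    return [(i, i + view_spacing) for i in range(num_frames - view_spacing)]

def strided_pairs(num_frames, view_spacing=20):
    # only every view_spacing-th frame starts a pair: (0, 20), (20, 40), ...
    return [(i, i + view_spacing) for i in range(0, num_frames - view_spacing, view_spacing)]

print(dense_pairs(100)[:3])    # [(0, 20), (1, 21), (2, 22)]
print(strided_pairs(100)[:3])  # [(0, 20), (20, 40), (40, 60)]

Dense sampling gives roughly num_frames pairs per scene, while strided sampling gives roughly num_frames / view_spacing.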

Hope this addresses your concerns.

lzhnb commented 3 years ago

Thanks for the hint; I re-read the code.

I found that num_frames here mainly determines the size of the dataset. Assuming the total number of frames is N and strided here is set to True, the dataset size is around N / self.view_spacing; otherwise it is around N. Since strided = split in ["valid", "test"] here, the training dataset ends up very large (around the total number of frames), while the validation and test datasets are small (around view_spacing times smaller than the total number of frames). Is this reversed?


Besides, I am sorry if my first question was not very clear. As you said, this data preprocessing can save space:

As for space, ScanNet is quite large (~2 TB). You can get around this by resizing to 256x256 and sampling every 20th frame, which should reduce it by a factor of around 300.

If I perform "sampling every 20th frame" during data preprocessing, the view_spacing of the dataset should be set to 1, right? If view_spacing stays at 20, the two frames in each pair would actually be 400 frames apart.

lzhnb commented 3 years ago

Ok, I found the comment:

# The option to do strided frame pairs is to under sample Validation and Test
# sets since there's a huge number of frames to start with.

And this is in line with what you stated in the paper:

We generate view pairs by sampling image pairs that are 20 frames apart. We sample the training scenes more densely by sampling all pairs that are 20 frames apart. This results in 1594k/12.6k/26k ScanNet pairs.

This means that I need to read and unpack all training frames (from the .sens files) and cannot perform "sampling every 20th frame" on the training scenes, right?

mbanani commented 3 years ago

As you noted in your comments, the paper does sample the training dataset more densely. However, this requires a lot of space, so I was suggesting sampling the training set a bit more sparsely given your concerns about storage. It will probably result in a drop in performance; I am not sure how significant that drop will be, given that even after sampling every 20th frame, ScanNet is still quite large. You could also take a middle ground by extracting every 5th frame and sampling accordingly ((0, 20), (5, 25), ...).

Also, yes, if you generate every 20th frame, then view_spacing should be set to 1.
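
To spell out the arithmetic with a tiny hypothetical helper (not repo code):

# If frames are saved every `save_stride` original frames, the two frames in a
# pair end up save_stride * view_spacing original frames apart.
def pair_spacing_in_original_frames(save_stride, view_spacing):
    return save_stride * view_spacing

print(pair_spacing_in_original_frames(20, 20))  # 400: too far apart
print(pair_spacing_in_original_frames(20, 1))   # 20: matches the paper's spacing
print(pair_spacing_in_original_frames(5, 4))    # 20: the "every 5th frame" middle ground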

mbanani commented 3 years ago

I will go ahead and close the issue since this seems to be resolved. Please feel free to reopen it or submit another issue if you have any more questions.