mbanani / unsupervisedRR

[CVPR 2021 - Oral] UnsupervisedR&R: Unsupervised Point Cloud Registration via Differentiable Rendering
https://mbanani.github.io/unsupervisedrr
MIT License

Questions about the implementation #9

Closed VladimirYugay closed 7 months ago

VladimirYugay commented 1 year ago

Hey there. Thanks for such a well-documented code.

  1. In get_correspondences, what is the need for the second dimension (P, Q) of the input tensors (N x P x F) and (N x Q x F)? Does it correspond to the number of correspondences? If so, why is the shape of the 3D points (N, P, 3) instead of (P, 3)?

  2. Somewhat related to the first question, what is the point of optimizing w.r.t. two different sets of correspondences, P and Q? While they might not be bijective, wouldn't optimizing them eventually lead to a bijective mapping? If so, wouldn't it be "easier"/"more efficient" to keep only 1-to-1 correspondences?

  3. Here, pts_ref FloatTensor (N x C x 3) reference points: is N the batch size and C the number of points?

mbanani commented 1 year ago

Hi Vladimir! Thank you for your interest, hope this answers your questions:

  1. This function does a batched computation of the correspondence estimation, so N is the number of instances in the batch and P and Q are the numbers of points in the source and target point clouds, respectively. Sorry if the comments were confusing. (See the first sketch after this list for a concrete version of the shapes.)

  2. You are correct that the final set of estimated correspondences should be a bijection; however, the initial set is noisy, so some points do not have a correspondence. Another issue is that most depth maps are incomplete (with missing values), so some points are invalid. To simplify processing and allow for homogeneous batches (with all point clouds having the same size), I keep all points: invalid depth points are simply assigned a value of 0, and I can then filter them out here. You could do heterogeneous processing, but I kept it simple (while retaining P and Q as different variables to remind myself of that). In the end, it is simpler to estimate a correspondence for each point and let the network give it a very low weight score so that it gets filtered out later on; the first sketch below illustrates this masking.

  3. Yes, sorry for the confusing comments here as well; it seems I mixed up N and batch_size in them. The function takes as input a batch of correspondences and, in the middle, does a randomized estimation (similar in spirit to RANSAC) where I also use N to refer to the number of subsets. You can also check a different version of the same idea in this function in my follow-up work, SyncMatch, where we basically do RANSAC instead. (The second sketch below gives a rough picture.)
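To make the shapes in points 1 and 2 concrete, here is a minimal sketch of what a batched correspondence step like this might look like. The function name, the nearest-neighbor matching, and the weighting scheme are illustrative assumptions, not the repo's actual code:

```python
import torch

def get_correspondences_sketch(feats_0, feats_1, pts_0, pts_1):
    """Illustrative batched correspondence estimation.

    feats_0: (N, P, F) source features    feats_1: (N, Q, F) target features
    pts_0:   (N, P, 3) source points      pts_1:   (N, Q, 3) target points
    Points with invalid depth are assumed to have been set to 0.
    """
    # (N, P, Q) pairwise feature distances, computed per batch instance
    dists = torch.cdist(feats_0, feats_1)

    # For each source point, its nearest target point in feature space
    nn_dist, nn_idx = dists.min(dim=2)                                     # (N, P)
    nn_pts = torch.gather(pts_1, 1, nn_idx[:, :, None].expand(-1, -1, 3))  # (N, P, 3)

    # Weight each correspondence; all-zero (invalid-depth) points get weight 0,
    # so a downstream weighted Procrustes solve effectively ignores them.
    valid = (pts_0.abs().sum(dim=2) > 0).float()                           # (N, P)
    weights = valid / (1.0 + nn_dist)

    return pts_0, nn_pts, weights
```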
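And a rough sketch of the randomized estimation from point 3, assuming a closed-form Kabsch solver per subset and inlier counting as the score. Both choices are my assumptions for illustration, not necessarily what the repo or SyncMatch does:

```python
import torch

def kabsch(src, tgt):
    """Closed-form rigid alignment (Kabsch) of src onto tgt, both (K, 3)."""
    src_c = src - src.mean(dim=0, keepdim=True)
    tgt_c = tgt - tgt.mean(dim=0, keepdim=True)
    U, _, Vt = torch.linalg.svd(src_c.T @ tgt_c)
    # Correct for a possible reflection so R is a proper rotation
    d = torch.sign(torch.linalg.det(Vt.T @ U.T)).item()
    R = Vt.T @ torch.diag(torch.tensor([1.0, 1.0, d])) @ U.T
    t = tgt.mean(dim=0) - R @ src.mean(dim=0)
    return R, t

def ransac_like(src, tgt, num_subsets=128, subset_size=16, thresh=0.05):
    """Solve pose on random correspondence subsets; keep the best by inliers."""
    P = src.shape[0]
    best_R, best_t, best_inliers = None, None, -1
    for _ in range(num_subsets):
        idx = torch.randperm(P)[:subset_size]
        R, t = kabsch(src[idx], tgt[idx])
        residuals = ((src @ R.T + t) - tgt).norm(dim=1)
        inliers = (residuals < thresh).sum().item()
        if inliers > best_inliers:
            best_R, best_t, best_inliers = R, t, inliers
    return best_R, best_t
```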

Hope this helps and let me know if you have any more questions!

VladimirYugay commented 1 year ago

Thanks for such a quick response. Yes, makes sense.

Another question is about the comparison against RANSAC + SuperPoint. Is it right that you're using features extracted from the 2D image, not the point cloud? It seems that SuperPoint works on images only.

Also, where does the method stand compared to something older like FPFH?

mbanani commented 1 year ago

My pleasure, glad I could help.

Regarding SuperPoint: yes, I extracted the features from the image and then lifted the keypoints to 3D using depth and intrinsics. This was the same for all the image-based baselines.
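The lifting step is standard pinhole backprojection; a minimal sketch, with the function name and conventions being illustrative rather than the actual baseline code:

```python
import torch

def lift_keypoints(kps, depth, K):
    """Backproject 2D keypoints to 3D camera coordinates.

    kps:   (M, 2) pixel coordinates (u, v)
    depth: (H, W) depth map in metres
    K:     (3, 3) camera intrinsics
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = kps[:, 0], kps[:, 1]
    d = depth[v.long(), u.long()]          # sample depth at each keypoint
    x = (u - cx) * d / fx
    y = (v - cy) * d / fy
    return torch.stack([x, y, d], dim=1)   # (M, 3); d == 0 marks missing depth
```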

For FPFH, I compared against it in BYOC, which also learned geometric features from the uncolored point clouds as well as visual features from the image. It's also a simpler pipeline than this paper's, since it only relied on a 3D correspondence loss, while this paper used both a photometric rendering-based loss and a 3D correspondence loss.
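For context, a classical FPFH baseline can be computed with Open3D along these lines. This is a generic sketch, not the evaluation code used in BYOC, and the voxel size and search radii are placeholder values:

```python
import open3d as o3d

def compute_fpfh(points, voxel_size=0.05):
    """Classical FPFH descriptors on an uncolored point cloud (Open3D).

    points: (P, 3) numpy array of 3D coordinates.
    """
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    pcd = pcd.voxel_down_sample(voxel_size)
    # FPFH needs normals; estimate them from a local neighborhood
    pcd.estimate_normals(
        o3d.geometry.KDTreeSearchParamHybrid(radius=2 * voxel_size, max_nn=30))
    fpfh = o3d.pipelines.registration.compute_fpfh_feature(
        pcd,
        o3d.geometry.KDTreeSearchParamHybrid(radius=5 * voxel_size, max_nn=100))
    return pcd, fpfh  # fpfh.data has shape (33, num_points)
```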