Hello!
Thank you for your interest in our work! Here are the answers:
We use the max(e^R, e^t) error, i.e. the maximum of the rotation and translation errors, for binning the pose hypotheses. Hope this helps! Axel
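To make the binning step concrete, here is a minimal sketch of what such error-based binning and per-bin sampling could look like, assuming the per-hypothesis rotation and translation errors (in degrees) are already available; the equal-width bins, the bin count, and the per-bin sample size are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def bin_and_sample_hypotheses(e_R, e_t, n_bins=10, max_err=180.0, per_bin=1, rng=None):
    """Bin hypotheses by their combined pose error max(e_R, e_t) and
    sample uniformly from every non-empty bin (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    err = np.maximum(e_R, e_t)                     # combined pose error per hypothesis
    edges = np.linspace(0.0, max_err, n_bins + 1)  # equal-width error bins (assumption)
    bin_idx = np.clip(np.digitize(err, edges) - 1, 0, n_bins - 1)
    picked = []
    for b in range(n_bins):
        members = np.flatnonzero(bin_idx == b)
        if members.size:
            picked.extend(rng.choice(members, size=min(per_bin, members.size),
                                     replace=False))
    return np.asarray(picked)

# Example: a pool of 500 hypotheses with random rotation/translation errors.
rng = np.random.default_rng(0)
e_R, e_t = rng.uniform(0, 180, 500), rng.uniform(0, 180, 500)
print(bin_and_sample_hypotheses(e_R, e_t, n_bins=10, per_bin=2, rng=rng))
```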
Thank you very much for the swift reply. That clarifies it!
May I also ask for some clarification on a doubt I have regarding the validation splits:
In the paper it is mentioned that the validation splits from LoFTR are followed. If I'm not mistaken, LoFTR uses the 1500 test pairs from SuperGlue for ScanNet, and 1500 pairs sampled from “Sacre Coeur” and “St. Peter’s Square” for MegaDepth. However, as discussed in the appendix, SuperGlue (and likewise LoFTR) is trained on pairs with much higher visual overlap. So my understanding is that you draw your own image pair samples with visual overlap in [0.1, 0.4] for both ScanNet and MegaDepth.
My doubt is:
Thanks again! Fereidoon
Hi!
You are right, we use a "custom" training, validation, and test split with image pairs with little overlap (10%-40%). Here are some clarifications on how we generate the training, validation and test sets:
1) For the indoor datasets, we use the standard training, validation, and test splits, but we sample our own image pairs (a sketch of this overlap-based sampling is given after this list). If you are interested in the results on the standard ScanNet test set (the one used in SuperGlue), we report them in the supplementary material (Table 5). For the outdoor scenes, we use MegaDepth scenes for training and validation. We remove scenes that appear in the IMW benchmark and in the test set of the PhotoTourism dataset. As you noted, there are "only" two test scenes in MegaDepth, and hence we use the PhotoTourism test scenes as our test set. Further details are in Section 5.2.
2) For training and validation, we sample 90,000 and 30,000 image pairs, respectively. With our configuration, we did not see further improvements when increasing the dataset size.
3) As mentioned in answer 1), the training and validation sets come from different scenes than the test scenes we use to report the results in the paper tables.
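As a rough illustration of the overlap-based pair sampling mentioned in 1), here is a minimal sketch. It assumes a precomputed matrix of pairwise visual-overlap scores (how those scores are computed follows the respective dataset tooling); the [0.1, 0.4] range comes from this thread, while the matrix layout and the pair count are assumptions, not the exact procedure used for the paper.

```python
import numpy as np

def sample_pairs_by_overlap(overlap, lo=0.1, hi=0.4, n_pairs=1000, rng=None):
    """Draw image pairs whose visual-overlap score lies in [lo, hi].

    `overlap` is assumed to be a symmetric (N, N) matrix of pairwise
    overlap/covisibility scores for the N images of a scene."""
    rng = np.random.default_rng() if rng is None else rng
    i, j = np.triu_indices_from(overlap, k=1)            # unordered pairs, no self-pairs
    mask = (overlap[i, j] >= lo) & (overlap[i, j] <= hi)
    candidates = np.stack([i[mask], j[mask]], axis=1)
    take = min(n_pairs, len(candidates))
    return candidates[rng.choice(len(candidates), size=take, replace=False)]

# Example with a random symmetric overlap matrix for 50 images.
rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, (50, 50))
scores = (scores + scores.T) / 2
print(sample_pairs_by_overlap(scores, n_pairs=20, rng=rng).shape)  # (20, 2)
```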
Hope this helps, thanks! Axel
Ah I think I understand now. Thanks a ton for the clarification! I'll close the issue.
Cheers, Fereidoon
Hi there again,
If I may reopen the issue with another question about the data generation (please let me know if you prefer a channel other than GitHub, e.g. email, for such questions):
I have tried running the provided pretrained model on a set of test samples that I generated from the ScanNet test split, with pairs having a visual overlap score between 0.1 and 0.4. However, I don't observe the same performance when I compare the metrics that I compute to the ones reported in the paper. Specifically, I see noticeably worse performance in the translation component of the final solution after optimization (by ~5 degrees in median error, and by 0.05 in mAA).
As these metrics depend a lot on the underlying pool of hypotheses, I'm wondering whether you apply some additional filtering/preprocessing steps when producing the training/evaluation samples, for example to remove planar degenerate scenarios? I'm curious because in Figure 1 of the appendix I see the error distributions only up to 90 degrees, whereas hypothesis errors can in theory reach 180 degrees.
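For reference, this is roughly how I compute the two metrics from the per-pair pose errors (the mAA thresholds here are my assumption and may not match the ones used in the paper):

```python
import numpy as np

def pose_metrics(errors_deg, thresholds=(5, 10, 20)):
    """Median pose error and mean Average Accuracy (mAA) over a set of pairs.

    `errors_deg` holds one pose error per image pair (e.g. the max of the
    rotation and translation angular errors). mAA is taken as the mean,
    over the given thresholds, of the fraction of pairs whose error falls
    below each threshold; the threshold set is an assumption."""
    errors_deg = np.asarray(errors_deg, dtype=float)
    accs = [(errors_deg < th).mean() for th in thresholds]
    return float(np.median(errors_deg)), float(np.mean(accs))

med, maa = pose_metrics([2.0, 7.5, 12.0, 45.0, 90.0])
print(med, maa)  # 12.0 and mean of (0.2, 0.4, 0.6) = 0.4
```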
Thanks in advance!
Best, Fereidoon
Hey there,
Thanks for the follow-up question. Happy to have the conversation here; hopefully it is also useful for others.
Regarding the drop in performance, I can think of a few possible reasons. To sample hypotheses, we use the USAC framework (from OpenCV). We follow the very same pipeline as MAGSAC++, and hence we also use all the additional checks implemented within it. Besides that, all the hypotheses are refined, which proves to improve the accuracy of the computed poses. As a side note, to refine the poses we rely on the MAGSAC++ inliers.
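As a rough illustration (not our exact pipeline), the basic OpenCV calls for running MAGSAC++ and recovering R, t from its inliers look like the sketch below; pts1/pts2 are assumed to be matched keypoints (e.g. from SuperGlue or LoFTR) and the threshold values are placeholders.

```python
import cv2
import numpy as np

def estimate_pose_magsac(pts1, pts2, K, thresh_px=1.0):
    """Estimate an essential matrix with OpenCV's MAGSAC++ (USAC framework)
    and recover R, t via a cheirality check on the MAGSAC++ inliers."""
    E, inlier_mask = cv2.findEssentialMat(
        pts1, pts2, K, method=cv2.USAC_MAGSAC, prob=0.9999, threshold=thresh_px)
    if E is None:
        return None
    E = E[:3, :]  # findEssentialMat may stack several 3x3 candidates; keep the first
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inlier_mask.copy())
    return R, t, inlier_mask.ravel().astype(bool)
```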
Regarding the error distributions only reaching 90 degrees: that is only true for the translation error, and it is due to the sign ambiguity of the translation vector in the Essential/Fundamental matrix. See SuperGlue's angle_error_vec and compute_pose_error functions for more details on how to handle that.
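In the spirit of those SuperGlue functions, the ambiguity can be handled as in the small sketch below, which folds the translation-direction error so it never exceeds 90 degrees (the helper name is mine, not SuperGlue's).

```python
import numpy as np

def translation_angle_error_deg(t_est, t_gt):
    """Angular error between translation directions, folded to [0, 90] degrees.

    The essential matrix only determines t up to sign, so errors of x and
    180 - x degrees are equivalent; taking the minimum folds the distribution
    so it never exceeds 90 degrees."""
    cos = np.dot(t_est, t_gt) / (np.linalg.norm(t_est) * np.linalg.norm(t_gt))
    err = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return min(err, 180.0 - err)

print(translation_angle_error_deg(np.array([0.0, 0.0, 1.0]),
                                  np.array([0.0, 0.1, -1.0])))  # ~5.7, not ~174.3
```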
Please let me know if you have further questions. If you would find it useful, I could also add our test set to the repo, although it might take me a bit of time to clean up/prepare that data.
Thanks!
I'm closing this issue now since it has not had any activity for a while. Please feel free to reopen it if you have any other questions!
Thanks a lot!
Hi there,
Thank you for the interesting work. I have a couple of questions regarding batch generation for the purpose of training the pipeline.
In the paper it is mentioned that batches of 56 image pairs are used during training, and that 500 hypotheses are clustered into bins based on their pose error prior to sampling. I'd like to ask about this binning and sampling process:
Is the pose error used for binning the hypotheses the combined error max(e^R, e^t), i.e. the maximum of the rotation and translation errors? Thanks in advance! Fereidoon