Aniket-Gujarathi closed this issue 3 years ago
The scenes you are training on are extremely repetitive, so that's probably why triplet collapse happens: your negatives are not visual negatives but rather geometric negatives, which are hard to distinguish from image patches alone. Moreover, the RGB noise in the left image is much more pronounced than in the right image, and the rotation is way too high; I'm not sure that can be handled by regular CNNs. Can you try training on less extreme viewpoints (10-20 degrees of change maximum)?
As a sanity check, did you try running the `--plot` option of the training script to make sure that the matches used for training look good? Could you post some images?
Hi, thanks for the reply. However, I didn't understand what you meant by "your negatives are not visual negatives but rather geometric negatives which are hard to distinguish from image patches". Could you please elaborate on it?
I did use the `--plot` option while training and mostly got erroneous visualizations like the one here (example). Also, the matches do not seem correct after training (example). The distances (positive and negative) seem to decrease monotonically to very small values with every input image, and the scores remain constant (at 0.002) for every keypoint.
Could you also suggest a dataset other than MegaDepth (that dataset is too large for me to download) to verify the results?
Visual negatives can be distinguished by looking at image patches directly while geometric negatives require knowledge of scene geometry to distinguish. A classic example of a geometric negative is a repetitive structure - e.g., looking at a patch around different bricks is not enough to say they are different since the patch contents are very similar.
Similarly, in your case, the image regions around your negatives are very similar to those around your positives, so the network can't learn to distinguish them. A network operating on single images alone will have a lot of trouble distinguishing geometric negatives.
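The brick example above can be made concrete with a toy sketch (illustrative only, not D2-Net code): two patches around *different* bricks of a repetitive facade are nearly identical in appearance, so any descriptor computed from the patch alone cannot separate them, even though geometrically they are a negative pair.

```python
import numpy as np

rng = np.random.default_rng(0)
brick = rng.random((8, 8))                      # texture of one brick
patch_a = brick                                 # patch around brick #1
patch_b = brick + rng.normal(0, 0.01, (8, 8))   # patch around brick #2 (near-identical texture)

def describe(p):
    """Toy patch descriptor: flattened, L2-normalized intensities."""
    v = p.ravel()
    return v / np.linalg.norm(v)

# Cosine similarity is ~1 even though the patches form a geometric negative pair.
sim = describe(patch_a) @ describe(patch_b)
print(sim)  # very close to 1.0
```

This is exactly the situation where the triplet loss cannot push the negative away from the anchor: the inputs themselves carry almost no distinguishing signal.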
If the visualisations produced by `--plot` seem erroneous to you, then there's an issue in your annotations. The plotted correspondences do not come from the network but directly from the annotations you use for training. Moreover, I don't think that's how the annotations are supposed to look; the grid is way too dense given that the network downsamples to 1/8th of the resolution. As you can see in the following code snippet, for MegaDepth, we only warp the grid locations, not all pixels in the image:
https://github.com/mihaidusmanu/d2-net/blob/2a4d88fbe84961a3a17c46adb6d16a94b87020c5/lib/loss.py#L60-L69
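To illustrate why the annotation grid should be sparse, here is a minimal sketch (assumed names and sizes, not the repo's exact API): only the coarse feature-map grid locations of the left image are enumerated and warped, matching the network's downsampling factor of 8, rather than every pixel.

```python
import numpy as np

# Hypothetical image size; stride 8 matches the 1/8th-resolution feature map.
h, w, stride = 256, 320, 8

# Grid point coordinates in the left image (one point per coarse cell).
ys, xs = np.meshgrid(
    np.arange(0, h, stride),
    np.arange(0, w, stride),
    indexing="ij",
)
grid_pos1 = np.stack([ys.ravel(), xs.ravel()], axis=0)

print(grid_pos1.shape)  # (2, (256 // 8) * (320 // 8)) = (2, 1280)
```

Only these `grid_pos1` locations would then be passed to the warping function; a dense per-pixel grid would be 64x larger and does not match what the loss expects.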
The wrong matches in the end are normal since the network probably collapsed to a trivial solution.
Sadly, we only trained on MegaDepth - I am not aware of any other datasets that can be easily integrated in the pipeline.
Hi,
Thank you for the explanation of visual and geometric negatives. I checked the annotations, and the ground-truth correspondences seem correct during training (example); however, the score maps look bad (example): every keypoint is getting the same soft-detection score (0.0002), and due to the small difference between the distances, the loss is getting stuck at the margin.
Do you know what could be causing the scores to be constant as seen above?
PS: I have tried different custom datasets as well, but even on those the scores come out as a constant value of 0.0002.
As I explained above, I think the issue is the negative mining, which is way too hard for the network to converge to a meaningful solution on your dataset. I suspect that if you look at the descriptors, they will be very similar, pointing to triplet collapse.
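A quick way to check for the suspected collapse is to sample descriptors from a few images and look at their mean pairwise cosine similarity; if it is close to 1, the network is outputting near-identical descriptors everywhere. A small diagnostic sketch (assumed shapes, not part of the repo):

```python
import numpy as np

def collapse_score(desc):
    """desc: (n, d) array of L2-normalized descriptors.
    Returns the mean off-diagonal cosine similarity; ~1.0 indicates collapse."""
    sims = desc @ desc.T
    n = len(desc)
    return sims[~np.eye(n, dtype=bool)].mean()

rng = np.random.default_rng(0)

# Collapsed case: every descriptor is (nearly) the same vector.
collapsed = np.tile(rng.random(128), (50, 1))
collapsed /= np.linalg.norm(collapsed, axis=1, keepdims=True)

# Healthy case: diverse descriptors.
healthy = rng.normal(size=(50, 128))
healthy /= np.linalg.norm(healthy, axis=1, keepdims=True)

print(collapse_score(collapsed))  # ~1.0
print(collapse_score(healthy))    # well below 1.0
```

If your trained network's descriptors score close to 1.0 here, that matches the constant soft-detection scores and the loss stuck at the margin.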
Thank you for your suggestions. I would try to make some changes like you advised and see if the problem still persists.
Hi, I trained D2-Net on the PhotoTourism dataset and it worked fine using the standard procedure (Results). However, instead of using the warp function, when I tried to generate the ground-truth correspondences manually (Link) on image pairs formed by rotating the source image by a small angle (0-2 degrees of in-plane rotation), the loss still collapsed to the margin (Link). I provided the ids required for the positions by taking the row-major index of the valid points on the grid, as discussed in issue #72. As discussed before, geometric negatives in the image pairs should not be a problem for the PhotoTourism images. So, what do you think could be the issue here?
Please see my comment from above: "Moreover, I don't think that's how the annotations are supposed to look; the grid is way too dense given that the network uses downsampling to 1/8th the resolution. As you can see in the following code snippet, for MegaDepth, we only warp the grid locations, not all pixels in the image."
The "manual" annotations you are providing are not in the correct format. When training on MegaDepth data, the correspondences are associated with grid points in the left image https://drive.google.com/file/d/1PUd1lTcX4GUKodJDcIbv-ZqMJc_uog5p/view, while in your manual annotations https://drive.google.com/file/d/1wVpEBNqPNwoDxusiXT43zDCOScIqz-Iy/view the points in the left image seem random. You should warp the pos1 positions that are given as input to the warping function, not random points.
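For the row-major ids mentioned earlier, a minimal sketch of the expected mapping (assumed sizes and names, for illustration): each left-image grid point's id is its row-major index on the coarse 1/8th-resolution grid, and the correspondences must be anchored at those grid points.

```python
# Hypothetical left-image size; stride 8 matches the coarse feature map.
h1, w1, stride = 256, 320, 8
gh, gw = h1 // stride, w1 // stride   # coarse grid: 32 x 40 cells

def grid_id(y, x):
    """Row-major id of the coarse grid cell containing pixel (y, x)."""
    return (y // stride) * gw + (x // stride)

# Example: the grid point at pixel (16, 24) lies in grid cell (2, 3).
print(grid_id(16, 24))  # 2 * 40 + 3 = 83
```

The annotation pipeline should first enumerate these grid points as pos1, warp them into the right image, and keep the ids of the points that remain valid; picking random left-image points breaks this indexing and the loss can no longer pair descriptors correctly.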
Thank you, sir, for your reply. Yes, the random annotations were causing the issue. It is now working properly after generating the correspondences in the proper format, associated with the grid points.
Hello sir,
I am trying to train D2-Net on my custom dataset (example) (10 epochs, learning rate 0.0003), but the loss keeps getting stuck at the margin.