mihaidusmanu / d2-net

D2-Net: A Trainable CNN for Joint Description and Detection of Local Features

Distances decreasing after every consecutive input #73

Closed Aniket-Gujarathi closed 3 years ago

Aniket-Gujarathi commented 3 years ago

Hello sir,

I am trying to train on my custom dataset (example) using D2-Net (10 epochs, learning rate 0.0003), but the loss keeps getting stuck at the margin.

  1. I figured out that this may be due to triplet collapse, and I also tried semi-hard negative mining instead of hard mining. However, in both cases the negative_distance, positive_distance, and distance_matrix in the loss keep decreasing rapidly and the loss gets stuck at the margin (plot, attached "distances" screenshot).
  2. In the above image, the first tensor is the negative_distance (in the loss), whose value falls rapidly after each input, and the second tensor is the positive_distance, which is also decreasing (a similar trend is seen in the distance_matrix). I fail to understand why these distances decrease after every consecutive input rather than showing the desired trend across epochs. Could you please help me understand this? For reference, a simplified sketch of the margin term I mean is below.
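This is not the exact D2-Net loss (which additionally weights each term by the soft-detection scores of the two keypoints), just an illustration of how collapsed descriptors pin the loss at the margin:

```python
import torch

# Simplified per-correspondence margin term (the real D2-Net loss also weights
# each term by the soft-detection scores of the two matched keypoints).
def margin_term(anchor, positive, hardest_negative, margin=1.0):
    d_pos = torch.norm(anchor - positive)            # distance to the matching descriptor
    d_neg = torch.norm(anchor - hardest_negative)    # distance to the mined negative
    return torch.relu(margin + d_pos - d_neg)

# If the network collapses to near-identical descriptors, d_pos ~ d_neg ~ 0,
# so every term saturates at the margin:
collapsed = torch.ones(512)
print(margin_term(collapsed, collapsed + 1e-4, collapsed - 1e-4))  # ~1.0 (the margin)
```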
mihaidusmanu commented 3 years ago

The scenes you are training on are extremely repetitive, so that's probably why triplet collapse happens - your negatives are not visual negatives but rather geometric negatives, which are hard to distinguish from image patches. Moreover, the RGB noise in the left image is much more accentuated than in the right image, and the rotation is way too high - I am not sure that can be handled by regular CNNs. Can you try training on less extreme viewpoints (10-20 degrees change maximum)?

As a sanity check, did you try running the --plot option of the training script to make sure that the matches used for training look good? Could you post some images?

Aniket-Gujarathi commented 3 years ago

Hi, thanks for the reply. However, I didn't understand what you meant by 'your negatives are not visual negatives but rather geometric negatives which are hard to distinguish from image patches'. Could you please elaborate on it?

I did use the --plot option while training and mostly got erroneous visualizations like the one here (example). Also, the matches do not seem correct after the training (example). The distances (positive and negative) seem to decrease monotonically to very small values after every image input, and the scores remain constant (at 0.002) for every keypoint.

Could you also please suggest a dataset other than MegaDepth (its size is too large for me to download) to verify the results?

mihaidusmanu commented 3 years ago

Visual negatives can be distinguished by looking at image patches directly while geometric negatives require knowledge of scene geometry to distinguish. A classic example of a geometric negative is a repetitive structure - e.g., looking at a patch around different bricks is not enough to say they are different since the patch contents are very similar.

Similarly, in your case, the image regions around your negatives are very similar to the ones around your positives, so the network can't learn to distinguish them. A network operating on single images only will have a lot of trouble distinguishing geometric negatives.

If the visualisations produced by --plot seem erroneous to you, then there's an issue in your annotations: the plotted correspondences do not come from the network but directly from the annotations you use for training. Moreover, I don't think that's how the annotations are supposed to look; the grid is way too dense given that the network downsamples to 1/8th of the resolution. As you can see in the following code snippet, for MegaDepth we only warp the grid locations, not all pixels in the image: https://github.com/mihaidusmanu/d2-net/blob/2a4d88fbe84961a3a17c46adb6d16a94b87020c5/lib/loss.py#L60-L69
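To illustrate what "only the grid locations" means, here is a minimal sketch (not the snippet linked above; the warp helper is hypothetical): keypoint positions form a regular grid matching the 1/8-resolution feature map, and only those positions are mapped into the second image.

```python
import torch

def grid_positions(h, w, stride=8):
    # One position per feature-map cell, i.e. roughly (h / 8) * (w / 8) points,
    # not one per pixel.
    ys = torch.arange(0, h, stride, dtype=torch.float32)
    xs = torch.arange(0, w, stride, dtype=torch.float32)
    gy, gx = torch.meshgrid(ys, xs, indexing='ij')
    return torch.stack([gy.flatten(), gx.flatten()], dim=0)   # (2, N), rows are (y, x)

pos1 = grid_positions(480, 640)    # 60 * 80 = 4800 grid points instead of 480 * 640 pixels
# pos2, ids = warp(pos1, ...)      # hypothetical warp step: only these grid points are
#                                  # mapped into image 2; ids tracks which ones survive
```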

The wrong matches in the end are normal since the network probably collapsed to a trivial solution.

Sadly, we only trained on MegaDepth - I am not aware of any other datasets that can be easily integrated in the pipeline.

Aniket-Gujarathi commented 3 years ago

Hi, thank you for the explanation of visual and geometric negatives. I checked the annotations, and the ground-truth correspondences seem correct during training (example); however, the score maps look bad (example): every keypoint gets the same soft-detection score (0.0002), and because the differences between the distances are so small, the loss gets stuck at the margin (see the attached score screenshot). Do you know what could be causing the scores to be constant as seen above? PS. I have tried other custom datasets as well, but even on those the scores come out as a constant value of 0.0002.

mihaidusmanu commented 3 years ago

As I explained above, I think the issue is the negative mining, which is way too hard for the network to converge to a meaningful solution on your dataset. I suspect that if you look at the descriptors, they will be very similar, pointing to triplet collapse.
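A quick way to check this (a rough sketch, not part of the repository) is to sample descriptors at the matched grid points and look at their pairwise distances; if the network has collapsed, they will all be tiny and nearly identical:

```python
import torch

def descriptor_spread(descriptors):
    # descriptors: (N, D) tensor of L2-normalised descriptors sampled at the
    # training correspondences.
    distances = torch.cdist(descriptors, descriptors)               # (N, N) pairwise L2 distances
    off_diag = distances[~torch.eye(len(descriptors), dtype=torch.bool)]
    return off_diag.mean().item(), off_diag.std().item()

# mean_d, std_d = descriptor_spread(desc1)   # desc1: descriptors at the matched points
# print(mean_d, std_d)                       # both close to 0 -> triplet collapse
```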

Aniket-Gujarathi commented 3 years ago

Thank you for your suggestions. I will try the changes you advised and see if the problem persists.

Aniket-Gujarathi commented 3 years ago

Hi, I trained D2-Net on the PhotoTourism dataset and it worked fine using the standard procedure (Results). However, instead of using the warp function, when I tried to generate the ground-truth correspondences manually (Link) on image pairs formed by rotating the source image by a small angle (0-2 degrees in-plane rotation), the loss still collapsed to the margin (Link). I provided the ids required for the positions by taking the row-major index of the valid points on the grid, as discussed in issue #72. As discussed before, geometric negatives in the image pairs should not be an issue for the PhotoTourism data. So what do you think could be the issue here?

mihaidusmanu commented 3 years ago

Please see my comment from above: "Moreover, I don't think that's how the annotations are supposed to look; the grid is way too dense given that the network uses downsampling to 1/8th the resolution. As you can see in the following code snippet, for MegaDepth, we only warp the grid locations, not all pixels in the image."

The "manual" annotations you are providing are not in the correct format. When training on MegaDepth data, the correspondences are associated with grid points in the left image https://drive.google.com/file/d/1PUd1lTcX4GUKodJDcIbv-ZqMJc_uog5p/view, while in your manual annotations https://drive.google.com/file/d/1wVpEBNqPNwoDxusiXT43zDCOScIqz-Iy/view the points in the left image seem random. You should warp the pos1 positions that are given as input to the warping function, not random points.

https://github.com/mihaidusmanu/d2-net/blob/2a4d88fbe84961a3a17c46adb6d16a94b87020c5/lib/loss.py#L60-L69
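Concretely, for your in-plane-rotation case the annotations would look roughly like the sketch below (my assumptions: a stride-8 grid, rotation about the image centre, and a hypothetical helper name): start from the grid points, warp them with the rotation, keep only the points that land inside the second image, and use the row-major index of each surviving grid point as its id.

```python
import math
import torch

def rotation_correspondences(h, w, angle_deg, stride=8):
    # Grid points in image 1 in row-major order (one per 1/8-resolution cell).
    ys = torch.arange(0, h, stride, dtype=torch.float32)
    xs = torch.arange(0, w, stride, dtype=torch.float32)
    gy, gx = torch.meshgrid(ys, xs, indexing='ij')
    pos1 = torch.stack([gy.flatten(), gx.flatten()], dim=0)    # (2, N), rows are (y, x)
    ids = torch.arange(pos1.shape[1])                          # row-major index of each grid point

    # Warp the grid points (and only these) with an in-plane rotation about the image centre.
    theta = math.radians(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    y, x = pos1[0] - cy, pos1[1] - cx
    pos2 = torch.stack([cy + x * math.sin(theta) + y * math.cos(theta),
                        cx + x * math.cos(theta) - y * math.sin(theta)], dim=0)

    # Keep only the correspondences that land inside image 2.
    valid = (pos2[0] >= 0) & (pos2[0] <= h - 1) & (pos2[1] >= 0) & (pos2[1] <= w - 1)
    return pos1[:, valid], pos2[:, valid], ids[valid]

# pos1, pos2, ids = rotation_correspondences(480, 640, angle_deg=2)
```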

Aniket-Gujarathi commented 3 years ago

Thank you, sir, for your reply. Yes, the random annotations were causing the issue. It is now working properly after putting the correspondences in the proper format, associated with the grid points.