rpautrat / SuperPoint

Efficient neural feature detector and descriptor
MIT License

Understanding architecture #95

Closed UCRajkumar closed 4 years ago

UCRajkumar commented 5 years ago

I'm having a bit of trouble understanding the architecture for this project.

  1. An input image of some size HxW is provided at the start of the network. In the demo, it seems that we can provide a different HxW for the input image. How exactly is this possible? If the network was originally trained on some fixed H and W, wouldn't new images, regardless of their size, have to be reshaped to that same H and W so that the weights can properly attach to each pixel? But in the code, the opposite seems to be happening. Can you please clarify how this is possible?
  2. What exactly are the dimensions and values of the ground truths? For instance, say that for the synthetic shapes you provide a square as the input image. Is the ground truth then a binary semantic segmentation map, or a single classification label? I'm really confused about this part.
  3. Likewise, what exactly are the dimensions and values of the ground truth of the descriptors (e.g. for MS-COCO).
  4. Lastly, what exactly is happening in this snippet?
rpautrat commented 5 years ago
  1. This network is based on convolutions, so it can work with any image size. In practice, it is better if the dimensions are multiples of 8 (because this method relies on 8x8 patches in the image), but it would still work for other dimensions. The training is performed at multiple resolutions so that the network can generalize to details at various scales. So the size is totally flexible.

  2. The ground truth is a binary image of the same size as the input image with 1s at the locations of interest points and 0s elsewhere.

  3. There is no ground truth for the descriptors, please read the original article: https://arxiv.org/pdf/1712.07629.pdf

  4. They are performing a bilinear interpolation here to go from a coarse descriptor map of size H/8 x W/8 x 256 to a full-resolution descriptor map of size H x W x 256. In the original article they were actually using bicubic interpolation, but bilinear turned out to be good enough and faster.
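As a rough sketch of this step (using TensorFlow's public resizing API rather than the repository's exact code, with made-up tensor values):

```python
import tensorflow as tf

# Coarse descriptors: batch x H/8 x W/8 x 256 (random values, just for illustration)
coarse = tf.random.normal([1, 30, 40, 256])   # e.g. for a 240x320 input image

# Bilinear interpolation back to full resolution, then L2-normalize each pixel's descriptor
full = tf.image.resize(coarse, size=[240, 320], method='bilinear')
full = tf.nn.l2_normalize(full, axis=-1)
print(full.shape)  # (1, 240, 320, 256)
```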

UCRajkumar commented 5 years ago

Regarding point 3, I see that you used a hinge loss. But if I'm following correctly, the following can happen. Say, for the sake of example, that the descriptor has only 5 values instead of 256, that the descriptor for image 1 has values (0.1, 0.1, 0.1, 0.1, 0.1), and that the descriptor for image 2 is exactly the same. These two locations do in fact correspond, i.e. $s=1$. In that case, according to your hinge loss, $l_d = \lambda_d \cdot s \cdot \max(0, m_p - d^T d') = 250 \cdot 1 \cdot \max(0, 1 - 0.05) = 237.5$, which is quite high. Shouldn't the loss in this situation be 0? Or are you trying to maximize $d^T d'$?

rpautrat commented 5 years ago

In this formula, the descriptors are supposed to be L2-normalized, so that if the descriptors are the same, their scalar product will be 1 and the loss 0.

However, I admit that in the current code the loss is computed with the downsampled descriptors (on the H_c x W_c image), which are not normalized. So you are right that this may be an issue here. Thanks for pointing this out, I will definitely try to retrain while first normalizing the descriptors that are used in the computation of the loss.
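To make the issue concrete, here is a small sketch (my own, not the repository's descriptor_loss) of the positive part of the hinge loss with and without L2 normalization, reusing the 5-dimensional example above:

```python
import numpy as np

lambda_d, m_p = 250.0, 1.0
d = np.full(5, 0.1)
d_prime = np.full(5, 0.1)  # identical descriptors, so s = 1

# Without normalization: the dot product is only 0.05, so the loss is large despite a perfect match
loss_raw = lambda_d * max(0.0, m_p - d @ d_prime)
print(loss_raw)  # ~237.5

# With L2 normalization: the dot product is 1 and the loss vanishes, as expected
dn = d / np.linalg.norm(d)
dpn = d_prime / np.linalg.norm(d_prime)
loss_norm = lambda_d * max(0.0, m_p - dn @ dpn)
print(loss_norm)  # ~0 (up to floating-point error)
```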

UCRajkumar commented 5 years ago

Thank you, please update this post once you have uploaded the new training. I'm keen to know how the results turned out.

Another question: I don't quite understand the intuition behind using softmax in the interest point decoder. Why not use sigmoid? Why explicitly softmax? The only logical explanation I can think of is that when you reshape a cell into an 8x8 patch, you are forcing all the pixels in that region to sum up to 1. But can't this cause multiple keypoints to be detected in the same cell?

If the above is not the right interpretation, then the question is why use softmax?

UCRajkumar commented 5 years ago

Similarly, why is there an extra "dustbin" interest point channel? Or rather, why do you assume that one specific channel means there are no interest points? Why not just go from 256 directly to 64, instead of going to 65 and then dropping one channel to get 64?

rpautrat commented 5 years ago

Your assumption is correct, the softmax is here a way to normalize the responses within an 8x8 patch so that they sum to 1. Most of the time this should prevent multiple detections in the same patch, since the response is spread among the 65 bins and the detection threshold is usually 1/65 or even less. But it still allows multiple interest points in the same patch, which is sometimes beneficial.

The dustbin channel is necessary when there should be no interest point in the patch (e.g. a textureless uniform patch). Without it, the softmax would spread the responses across the 64 bins and at least one of them would be selected as a keypoint, because its response would be higher than the detection threshold. So ideally the network learns that, when there is no keypoint, it can put all of the softmax response in the 65th (dustbin) bin and ignore the other 64 bins.
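A toy calculation (my own illustration, with arbitrary logits and a 1/65 detection threshold) shows the effect:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

threshold = 1 / 65

# Textureless patch, no dustbin: the response is spread uniformly over 64 bins
no_dustbin = softmax(np.zeros(64))
print(no_dustbin.max() > threshold)          # True: a spurious keypoint would fire

# Same patch with a dustbin bin that absorbs most of the response
with_dustbin = softmax(np.concatenate([np.zeros(64), [5.0]]))
print(with_dustbin[:64].max() > threshold)   # False: no keypoint fires
```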

UCRajkumar commented 5 years ago

Ah I see, the first point definitely makes sense. However, for the second point, I'm a bit confused by the final sentence. You're saying "it can put all the responses". What exactly is meant by this? As I understand it, you're applying softmax on 65 feature maps, causing all the values along the depth dimension to sum up to 1.

Now, the 65th bin contains the response for the case where there is no keypoint. Following this logic, the other 64 bins contain 64 potential keypoints. How do you know which bin is the "dustbin" in order to properly remove it?

UCRajkumar commented 5 years ago

Based on the correspondence formula, I want to make sure I properly understand the criteria for s_hwh'w' to equal 1 or 0. So, if the absolute distance between the center pixels of two cells (8x8 patches) is less than 8 pixels, then these locations are considered positive correspondences, and all other pairs are considered negative correspondences, is that correct? In essence, only the cells directly above, to the left, to the right, and below are considered positive correspondences. Was that the idea?

rpautrat commented 5 years ago

Yes, the responses across the 65 values in the channel dimension should sum to 1. The first 64 are indeed potential keypoints (once reshaped into an 8x8 patch), and the dustbin is always the last value in the channel dimension, i.e. the 65th bin. So you always remove the same bin, the 65th one. For example, when there is no keypoint in the patch, the values of the first 64 bins should ideally be 0 and the 65th should equal 1.
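A rough NumPy sketch of this decoding step (mine, with hypothetical variable names, not the repository's exact code): softmax over the 65 channels, drop the dustbin, and tile the remaining 64 values back into 8x8 pixel blocks.

```python
import numpy as np

def decode_detector_head(logits):
    """logits: (Hc, Wc, 65) raw detector output for a single image."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)   # softmax over the 65 bins
    probs = probs[:, :, :64]                    # drop the 65th (dustbin) channel
    hc, wc, _ = probs.shape
    # Each of the 64 remaining channels is one pixel of an 8x8 patch
    return probs.reshape(hc, wc, 8, 8).transpose(0, 2, 1, 3).reshape(hc * 8, wc * 8)

heatmap = decode_detector_head(np.random.randn(30, 40, 65))
print(heatmap.shape)  # (240, 320)
```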

As for the quantity s_hwh'w', your understanding is correct. However, I made a slight change here with respect to the original article: when the homography between the two images is the identity, we don't want the center of the upper/left/right/lower cell to be a positive match with the current center. They should be considered negative in my opinion, and only the center of the current cell should be a positive match between the two images. That is why I reduced the pixel threshold from 8 to a bit less (7.5 in practice, I think), so that centers of cells at exactly a distance of 8 are not considered positive. This improved the results in a noticeable way.

UCRajkumar commented 5 years ago

Ah, that makes a lot of sense, thank you. And just to be clear, if there are 20 "cells" and I pick some center cell, assuming no identity homography, then I will have 5 positive correspondences and 15 negative correspondences. Is that right? If so, isn't it a terribly expensive operation to do a combinatorial comparison of one cell with every other cell?

Some other differences I noticed are:

  1. lambda_d in your implementation equals 800 in your config file, instead of 250 as in the paper. Can you please clarify this?
  2. What exactly does the top_k (in the config) variable do?
  3. I see an evaluations file in which you are computing repeatability and homography estimation. But I want to make sure I'm interpreting the code in the intended way. If you could provide an explanation of the two metrics, either with equations or a detailed description, that would be really helpful!
rpautrat commented 5 years ago

Technically, you should have only one single positive match per cell, even without an identity homography. That is, 1 positive and 19 negative correspondences in your example. This is because the centers of cells are spaced by 8 pixels and the threshold to consider a correspondence positive is also 8 (or slightly less, as I explained above). We really want the descriptor to be close only to the exact same patch in the warped image, and not to the neighboring patches.

Indeed, checking every pair of such cell centers represents a lot of combinations. However, we don't do it sequentially: in practice this check is vectorized in the code and is a single huge matrix operation in Tensorflow. The main problem is that we have to load this huge matrix of pairwise distances between all centers into GPU memory, and that is why this code needs so much GPU memory.
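For intuition, a hedged NumPy sketch of that vectorized check (not the actual TensorFlow code; the function name and layout are my own): compute all cell centers, warp them with the homography, and compare all pairwise distances to the 7.5-pixel threshold in one matrix operation.

```python
import numpy as np

def correspondence_mask(hc, wc, H, threshold=7.5):
    """s of shape (hc*wc, hc*wc): 1 where a cell of image 1 matches a cell of image 2."""
    ys, xs = np.meshgrid(np.arange(hc), np.arange(wc), indexing='ij')
    centers = np.stack([xs, ys], axis=-1).reshape(-1, 2) * 8 + 4    # cell centers in pixels
    ones = np.ones((centers.shape[0], 1))
    warped = np.hstack([centers, ones]) @ H.T                       # warp centers of image 1
    warped = warped[:, :2] / warped[:, 2:3]
    dists = np.linalg.norm(warped[:, None, :] - centers[None, :, :], axis=-1)
    return (dists <= threshold).astype(np.float32)

s = correspondence_mask(30, 40, np.eye(3))   # identity homography
print(s.sum())  # 1200.0: exactly one positive match per cell
```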

About your other questions:

  1. I tuned the lambda_d parameter to get better results. Its original role is to compensate for the fact that, when considering a single cell center, for one positive match there are many more negative matches. In our case the training is performed on COCO images resized to 240x320, so we have 240/8 x 320/8 = 1200 cell centers. Thus we expect 1 positive match for 1199 negative matches. I tried different values between 250 and 1200 for lambda_d and found that 800 gave the best results in my case.

  2. top_k specifies the maximum number of features that you want to keep in your image. The detection map is first filtered by removing the responses lower than detection_threshold, and then we keep the top_k points with the highest responses (see the sketch after this list). It's actually not used during training, but it plays an important role when exporting the detections and for the evaluation.

  3. For the two evaluation metrics (repeatability and homography estimation) I strictly followed the description given in appendix A of the original article, on page 10 (https://arxiv.org/pdf/1712.07629.pdf). They provide a good description of the metrics there, with equations.
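Regarding point 2, a small illustration of that two-step selection (my own sketch, with made-up threshold and top_k values, not the export code itself):

```python
import numpy as np

def select_keypoints(heatmap, detection_threshold=0.015, top_k=300):
    """Keep the responses above the threshold, then the top_k strongest among them."""
    ys, xs = np.where(heatmap > detection_threshold)
    scores = heatmap[ys, xs]
    order = np.argsort(-scores)[:top_k]            # strongest responses first
    return np.stack([ys[order], xs[order]], axis=-1), scores[order]

pts, scores = select_keypoints(np.random.rand(240, 320) ** 8)
print(len(pts))  # at most top_k keypoints
```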

UCRajkumar commented 5 years ago

Thank you for the explanation for the other questions.

Regarding the correspondences, that's slightly different from what I understood. If a cell is 8x8, then the center pixel is 4 pixels from the edge. If you consider two adjacent cells, their centers will be exactly 8 pixels apart, satisfying s_hwh'w' = 1. And there are exactly 4 adjacent cells, one per side of the cell. Hence, 4 adjacent cells + the center cell = 5 cells with s_hwh'w' = 1 (i.e. 5 positive correspondences). So, ignoring the identity homography case, how can you have only 1 positive correspondence?

rpautrat commented 5 years ago

Once again, I reduced the pixel threshold from 8 to 7.5 to avoid having the neighboring centers considered as positive matches.

UCRajkumar commented 5 years ago

Ah I see, so for both cases, identity and non-identity homographies, you ensured that only the center patch is considered a positive correspondence? In your previous explanation, it seemed like the 7.5 threshold applied only to the identity homography, which is where I was confused.

rpautrat commented 5 years ago

Ah yes sorry, the identity homography was just an example. But the threshold of 7.5 is indeed valid for all homographies, not just the identity.

UCRajkumar commented 5 years ago

The repeatability metric from the paper and the code don't seem to quite agree, from my understanding. From the paper, my impression is the following. Given equation (14), we have a sum of corr() values over all keypoints in the reference image and all keypoints in the warped image, divided by the total number of keypoints in the two images.

  1. corr(x_i) means we check whether a predicted keypoint in the REFERENCE image is correct using the ground-truth keypoints of the REFERENCE image. These ground-truth keypoints are obtained by applying the inverse homography to the keypoints of the warped image (the homography is provided by the dataset).
  2. corr(x_j) means we check whether a predicted keypoint in the WARPED image is correct using the ground-truth keypoints of the WARPED image. These ground-truth keypoints are obtained by applying the homography to the keypoints of the reference image.

However, in the code, you are only doing true_warped minus warped and then looking at axes 0 and 1. So essentially, you're only getting the sum of (A) the number of predicted keypoints that align with ground-truth keypoints IN THE WARPED IMAGE and (B) the number of ground-truth keypoints that align with the predicted keypoints IN THE WARPED IMAGE.

Can you please clarify this?

rpautrat commented 5 years ago

I see what you mean, there are indeed different ways to interpret equation (14), and your interpretation seems more logical now that you point it out. I don't think there should be a big difference between the two views, but I will implement your interpretation in my next update of the code (I will soon push an updated version, if my latest changes improve the current model).
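For reference, a hedged sketch of the symmetric interpretation discussed above (not the repository's evaluation code): keypoints from each image are warped into the other image's frame with the ground-truth homography, and a keypoint counts as repeated if it lands within eps pixels of a keypoint detected there.

```python
import numpy as np

def warp_points(points, H):
    """points: (N, 2) as (x, y); H: 3x3 homography."""
    pts_h = np.hstack([points, np.ones((len(points), 1))]) @ H.T
    return pts_h[:, :2] / pts_h[:, 2:3]

def repeatability(kpts_ref, kpts_warped, H, eps=3.0):
    # corr(x_i): reference keypoints warped by H, matched against detections in the warped image
    d1 = np.linalg.norm(warp_points(kpts_ref, H)[:, None] - kpts_warped[None], axis=-1)
    # corr(x_j): warped keypoints warped back by H^-1, matched against detections in the reference image
    d2 = np.linalg.norm(warp_points(kpts_warped, np.linalg.inv(H))[:, None] - kpts_ref[None], axis=-1)
    n_correct = (d1.min(axis=1) <= eps).sum() + (d2.min(axis=1) <= eps).sum()
    return n_correct / (len(kpts_ref) + len(kpts_warped))

k1 = np.array([[10., 20.], [50., 60.]])
k2 = np.array([[10., 20.], [100., 5.]])
print(repeatability(k1, k2, np.eye(3)))  # 0.5: one repeated point per direction, out of 4 keypoints in total
```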

Thanks again for this careful reading of this code!

UCRajkumar commented 5 years ago

Great! From my end, the relative changes seem pretty consistent. Is there any way to track both the training and validation losses? I wanted to visualize the training and validation loss per epoch for the training of SuperPoint, but I can't seem to figure out how to do that given the current output.

rpautrat commented 5 years ago

Currently the code prints the training loss and the evaluation metrics. You can modify this in the file base_model.py, line 319. There you can evaluate the loss on the validation set and print it (you would need to modify the function 'evaluate' to return not only the metrics but also the loss).

UCRajkumar commented 5 years ago

I see, got it, thank you! It seems the network is mostly lacking in homography estimation for viewpoint changes. Is there any reason you're limiting the amount of warping during training? For instance, by not using bigger values for max_angle, etc., in the superpoint_coco.yaml file.

Along those same lines, what exactly do the variables perspective_amplitude_x, perspective_amplitude_xy, patch_ratio, and max_angle control? I believe I have an idea, but I want to confirm what the intention was.

rpautrat commented 5 years ago

I tried to increase the amount of warping, but it gave worse results (the changes were too difficult to learn, I suppose). And since I was evaluating on HPatches, where the viewpoint changes are not that large (in particular in angle), it didn't make sense to use such significant warping.

All the parameters controlling the homography are described in the code, in homographies.py, in the function 'sample_homography'.

UCRajkumar commented 5 years ago

I had a question about the loss propagation. I don't quite understand it from the paper and wanted to clarify what I see in the code.

For SuperPoint: there is a detector loss and a descriptor loss, one for each branch, and an overall loss function that combines both. (Option 1) Is this single value from the overall loss function back-propagated through both branches? Or (option 2) is the detector loss propagated through the detector branch and the descriptor loss through the descriptor branch? I don't think it is option 2, otherwise there would be no need to combine both losses into one. Hence, if it's option 1, isn't there a possibility that, if there's a large disparity between detector and descriptor accuracy, the propagated loss can unduly affect the wrong branch?

For MagicPoint: I presume only the detector loss is used, is that correct?

rpautrat commented 5 years ago

Yes, it is option 1 for SuperPoint, and only the detector loss for MagicPoint. And yes, if one of the two branches is too weak compared to the other, it can penalize the good branch. This is why balancing the two losses is not so easy and is quite important. But the shared training can also improve the quality of both components (detector and descriptor), because the two are somewhat interconnected: it can be beneficial to know the descriptors to find keypoints, for example (or vice versa).
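A TF2-style sketch of option 1 with a toy model (my own; the actual repository uses TF1 and its own loss terms): both losses are summed into a single scalar, and one backward pass sends gradients through both heads and the shared encoder.

```python
import tensorflow as tf

# Toy shared encoder with a detector head and a descriptor head
inputs = tf.keras.Input(shape=(64, 64, 1))
feat = tf.keras.layers.Conv2D(32, 3, strides=8, padding='same', activation='relu')(inputs)
det_logits = tf.keras.layers.Conv2D(65, 1)(feat)   # interest point head
desc = tf.keras.layers.Conv2D(256, 1)(feat)        # descriptor head
model = tf.keras.Model(inputs, [det_logits, desc])
optimizer = tf.keras.optimizers.Adam(1e-4)

lambda_loss = 1e-4                                  # weight balancing the two terms
image = tf.random.normal([2, 64, 64, 1])

with tf.GradientTape() as tape:
    det_out, desc_out = model(image, training=True)
    detector_loss = tf.reduce_mean(tf.square(det_out))    # placeholder losses for the sketch
    descriptor_loss = tf.reduce_mean(tf.square(desc_out))
    total_loss = detector_loss + lambda_loss * descriptor_loss

# A single gradient computation updates both branches and the shared encoder
grads = tape.gradient(total_loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```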

UCRajkumar commented 5 years ago

Any update on the L2 normalization of the descriptors before the loss? Referring to this post. I would like to train it myself, but I am having some trouble implementing it so that the shapes agree properly. If the code is updated on a separate branch, I can train and provide updated results for you.

UCRajkumar commented 4 years ago

I suspect something very tricky is happening in the paper that was not caught. Consider the following: although we agreed in an earlier post that the loss is computed on the L2-normalized values of the output tensor of the descriptor branch, I don't believe this is actually the case. According to the paper, L2 normalization occurs after the bilinear interpolation. Again according to the paper, the descriptor vectors whose dot product is taken in the descriptor loss come from "D", which has shape H_c x W_c x 256. This indicates that the loss is computed before the interpolation operation, and consequently before the L2 normalization. If the values are NOT normalized, then when computing the loss it makes no sense to take "positive_margin minus the dot product", as the dot product is no longer confined to [-1, 1].

For example, if two descriptor vectors (each of size 5) both had all values equal to 0.9, their dot product would be 4.05, whereas if they were all 0.1, their dot product would be 0.05. Both cases are positive correspondences, but in the first case the loss would be 0 and in the latter case the loss would be 237.5 (250 * max(0, 1 - d^T d')).

rpautrat commented 4 years ago

> Any update on the L2 normalization of the descriptors before the loss? Referring to this post. I would like to train it myself, but I am having some trouble implementing it so that the shapes agree properly. If the code is updated on a separate branch, I can train and provide updated results for you.

I have implemented it and I have also added several other improvements from other issues and remarks. I am currently retraining the whole pipeline in order to evaluate the impact of the new changes. This will take a few days, but I will keep you updated on the results and push the changes to a new branch if it really improves.

> I suspect something very tricky is happening in the paper that was not caught. Consider the following: although we agreed in an earlier post that the loss is computed on the L2-normalized values of the output tensor of the descriptor branch, I don't believe this is actually the case. According to the paper, L2 normalization occurs after the bilinear interpolation. Again according to the paper, the descriptor vectors whose dot product is taken in the descriptor loss come from "D", which has shape H_c x W_c x 256. This indicates that the loss is computed before the interpolation operation, and consequently before the L2 normalization. If the values are NOT normalized, then when computing the loss it makes no sense to take "positive_margin minus the dot product", as the dot product is no longer confined to [-1, 1].

> For example, if two descriptor vectors (each of size 5) both had all values equal to 0.9, their dot product would be 4.05, whereas if they were all 0.1, their dot product would be 0.05. Both cases are positive correspondences, but in the first case the loss would be 0 and in the latter case the loss would be 237.5 (250 * max(0, 1 - d^T d')).

Yes, that's what I realized and what I meant in our previous conversation about the normalization of the descriptors (https://github.com/rpautrat/SuperPoint/issues/95#issuecomment-518119998). The descriptors have to be normalized before the computation of the loss (and after interpolation as well). This is already part of my new improvements.

UCRajkumar commented 4 years ago

When I looked into the valid_mask, it seems like it's just a matrix of ones, and when it's multiplied inside Tensorflow's loss function, it's not transformed in any way, so it seems like the loss tensor is just being multiplied by all 1s before the averaging happens. Can you please clarify this?

rpautrat commented 4 years ago

Are you sure? If the homographic augmentation is turned on with artifacts allowed, the function homographic_augmentation in datasets/utils/pipeline.py should compute a valid mask which is not just full of 1s.

zpfriedel commented 4 years ago

@rpautrat Curious to know if you were able to improve results when implementing the various new changes, specifically the normalization of the descriptors before the loss? Any insight you can share would be awesome, thank you!

rpautrat commented 4 years ago

So most of the new changes didn't change the performance much, except that they improved the homographic adaptation (there are no longer multiple detections around a single interest point, as was happening before). However, adding the normalization of the descriptors before the loss drastically decreases the performance of the descriptors. It could be that the parameters of the descriptor loss need to be re-tuned to adapt to the normalized descriptors. So I am currently running more experiments with different parameters to see if that helps.

zpfriedel commented 4 years ago

At least there were some improvements in the pipeline! I'm curious what you changed to improve the performance of homographic adaptation, if you don't mind sharing here, and have those new changes been uploaded yet? It makes sense that the parameters would need to be tuned again for the descriptor loss. No luck with the parameters reported in the SuperPoint paper?

rpautrat commented 4 years ago

I will upload the (small) changes for homographic adaptation with the other changes soon. I am currently retraining with the parameters of the original paper, so we will see.

rpautrat commented 4 years ago

@zpfriedel @UCRajkumar, it seems that normalizing the descriptors in the loss decreases the performance of the descriptors, whether I use the parameters of the original paper (it's far worse in that case), mine, or new parameters. So I am not sure what is wrong here, or why the normalization of the descriptors impacts their quality so much.

What is weird is that the descriptors still work with normalization in the loss function, they just get worse (roughly the same results as ORB in homography estimation). And I tried changing a few parameters during training, but it hasn't improved the performance so far.

So I guess that I won't use this normalization in the new update, unless you have an explanation for this loss of performance.

UCRajkumar commented 4 years ago

How exactly did you do the normalization? Did you verify that a perfect prediction results in 0 loss and a completely opposite prediction results in a high loss? When I experimented with this, the normalization method described in the paper did not seem to produce the behavior I just described.

rpautrat commented 4 years ago

I used tf.nn.l2_normalize to normalize both the descriptors and the warped_descriptors in the descriptor_loss function. I didn't try it on simple test cases as you suggest, but I will.
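Along the lines of that suggestion, a minimal sanity check could look like this (my own sketch, not descriptor_loss itself): identical descriptors should give a positive-pair loss of about 0, and opposite descriptors a loss of about 2 with a margin of 1.

```python
import tensorflow as tf

def positive_hinge(d1, d2, positive_margin=1.0):
    """Positive part of the hinge loss on L2-normalized descriptors."""
    d1 = tf.nn.l2_normalize(d1, axis=-1)
    d2 = tf.nn.l2_normalize(d2, axis=-1)
    dot = tf.reduce_sum(d1 * d2, axis=-1)
    return tf.maximum(0.0, positive_margin - dot)

d = tf.constant([[0.1, 0.1, 0.1, 0.1, 0.1]])
print(positive_hinge(d, d).numpy())    # ~0: identical descriptors
print(positive_hinge(d, -d).numpy())   # ~2: opposite descriptors
```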

rpautrat commented 4 years ago

@UCRajkumar, I computed the loss on a few very simple cases and it always had the expected behavior. On what example exactly did you observe a strange behavior?

zpfriedel commented 4 years ago

@rpautrat Saw that you implemented the normalization of the descriptors in your latest commit. Is it working better than before now? If so, that's awesome!

rpautrat commented 4 years ago

Yes, I implemented the normalization of the descriptors in the loss, and I added another normalization to the dot product of the descriptors to help disambiguate cases where several descriptors are very close. This was not in the original paper, but it improved the results for me. The new version with normalization is a bit better than before; you can see the results in the README.

shreyasr-upenn commented 6 months ago

Hi @rpautrat, I had a question on the descriptor loss. For the computation of s, the paper uses 8 as the distance between two adjacent cells (not pixels), and you chose 7.5. Why is it 8 or 7.5 in the first place? Two adjacent cells in the reduced dimensions correspond to center pixels that are 8 pixels apart. Having a distance of under 8 between two cells is like the pixels being less than 64 pixels apart. Am I confusing the concepts here?

rpautrat commented 6 months ago

Hi, this is a minor change that should not have too much impact on the training. But our change can be understood with the following thought experiment: if you assume that H = identity, then with the paper's formulation the centers of two neighboring cells are exactly 8 pixels apart and would be considered a match (which should probably not be the case). We could make the inequality strict to avoid that, or use a threshold of 7.5 instead, as we did.

shreyasr-upenn commented 6 months ago

Ah, I missed the part where the cell locations are multiplied back to the original pixel centers, hence adjacent cells are 8 pixels apart. Thank you!

So, I tried training SuperPoint with the values of lambda_d and lambda_loss from the paper, and as you'd already know, the performance is not good. I will plug in your values and train to see the performance. I wanted to get an intuition for how you arrived at those numbers, and for how to interpret the results from the training graph. I have attached some files to show the outputs:

[image: 10% of the matches; as you can see, the matching is not correct]

[image: loss graphs over 30 epochs at LR=0.0001]

I have added CosineAnnealingLR in this case for 30 epochs. The descriptor losses are oscillating. Ignore the low values, because the loss was multiplied by 1e-4 (lambda_loss).

I'm planning to use Ray Tune for the tuning, so I would like to know a good range of values to train on.

Thank you so much for your help!

rpautrat commented 6 months ago

Hi, we tuned the lambdas so that they balance the positive and negative losses, as well as the descriptor vs detector losses. Overall this was manual trial and error.

There are a lot of good matches in the image you sent. So if this is only 10% of the matches, it seems that you already have quite a few good matches. Of course there are always some outliers remaining, but you can always add geometric filtering (e.g. RANSAC to fit a homography or fundamental matrix) to remove them.
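For instance, with OpenCV (a standard approach, independent of this repository; the match arrays below are synthetic stand-ins):

```python
import numpy as np
import cv2

# pts1, pts2: matched keypoint coordinates (N, 2) in the two images, N >= 4
pts1 = np.random.rand(50, 2).astype(np.float32) * 300
pts2 = pts1 + np.random.randn(50, 2).astype(np.float32)   # noisy copies as a stand-in

H, inlier_mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, ransacReprojThreshold=3.0)
inliers = inlier_mask.ravel().astype(bool)
print(f"{inliers.sum()} / {len(pts1)} matches kept after RANSAC")
```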

Regarding the new stuff you are adding, I can't provide any help, since I haven't tried it myself.