verlab / accelerated_features

Implementation of XFeat (CVPR 2024). Do you need robust and fast local feature extraction? You are in the right place!
https://www.verlab.dcc.ufmg.br/descriptors/xfeat_cvpr24
Apache License 2.0

WIP combining xfeat with steerers #32

Open georg-bn opened 1 month ago

georg-bn commented 1 month ago

I have a WIP of combining xfeat with steerers. I'm posting it here in case anyone else is interested in looking at this.

I've trained two versions of XFeat using this fork, one with a fixed permutation steerer and one with a learned steerer (the latter seems to be slightly better). A quick Colab demo for the fixed steerer is here. Weights are here. I haven't evaluated thoroughly yet (roughly when will your evaluation code be released?), but in a quick test on HPatches the new versions seem comparable to the original on upright images.
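In case it helps anyone experimenting with the weights, here is a minimal sketch of how C4 matching with a steerer could look, assuming the steerer is simply a 64x64 matrix S such that S @ d is the descriptor of the same point after a 90-degree rotation of the image, and descriptors are L2-normalized (steered_mnn_match, descs0 and descs1 are placeholder names, not the API of the fork):

import torch

def steered_mnn_match(descs0, descs1, steerer, min_cossim=0.9):
    # Pool the cosine similarity over the four C4 steerings of descs0,
    # then do the usual mutual-nearest-neighbour check on the pooled matrix.
    sim = torch.full((descs0.shape[0], descs1.shape[0]), -1.0, device=descs0.device)
    d0 = descs0
    for _ in range(4):
        sim = torch.maximum(sim, d0 @ descs1.t())
        d0 = torch.nn.functional.normalize(d0 @ steerer.t(), dim=-1)  # steer by another 90 degrees
    nn01 = sim.argmax(dim=1)
    nn10 = sim.argmax(dim=0)
    ids0 = torch.arange(descs0.shape[0], device=descs0.device)
    mutual = nn10[nn01] == ids0
    keep = mutual & (sim[ids0, nn01] > min_cossim)
    return ids0[keep], nn01[keep]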

Anyway, happy to discuss ideas in this thread.

ShuaiAlger commented 1 month ago

Hello Georg, your rotation-steerers are quite impressive. They achieve strong rotation-equivariant performance without relying on a rotation-equivariant group-convolution backbone. Both LightGlue (LightGlue / glue_factory) and XFeat have used LO-RANSAC from poselib as the pose estimator, which significantly outperforms the cv2.RANSAC estimator commonly used before, especially in the comprehensive evaluation on MegaDepth. I have tested both the perm version and the learned version of your steerers on MegaDepth (origin / rot-C4 / rot-rand). The results show that steered XFeat does exhibit good rotation-equivariant capability, but there is a slight performance drop when no rotation is applied.

The learned version of steered XFeat performs slightly better than the original XFeat without rotation, but its rotation-equivariant ability is almost the same as the original XFeat, which suggests that the learned steerer (xfeat_learn_steer.pth) does not contribute to rotation-equivariant performance. Did you perhaps upload the wrong weights for the learned steerer?

Below are the results I have run so far. Due to time constraints I wanted to share them as soon as possible, so some experiments that do not affect the conclusions are still missing.

# rot-C4 : random (0, 90, 180, 270)
# rot-SO2 : random [0, 360]
# random : numpy.random, manual seed : 0
# The default sim threshold is 0.9 and the default matcher is fine-matcher

# xfeat-origin, origin ,   opencv :      28.15 46.99 64.49 82.09
# xfeat-origin, origin ,   poselib-LO :  50.20 65.40 77.10 --.-- (paper)
# xfeat-origin, origin ,   poselib-LO :  50.99 66.43 78.15 82.09 (re-eval)
# xfeat-origin, rot-C4 ,   opencv :       6.80 11.86 16.42 28.25
# xfeat-origin, rot-C4 ,   poselib-LO :  12.89 16.56 19.54 28.25
# xfeat-origin, rot-SO2,   opencv :     (Theoretically weaker than rot-C4)
# xfeat-origin, rot-SO2,   poselib-LO : (Theoretically weaker than rot-C4)

# steer-perm,   origin ,   opencv :      21.98 39.67 58.65 82.90
# steer-perm,   origin ,   poselib-LO :  40.01 56.93 71.02 82.90
# steer-perm,   rot-C4 ,   opencv :      21.06 37.57 55.58 80.79
# steer-perm,   rot-C4 ,   poselib-LO :  35.22 51.36 65.83 80.79
# steer-perm,   rot-SO2,   opencv :      15.73 29.22 45.14 67.24
# steer-perm,   rot-SO2,   poselib-LO :  25.05 39.56 53.62 67.24

# steer-learn,  origin ,   opencv :      27.60 45.66 62.14 78.34
# steer-learn,  origin ,   poselib-LO :  48.67 62.87 73.59 78.34
# steer-learn,  rot-C4 ,   opencv :     
# steer-learn,  rot-C4 ,   poselib-LO :  44.47 59.78 71.76 79.75
# steer-learn,  rot-SO2,   opencv :     
# steer-learn,  rot-SO2,   poselib-LO :  31.91 46.02 58.08 65.82
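For anyone who wants to reproduce the rot-C4 / rot-SO2 settings, below is a minimal sketch of one way to generate them (an illustration, not the exact script behind the table; make_rotated_query and warp_back are hypothetical helpers): rotate one image about its centre and keep the homography that maps detected keypoints back to the original frame, so the unchanged ground-truth relative pose can still be used for evaluation.

import cv2
import numpy as np

rng = np.random.default_rng(0)  # manual seed 0, as noted above

def make_rotated_query(img, mode="rot-C4"):
    # rot-C4: angle in {0, 90, 180, 270}; rot-SO2: angle uniform in [0, 360)
    h, w = img.shape[:2]
    angle = float(rng.choice([0.0, 90.0, 180.0, 270.0])) if mode == "rot-C4" \
        else float(rng.uniform(0.0, 360.0))
    A = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)  # 2x3 affine, original -> rotated
    img_rot = cv2.warpAffine(img, A, (w, h))
    A_inv = cv2.invertAffineTransform(A)                         # rotated -> original
    H_back = np.vstack([A_inv, [0.0, 0.0, 1.0]])
    return img_rot, H_back

def warp_back(kpts, H_back):
    # map (N, 2) keypoints detected in the rotated image back to the original image
    pts = cv2.perspectiveTransform(kpts.reshape(-1, 1, 2).astype(np.float64), H_back)
    return pts.reshape(-1, 2)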

The following is the evaluation code I used, mainly derived from SuperGlue's evaluation code (superglue_eval) and the PoseLib library.

# --- GEOMETRY ---
import cv2
import numpy as np
from poselib import estimate_relative_pose


def estimate_pose(kpts0, kpts1, K0, K1, thresh, w0, h0, w1, h1, USE_LO_RANSAC=0, conf=0.99999):
    if len(kpts0) < 5:
        return None

    if USE_LO_RANSAC:
        # poselib expects lists of 2x1 points and camera dictionaries
        mkpts0 = [np.float64(np.asarray([kpts0[i, 0], kpts0[i, 1]]).reshape((2, 1)))
                  for i in range(len(kpts0))]
        mkpts1 = [np.float64(np.asarray([kpts1[i, 0], kpts1[i, 1]]).reshape((2, 1)))
                  for i in range(len(kpts1))]

        cam0 = {"model": "PINHOLE",
                "params": np.asarray([K0[0, 0], K0[1, 1], K0[0, 2], K0[1, 2]]),  # fx, fy, cx, cy
                "width": w0,
                "height": h0,
                }
        cam1 = {"model": "PINHOLE",
                "params": np.asarray([K1[0, 0], K1[1, 1], K1[0, 2], K1[1, 2]]),
                "width": w1,
                "height": h1,
                }

        # LO-RANSAC relative pose; the inlier mask, if needed, is in info["inliers"]
        M, info = estimate_relative_pose(mkpts0, mkpts1, cam0, cam1, {'max_reproj_error': 10}, {})
        ret = (M.R, M.t, 1)
        return ret
    else:
        # SuperGlue-style essential-matrix estimation with cv2.RANSAC
        f_mean = np.mean([K0[0, 0], K0[1, 1], K1[0, 0], K1[1, 1]])  # mean focal length of both cameras
        norm_thresh = thresh / f_mean

        # normalize keypoints to the unit image plane
        kpts0 = (kpts0 - K0[[0, 1], [2, 2]][None]) / K0[[0, 1], [0, 1]][None]
        kpts1 = (kpts1 - K1[[0, 1], [2, 2]][None]) / K1[[0, 1], [0, 1]][None]
        E, mask = cv2.findEssentialMat(
            kpts0, kpts1, np.eye(3), threshold=norm_thresh, prob=conf,
            method=cv2.RANSAC)

        assert E is not None

        # findEssentialMat may return several stacked 3x3 candidates;
        # keep the one with the most inliers after recoverPose's cheirality check
        best_num_inliers = 0
        ret = None
        for _E in np.split(E, len(E) // 3):
            n, R, t, _ = cv2.recoverPose(
                _E, kpts0, kpts1, np.eye(3), 1e9, mask=mask)
            if n > best_num_inliers:
                best_num_inliers = n
                ret = (R, t[:, 0], mask.ravel() > 0)
        return ret
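The per-pair pose errors and the AUC summary follow SuperGlue's evaluation utilities; a condensed sketch of those helpers is below (a sketch of the standard computation, not necessarily byte-for-byte the code I ran):

import numpy as np

def angle_error_mat(R_est, R_gt):
    # rotation error in degrees from the trace of the relative rotation
    cos = np.clip((np.trace(R_est.T @ R_gt) - 1) / 2, -1.0, 1.0)
    return np.rad2deg(np.abs(np.arccos(cos)))

def angle_error_vec(t_est, t_gt):
    # translation direction error in degrees (scale-free, sign-agnostic)
    n = np.linalg.norm(t_est) * np.linalg.norm(t_gt)
    err = np.rad2deg(np.arccos(np.clip(np.dot(t_est, t_gt) / n, -1.0, 1.0)))
    return min(err, 180.0 - err)

def pose_auc(errors, thresholds=(5, 10, 20)):
    # area under the cumulative pose-error curve at each threshold (SuperGlue-style)
    errors = np.sort(np.asarray(errors, dtype=np.float64))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    errors = np.concatenate(([0.0], errors))
    recall = np.concatenate(([0.0], recall))
    aucs = []
    for t in thresholds:
        last = np.searchsorted(errors, t)
        r = np.concatenate((recall[:last], [recall[last - 1]]))
        e = np.concatenate((errors[:last], [t]))
        aucs.append(np.trapz(r, x=e) / t)
    return aucs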

In fact, I have recently been trying to combine e2CNN and XFeat, and I found that this approach is very inefficient. The reason is that XFeat gets its real-time CPU performance largely from its very shallow convolutional layers: the channel count grows only from 1 to 4, and then from 4 to 8. For a rotation-equivariant group-convolution backbone, however, C8 is a more suitable sampling group (following ReF, group pooling is too inefficient; following RELF, C4 cannot provide strong support at 45-degree orientations). So if we replace XFeat's layers directly with e2CNN, the first convolution effectively becomes 1->32 channels, and so on, which throws away XFeat's efficiency advantage.
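To make the channel blow-up concrete, here is a tiny e2cnn sketch of what a C8-equivariant first layer looks like if XFeat's 1->4 convolution is swapped for regular-representation fields (illustrative sizes only, not a proposed design):

from e2cnn import gspaces
from e2cnn import nn as enn

r2_act = gspaces.Rot2dOnR2(N=8)                              # C8 rotation group
in_type = enn.FieldType(r2_act, [r2_act.trivial_repr])       # 1-channel grayscale input
out_type = enn.FieldType(r2_act, 4 * [r2_act.regular_repr])  # 4 regular fields = 32 channels
conv1 = enn.R2Conv(in_type, out_type, kernel_size=3, padding=1)

print(in_type.size, "->", out_type.size)                     # 1 -> 32

Each regular C8 field carries 8 channels, so even a modest number of fields already multiplies the width of XFeat's shallow early layers.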

Your rotation-steerers approach is really elegant and practical. I look forward to combining it with XFeat to build a more powerful rotation-equivariant keypoint extractor.

georg-bn commented 1 month ago

Hi Shuai,

Thanks for the eval! Here's a demo for the learned steerer: colab. I think one needs to tune the min_cossim_coarse threshold, perhaps.

Combining with e2cnn would definitely also be interesting. As you say, however, it may be difficult to create a well-performing small network.

I haven't trained an SO2-version of steerers+xfeat yet, but might try it in the future.

ShuaiAlger commented 1 month ago

Thank you, Georg. I am trying the learned-steerer code you provided and found that I had previously made a mistake by using the perm-steerer code. With the correct learned-steerer code, the performance is better than the perm-steerer, as you initially mentioned. I will update the results in the table above as soon as possible.

guipotje commented 1 month ago

Hello @georg-bn and @ShuaiAlger,

@georg-bn, it's very nice to see steerers integrated into XFeat! Thank you for providing the code examples on how to use steerers with XFeat. I believe this will help a lot with problems where rotation invariance is needed and aligns well with maintaining efficiency instead of performing brute-force test-time augmentation :)

I should release the evaluation scripts by next week; I've been busy these past days. @ShuaiAlger, thank you for providing the results with the adapted SuperGlue evaluation script.

I'm also training XFeat + LightGlue (smaller model) and should release it soon, in case anyone is interested!

noahzn commented 3 weeks ago

Hi @georg-bn, thanks for the work! I am wondering how I should use a learned steerer .pth file and its descriptors to train LightGlue, instead of using MNN matching?

georg-bn commented 3 weeks ago

Hi @noahzn,

I think there are different options; we discussed it a bit here: https://github.com/georg-bn/rotation-steerers/issues/3

If you are interested in working on this as a research project, let me know as I am also interested.

noahzn commented 3 weeks ago

@georg-bn Hi, thank you for your reply! Yes, I'm interested in this project. Now I'm training XFeat with a learned steerer.

noahzn commented 2 weeks ago

@georg-bn @guipotje In XFeat's training code, the random in-plane rotation is limited to (-30, 30) degrees. Have you tried increasing this range so that the model can deal with larger rotations?

guipotje commented 2 weeks ago

Hi @noahzn, in my experience that would generally improve in-plane rotation robustness, but it would probably hurt performance on upright images. It's a trade-off; I believe [-30, 30] is a good balance for typical upright images. It would make sense to use steerers if you know that your images may have large in-plane rotations.

noahzn commented 2 weeks ago

@guipotje thanks for your reply! My use case is a bit different. I'm matching a small patch to a larger one: for example, I have a 500x500 UAV image and I want to match it against a larger 2000x2000 patch. The target region can be anywhere in the larger patch, e.g. in the top left from (0, 0) to (500, 500), or in the bottom right from (1500, 1500) to (2000, 2000). Do you think this is related to in-plane rotation, or to translation?