opencv / opencv

Open Source Computer Vision Library
https://opencv.org
Apache License 2.0

[GSOC] New camera model for stitching pipeline #6933

Closed hrnr closed 7 years ago

hrnr commented 7 years ago

Merge with extra: https://github.com/opencv/opencv_extra/pull/303

This PR contains all work for New camera model for stitching pipeline GSoC 2016 project.

GSoC Proposal

The stitching pipeline is well-established code in OpenCV. It provides good results for creating panoramas from camera-captured images. The main limitation of the stitching pipeline is its expected camera model (perspective transformation). Although this model is fine for many applications working with camera-captured images, there are applications that the current stitching pipeline does not cover.

New camera model

Due to physical constraints, some applications can expect a much simpler transform with fewer degrees of freedom. These are situations where the input data are not subject to a perspective transform, so the transformation can be much simpler, such as an affine transformation. The datasets considered here include images captured by special hardware (such as book scanners[0] that try hard to eliminate perspective), maps from laser scanning (produced from different starting points), and preprocessed images (where perspective was compensated by other robust means that take advantage of the physical situation; e.g. for book scanners we would use calibration data to compensate for the remaining perspective). In all those situations we would like to obtain an image mosaic under an affine transformation.

I'd like to introduce a new camera model based on affine transformation to the stitching pipeline. This would include:

I used an approach based on affine transformation to merge maps produced by multiple robots [1] for my robotics project. It shows good results. However, as mentioned earlier, the applications for this model are much broader than that.

Parallelism for FeaturesFinder

To make the stitching pipeline more comfortable to use and more performant for a large number of images, I'd also like to improve FeaturesFinder to allow finding features in parallel. All camera models and other users of FeaturesFinder may benefit from that. The API could be similar to FeaturesMatcher::operator ()(features, pairwise_matches, mask).

This could be done with TBB, in a similar manner to the FeaturesMatcher method mentioned above, which is already used in the stitching pipeline. There would be almost no additional overhead from starting new threads in typical scenarios, because these threads already exist for FeaturesMatcher. This change would be fully integrated into the high-level stitching interface.

Some changes might be necessary in the finders to ensure thread safety. Where thread safety can't be ensured or does not make sense (GPU finders), parallelization would be disabled and all images would be processed serially, so this method would always be safe to use regardless of the underlying finder. This approach is also similar to FeaturesMatcher.
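The fallback rule described above can be sketched in plain C++. This is a minimal illustration, not the PR's implementation: it uses `std::thread` instead of TBB so that it stays self-contained, and `Finder`, its `threadSafe` flag, and `findAll` are hypothetical names invented for the example. An "image" is reduced to an `int` to keep the sketch short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a feature finder: maps an "image" (an int here)
// to a result and declares whether concurrent calls are safe.
struct Finder {
    bool threadSafe;
    std::function<int(int)> find;
};

// Runs the finder on every image. Falls back to a serial loop whenever the
// finder is not thread-safe, so the call is safe for any finder.
std::vector<int> findAll(const Finder& finder, const std::vector<int>& images,
                         unsigned maxThreads = std::thread::hardware_concurrency())
{
    assert(static_cast<bool>(finder.find));
    std::vector<int> results(images.size());

    if (!finder.threadSafe || maxThreads < 2 || images.size() < 2) {
        for (std::size_t i = 0; i < images.size(); ++i)
            results[i] = finder.find(images[i]);
        return results;
    }

    // Each worker handles indices t, t + n, t + 2n, ... and writes only to
    // its own result slots, so no locking is needed.
    const std::size_t n = std::min<std::size_t>(maxThreads, images.size());
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < n; ++t)
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < images.size(); i += n)
                results[i] = finder.find(images[i]);
        });
    for (auto& w : workers)
        w.join();
    return results;
}
```

Both paths return results in input order, so a caller cannot tell from the output whether the parallel or the serial branch ran.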

Benefits to OpenCV

short video presenting this project

other work

During this GSoC I have also coded some related work that is not going to be included (mostly because we have chosen a different approach or the work has been merged under this PR). It is listed here for completeness.

PRs:

commits:

hrnr commented 7 years ago

ORB is apparently thread-unsafe when running with OpenCL.

I have restarted the build once. Last time it passed 1 OCL stitching test; this time it went through 2 of them. There seems to be a race condition.

Is ORB expected to be thread-unsafe with OCL, or should this be fixed? I can disable my parallel feature finding changes with OCL, but I don't know whether this is an actual bug in ORB.

hrnr commented 7 years ago

I have disabled parallel feature finding when running with OpenCL. There is not much benefit, because it needs to wait for the device.

This solves the issues with ORB for me, but I'm still not sure whether this should be opened as a bug or not.

hrnr commented 7 years ago

rebase to catch latest changes in master (especially #6962)

hrnr commented 7 years ago

FYI: The build is not failing; there is only a warning that the patch to opencv_extra is too big (~1MB). Is that a problem? I'd like to add more images.

alalek commented 7 years ago

I believe this is not a problem in this case. When the PR is ready, please squash all commits into one to keep the patch changes clean.

mself commented 7 years ago

Jiri, I have some comments/suggestions that might be easier to discuss offline. Can you shoot me an email at dmz@mself.com to connect? I did some work to create variants of findHomography() that estimate constrained transformations that have 3 or 4 degrees of freedom rather than the 8 DOF of a full homography. These correspond to (3D) rotations only, and rotations plus uniform scaling. This is similar to your 4 DOF variant. I'd be interested in adding a 3 DOF version that only allows a (2D) rotation plus translation, for example. --Matthew
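For reference, the 2D transform families touched on here can be written as 2x3 matrices acting on homogeneous points $(x, y, 1)^T$. Reading the "4 DOF variant" as rotation plus uniform scale plus translation is an assumption on my part; the 3D-rotation variants of findHomography() mentioned above are not shown.

```latex
\[
\underbrace{\begin{pmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \end{pmatrix}}_{\text{affine, 6 DOF}}
\qquad
\underbrace{\begin{pmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \end{pmatrix}}_{\text{rotation + uniform scale + translation, 4 DOF}}
\qquad
\underbrace{\begin{pmatrix} \cos\theta & -\sin\theta & t_x \\ \sin\theta & \cos\theta & t_y \end{pmatrix}}_{\text{rotation + translation, 3 DOF}}
\]
```

Each 2D point correspondence contributes two equations, so the minimal sample is three pairs for the 6 DOF model and two pairs for the 4 DOF and 3 DOF models.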

hrnr commented 7 years ago

OK. I think I have solved the issue with coincident points. haveCollinearPoints in fact also checks coincidence: when points are coincident, the check essentially becomes 0 <= FLT_EPSILON * (abs(dx2)+abs(dy2)), so it should report coincidence correctly.

I have reworked checkSubset for estimateAffine*, so that there is no duplicate code. After all, the functions should just be more robust.
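The coincidence argument can be checked with a small standalone sketch. It is modeled on the inequality quoted above rather than copied from OpenCV: `Pt` is a hypothetical stand-in for the point type, `degenerateTriple` is an invented name, and the tolerance sums all four absolute differences, which reduces to the quoted form when `dx1 = dy1 = 0`.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

struct Pt { double x, y; };  // hypothetical stand-in for a 2D point type

// Returns true when p lies (numerically) on the line through a and b.
// If p coincides with a, then dx1 = dy1 = 0, the cross product is 0, and
// the test becomes 0 <= FLT_EPSILON * (|dx2| + |dy2|), which always holds.
// The same happens when p coincides with b.
bool degenerateTriple(Pt a, Pt b, Pt p)
{
    const double dx1 = a.x - p.x, dy1 = a.y - p.y;
    const double dx2 = b.x - p.x, dy2 = b.y - p.y;
    const double cross = dx2 * dy1 - dy2 * dx1;
    const double scale = std::fabs(dx1) + std::fabs(dy1)
                       + std::fabs(dx2) + std::fabs(dy2);
    return std::fabs(cross) <= FLT_EPSILON * scale;
}
```

With this formulation a single test rejects both collinear and repeated points, so a sampled subset needs no separate coincidence check.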

hrnr commented 7 years ago

I have experimented with solving the system in affinePartial callback analytically. In my experiments it surprisingly runs slower than SVD version.

SVD:


calib3d_posix_x64_5074d25_20160811-122223.xml

                     Name of Test                           Number of     Number of   Min     Median  Geometric mean   Mean   Standard deviation
                                                        collected samples outliers                                                              
EstimateAffine2D::EstimateAffine::(100, 0.9)                   38             3     0.15 ms  0.15 ms     0.15 ms     0.15 ms       0.00 ms      
EstimateAffine2D::EstimateAffine::(100, 0.95)                  38             3     0.17 ms  0.18 ms     0.18 ms     0.18 ms       0.00 ms      
EstimateAffine2D::EstimateAffine::(100, 0.99)                  36             2     0.25 ms  0.25 ms     0.25 ms     0.25 ms       0.01 ms      
EstimateAffine2D::EstimateAffine::(100000, 0.9)                10             0     31.49 ms 31.63 ms    31.65 ms    31.65 ms      0.15 ms      
EstimateAffine2D::EstimateAffine::(100000, 0.95)               10             0     34.32 ms 34.40 ms    34.47 ms    34.47 ms      0.25 ms      
EstimateAffine2D::EstimateAffine::(100000, 0.99)               10             0     41.12 ms 41.34 ms    41.63 ms    41.64 ms      0.81 ms      
EstimateAffine2D::EstimateAffine::(5000, 0.9)                  13             1     1.23 ms  1.23 ms     1.24 ms     1.24 ms       0.01 ms      
EstimateAffine2D::EstimateAffine::(5000, 0.95)                 13             1     1.39 ms  1.39 ms     1.40 ms     1.40 ms       0.01 ms      
EstimateAffine2D::EstimateAffine::(5000, 0.99)                 38             3     1.77 ms  1.78 ms     1.81 ms     1.81 ms       0.05 ms      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9)            100            8     0.06 ms  0.06 ms     0.06 ms     0.06 ms       0.00 ms      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95)           75             6     0.07 ms  0.07 ms     0.07 ms     0.07 ms       0.00 ms      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99)           63             5     0.08 ms  0.08 ms     0.09 ms     0.09 ms       0.00 ms      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9)         10             0     17.02 ms 17.07 ms    17.10 ms    17.10 ms      0.08 ms      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95)        10             0     17.59 ms 17.72 ms    17.73 ms    17.73 ms      0.12 ms      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99)        10             0     20.04 ms 20.10 ms    20.19 ms    20.19 ms      0.22 ms      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9)           13             1     0.50 ms  0.50 ms     0.50 ms     0.50 ms       0.01 ms      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95)          13             1     0.58 ms  0.58 ms     0.58 ms     0.58 ms       0.01 ms      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99)          13             1     0.71 ms  0.71 ms     0.72 ms     0.72 ms       0.02 ms      

analytic:


calib3d_posix_x64_5074d25_20160811-161048.xml

                     Name of Test                           Number of     Number of   Min     Median  Geometric mean   Mean   Standard deviation
                                                        collected samples outliers                                                              
EstimateAffine2D::EstimateAffine::(100, 0.9)                   63             5     0.15 ms  0.15 ms     0.15 ms     0.15 ms       0.00 ms      
EstimateAffine2D::EstimateAffine::(100, 0.95)                  38             3     0.18 ms  0.18 ms     0.18 ms     0.18 ms       0.00 ms      
EstimateAffine2D::EstimateAffine::(100, 0.99)                  33             2     0.25 ms  0.25 ms     0.26 ms     0.26 ms       0.01 ms      
EstimateAffine2D::EstimateAffine::(100000, 0.9)                10             0     31.36 ms 31.43 ms    31.56 ms    31.56 ms      0.32 ms      
EstimateAffine2D::EstimateAffine::(100000, 0.95)               10             0     34.16 ms 34.27 ms    34.39 ms    34.39 ms      0.35 ms      
EstimateAffine2D::EstimateAffine::(100000, 0.99)               10             0     41.20 ms 41.35 ms    42.09 ms    42.11 ms      1.15 ms      
EstimateAffine2D::EstimateAffine::(5000, 0.9)                  10             0     1.21 ms  1.22 ms     1.23 ms     1.24 ms       0.03 ms      
EstimateAffine2D::EstimateAffine::(5000, 0.95)                 10             0     1.36 ms  1.37 ms     1.38 ms     1.38 ms       0.03 ms      
EstimateAffine2D::EstimateAffine::(5000, 0.99)                 13             1     1.76 ms  1.77 ms     1.79 ms     1.79 ms       0.05 ms      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9)            42             3     0.05 ms  0.05 ms     0.05 ms     0.05 ms       0.00 ms      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95)           25             2     0.06 ms  0.06 ms     0.06 ms     0.06 ms       0.00 ms      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99)           42             3     0.07 ms  0.07 ms     0.07 ms     0.07 ms       0.00 ms      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9)         10             0     17.07 ms 17.15 ms    17.19 ms    17.20 ms      0.15 ms      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95)        10             0     17.71 ms 17.84 ms    17.86 ms    17.86 ms      0.12 ms      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99)        10             0     20.26 ms 20.34 ms    20.51 ms    20.51 ms      0.46 ms      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9)           13             1     0.49 ms  0.49 ms     0.50 ms     0.50 ms       0.01 ms      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95)          11             0     0.57 ms  0.57 ms     0.58 ms     0.58 ms       0.02 ms      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99)          19             1     0.69 ms  0.70 ms     0.71 ms     0.71 ms       0.02 ms      

you can find the code at 57bf2d46cd7b2911cd518a2bdbe745cb43f95da8

mself commented 7 years ago

That's surprising! If you're up for more experiments, I realized that you can solve the entire kernel analytically without even a matrix multiply. This should be even faster. I can't see how SVD could be faster than this!

        double x1 = from[0].x;
        double y1 = from[0].y;
        double x2 = from[1].x;
        double y2 = from[1].y;

        double X1 = to[0].x;
        double Y1 = to[0].y;
        double X2 = to[1].x;
        double Y2 = to[1].y;

        double d = 1./((x1-x2)*(x1-x2) + (y1-y2)*(y1-y2));

        Xdata[0] = d * ( (X1-X2)*(x1-x2) + (Y1-Y2)*(y1-y2) );
        Xdata[1] = d * ( (Y1-Y2)*(x1-x2) - (X1-X2)*(y1-y2) );
        Xdata[2] = d * ( (Y1-Y2)*(x1*y2 - x2*y1) - (X1*y2 - X2*y1)*(y1-y2) - (X1*x2 - X2*x1)*(x1-x2) );
        Xdata[3] = d * (-(X1-X2)*(x1*y2 - x2*y1) - (Y1*x2 - Y2*x1)*(x1-x2) - (Y1*y2 - Y2*y1)*(y1-y2) );

The compiler should be able to optimize all of the common subexpressions and there are no function calls.
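A self-contained sketch of the same two-point solve can be used to check the algebra. It assumes the four parameters are ordered `[a, b, tx, ty]` with `X = a*x - b*y + tx` and `Y = b*x + a*y + ty`, which is my reading of the snippet above; `Point2`, `Similarity`, `solveFromTwoPairs`, and `apply` are hypothetical names. The translation is obtained here by back-substituting the first pair, which should be algebraically equivalent to the expanded closed form.

```cpp
#include <cassert>
#include <cmath>

struct Point2 { double x, y; };

// Assumed parameterization: X = a*x - b*y + tx,  Y = b*x + a*y + ty.
struct Similarity { double a, b, tx, ty; };

// Solves the 4-DOF transform that maps p1 -> q1 and p2 -> q2 exactly.
// Precondition: p1 and p2 are distinct points.
Similarity solveFromTwoPairs(Point2 p1, Point2 p2, Point2 q1, Point2 q2)
{
    const double dx = p1.x - p2.x, dy = p1.y - p2.y;  // source difference
    const double dX = q1.x - q2.x, dY = q1.y - q2.y;  // target difference
    const double norm2 = dx * dx + dy * dy;
    assert(norm2 > 0.0);
    const double d = 1.0 / norm2;

    Similarity m;
    m.a = d * (dX * dx + dY * dy);
    m.b = d * (dY * dx - dX * dy);
    // Back-substitute the first pair to recover the translation.
    m.tx = q1.x - m.a * p1.x + m.b * p1.y;
    m.ty = q1.y - m.b * p1.x - m.a * p1.y;
    return m;
}

// Applies the transform to a point.
Point2 apply(const Similarity& m, Point2 p)
{
    return { m.a * p.x - m.b * p.y + m.tx, m.b * p.x + m.a * p.y + m.ty };
}
```

A simple round trip is a useful sanity check: transform two points with known parameters, solve from the resulting pairs, and compare the recovered parameters with the originals.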

hrnr commented 7 years ago

Yep, I was also surprised. I think that the kernel is not a bottleneck for the function. But I will try your version; that seems even better.

I will also optimize the copying of inliers, which is currently quite inefficient; that could speed things up as well.

hrnr commented 7 years ago

I have updated the perf test to use the new API. It seems that a lot of time is spent in Levenberg-Marquardt refining. RANSAC is only about 1/2 or even 1/3 of the runtime. LMEDS takes much longer, so a faster kernel makes more sense here. Here are the current results with SVD-based kernels:


calib3d_posix_x64_d9a138e_20160812-152109.xml

                           Name of Test                                 Number of     Number of    Min     Median   Geometric mean   Mean    Standard deviation
                                                                    collected samples outliers                                                                 
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0)                     13             1      0.07 ms   0.07 ms     0.07 ms      0.07 ms       0.00 ms      
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10)                    73             5      0.10 ms   0.10 ms     0.10 ms      0.10 ms       0.00 ms      
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0)                    38             3      0.11 ms   0.11 ms     0.11 ms      0.11 ms       0.00 ms      
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10)                   13             1      0.15 ms   0.16 ms     0.16 ms      0.16 ms       0.00 ms      
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0)                    50             4      0.07 ms   0.08 ms     0.08 ms      0.08 ms       0.00 ms      
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10)                   63             5      0.10 ms   0.10 ms     0.10 ms      0.10 ms       0.00 ms      
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0)                   63             5      0.14 ms   0.14 ms     0.14 ms      0.14 ms       0.00 ms      
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10)                  25             2      0.17 ms   0.17 ms     0.17 ms      0.17 ms       0.00 ms      
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0)                    25             2      0.12 ms   0.12 ms     0.13 ms      0.13 ms       0.00 ms      
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10)                   25             2      0.16 ms   0.16 ms     0.16 ms      0.16 ms       0.00 ms      
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0)                   13             1      0.21 ms   0.21 ms     0.21 ms      0.21 ms       0.01 ms      
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10)                  35             2      0.24 ms   0.24 ms     0.25 ms      0.25 ms       0.01 ms      
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0)                  10             0     79.69 ms  79.82 ms     79.93 ms    79.93 ms       0.36 ms      
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10)                 10             0     137.52 ms 138.79 ms   138.58 ms    138.58 ms      0.54 ms      
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0)                 10             0     10.13 ms  10.20 ms     10.25 ms    10.25 ms       0.17 ms      
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10)                10             0     38.84 ms  39.01 ms     39.04 ms    39.04 ms       0.17 ms      
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0)                 10             0     97.95 ms  98.12 ms     98.31 ms    98.31 ms       0.72 ms      
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10)                10             0     156.45 ms 157.31 ms   157.32 ms    157.32 ms      0.81 ms      
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0)                10             0     12.92 ms  13.03 ms     13.17 ms    13.17 ms       0.33 ms      
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10)               10             0     41.73 ms  41.84 ms     42.12 ms    42.12 ms       0.44 ms      
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0)                 10             0     152.33 ms 152.64 ms   153.70 ms    153.73 ms      2.95 ms      
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10)                10             0     174.49 ms 174.91 ms   175.25 ms    175.25 ms      1.00 ms      
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0)                10             0     20.06 ms  20.22 ms     20.33 ms    20.33 ms       0.29 ms      
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10)               10             0     48.78 ms  49.49 ms     49.80 ms    49.82 ms       1.47 ms      
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0)                    10             0      2.98 ms   2.99 ms     3.00 ms      3.00 ms       0.02 ms      
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10)                   10             0      4.25 ms   4.37 ms     4.36 ms      4.36 ms       0.08 ms      
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0)                   10             0      0.56 ms   0.56 ms     0.57 ms      0.57 ms       0.02 ms      
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10)                  13             1      1.21 ms   1.21 ms     1.22 ms      1.22 ms       0.01 ms      
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0)                   10             0      3.65 ms   3.67 ms     3.68 ms      3.68 ms       0.03 ms      
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10)                  10             0      4.82 ms   4.86 ms     4.86 ms      4.86 ms       0.03 ms      
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0)                  10             0      0.72 ms   0.72 ms     0.73 ms      0.73 ms       0.02 ms      
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10)                 13             1      1.36 ms   1.37 ms     1.38 ms      1.38 ms       0.01 ms      
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0)                   10             0      5.71 ms   5.72 ms     5.74 ms      5.74 ms       0.03 ms      
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10)                  10             0      6.88 ms   6.92 ms     6.93 ms      6.93 ms       0.06 ms      
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0)                  10             0      1.11 ms   1.11 ms     1.12 ms      1.12 ms       0.02 ms      
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10)                 13             1      1.75 ms   1.76 ms     1.77 ms      1.77 ms       0.01 ms      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0)              13             1      0.02 ms   0.02 ms     0.03 ms      0.03 ms       0.00 ms      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10)             100            8      0.05 ms   0.06 ms     0.06 ms      0.06 ms       0.00 ms      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0)             13             1      0.03 ms   0.03 ms     0.03 ms      0.03 ms       0.00 ms      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10)            75             6      0.06 ms   0.06 ms     0.06 ms      0.06 ms       0.00 ms      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0)             25             2      0.03 ms   0.03 ms     0.03 ms      0.03 ms       0.00 ms      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10)            88             7      0.06 ms   0.06 ms     0.06 ms      0.06 ms       0.00 ms      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0)            63             5      0.04 ms   0.04 ms     0.04 ms      0.04 ms       0.00 ms      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10)           63             5      0.07 ms   0.07 ms     0.07 ms      0.07 ms       0.00 ms      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0)             13             1      0.05 ms   0.05 ms     0.05 ms      0.05 ms       0.00 ms      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10)            100            8      0.06 ms   0.07 ms     0.07 ms      0.07 ms       0.00 ms      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0)            38             3      0.05 ms   0.06 ms     0.06 ms      0.06 ms       0.00 ms      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10)           50             4      0.08 ms   0.08 ms     0.08 ms      0.08 ms       0.00 ms      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0)           10             0     36.71 ms  36.79 ms     36.98 ms    36.99 ms       0.44 ms      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10)          10             0     53.86 ms  53.94 ms     54.06 ms    54.06 ms       0.28 ms      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0)          13             1      3.89 ms   3.92 ms     3.93 ms      3.93 ms       0.04 ms      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10)         10             0     14.11 ms  14.25 ms     14.28 ms    14.28 ms       0.16 ms      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0)          10             0     48.91 ms  48.98 ms     49.02 ms    49.02 ms       0.15 ms      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10)         10             0     66.05 ms  66.24 ms     66.33 ms    66.33 ms       0.32 ms      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0)         13             1      5.02 ms   5.05 ms     5.06 ms      5.06 ms       0.03 ms      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10)        10             0     15.26 ms  15.36 ms     15.40 ms    15.41 ms       0.16 ms      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0)          10             0     79.02 ms  79.15 ms     79.19 ms    79.19 ms       0.18 ms      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10)         10             0     96.13 ms  96.40 ms     96.44 ms    96.44 ms       0.28 ms      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0)         10             0      7.54 ms   7.59 ms     7.66 ms      7.66 ms       0.22 ms      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10)        10             0     17.79 ms  18.22 ms     18.26 ms    18.27 ms       0.51 ms      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0)             10             0      1.37 ms   1.38 ms     1.39 ms      1.39 ms       0.02 ms      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10)            10             0      1.84 ms   1.85 ms     1.86 ms      1.86 ms       0.05 ms      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0)            13             1      0.20 ms   0.20 ms     0.21 ms      0.21 ms       0.01 ms      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10)           25             2      0.60 ms   0.60 ms     0.61 ms      0.61 ms       0.01 ms      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0)            10             0      1.83 ms   1.90 ms     1.90 ms      1.90 ms       0.03 ms      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10)           10             0      2.40 ms   2.45 ms     2.45 ms      2.45 ms       0.04 ms      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0)           13             1      0.26 ms   0.26 ms     0.27 ms      0.27 ms       0.01 ms      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10)          14             1      0.66 ms   0.66 ms     0.67 ms      0.67 ms       0.02 ms      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0)            10             0      2.96 ms   3.01 ms     3.02 ms      3.03 ms       0.05 ms      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10)           10             0      3.35 ms   3.47 ms     3.45 ms      3.45 ms       0.06 ms      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0)           10             0      0.39 ms   0.40 ms     0.40 ms      0.40 ms       0.01 ms      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10)          13             1      0.79 ms   0.80 ms     0.80 ms      0.80 ms       0.02 ms      

mself commented 7 years ago

OK, cool. Perhaps the reason that findHomography() has the final runKernel() call that passes all of the consensus points is to improve the starting estimate for LM so that it takes fewer iterations. I noted that it took ~3 with the runKernel() call and ~5 without it. The results appeared to be virtually identical, so it may be about performance rather than accuracy. Or maybe it is about stability, since findHomography() is a lot less numerically stable.

mself commented 7 years ago

Now that you've updated the APIs, it would be interesting to compare the performance of the analytic runKernel() with niters = 0 to remove the LM part.

hrnr commented 7 years ago

Yes, I will definitely try that. I think it could speed up LMEDS, as it takes much more time than RANSAC.

mself commented 7 years ago

I'm not so sure. I think that LMedS makes roughly the same number of calls to runKernel() as RANSAC does. I think LMedS is slower because calculating the median error is slower than calculating the average error. I did notice that LMeDSPointSetRegistrator::run() calls std::sort() rather than std::nth_element(), which could be much faster for large point sets since it only does a partial sort instead of a full sort. It's order n rather than n log n. For 64 points it could be ~5x faster (although probably less in practice). Probably worth measuring.
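The partial-sort idea can be sketched with the standard library alone. This is an illustration, not the code in LMeDSPointSetRegistrator::run(): `medianOf` is a hypothetical helper, and averaging the two middle values for an even count is an assumed convention.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median via selection instead of a full sort. The vector is taken by value
// because std::nth_element reorders its input.
double medianOf(std::vector<double> v)
{
    assert(!v.empty());
    const std::size_t mid = v.size() / 2;

    // After this call v[mid] holds the value a full sort would put there,
    // and every element before it is <= v[mid]. Average cost is linear.
    std::nth_element(v.begin(), v.begin() + mid, v.end());
    const double upper = v[mid];
    if (v.size() % 2 != 0)
        return upper;

    // Even count: the lower middle value is the largest of the first half.
    const double lower = *std::max_element(v.begin(), v.begin() + mid);
    return 0.5 * (lower + upper);
}
```

Whether this helps in practice depends on how much of the LMedS runtime the sort actually accounts for, which is why measuring it first makes sense.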

Bottom line, while it was nice to make runKernel() a lot faster, it doesn't appear to be a major factor in the performance of EstimateAffinePartial2D().

hrnr commented 7 years ago

I have tested the analytical version of the kernels. In line with the previous test, the kernels do not seem to be the bottleneck for the functions, but the analytical version is slightly faster. The analytical version is based on the suggestions of @mself (thank you), with some typos fixed. I have tuned the kernels so that more can be stored in registers and copying of the model is avoided.

I think we can include this. I have extended tests to make sure it is still correct and added some extensive comments to explain what is happening in kernels.

Commits to come.

Geometric mean

                           Name of Test                                 calib3d         calib3d         calib3d    
                                                                         posix           posix           posix     
                                                                          x64             x64             x64      
                                                                        d9a138e         2539bf1         2539bf1    
                                                                    20160812-152109 20160814-114732 20160814-114732
                                                                                                          vs       
                                                                                                        calib3d    
                                                                                                         posix     
                                                                                                          x64      
                                                                                                        d9a138e    
                                                                                                    20160812-152109
                                                                                                      (x-factor)   
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0)                 0.068 ms        0.032 ms          2.10      
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10)                0.104 ms        0.074 ms          1.40      
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0)                0.110 ms        0.029 ms          3.84      
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10)               0.156 ms        0.068 ms          2.30      
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0)                0.076 ms        0.040 ms          1.89      
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10)               0.100 ms        0.059 ms          1.70      
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0)               0.139 ms        0.036 ms          3.89      
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10)              0.174 ms        0.072 ms          2.40      
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0)                0.126 ms        0.069 ms          1.83      
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10)               0.160 ms        0.092 ms          1.73      
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0)               0.212 ms        0.052 ms          4.05      
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10)              0.247 ms        0.083 ms          2.96      
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0)                2.998 ms        2.957 ms          1.01      
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10)               4.360 ms        4.072 ms          1.07      
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0)               0.569 ms        0.485 ms          1.17      
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10)              1.217 ms        1.179 ms          1.03      
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0)               3.681 ms        3.634 ms          1.01      
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10)              4.855 ms        4.742 ms          1.02      
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0)              0.727 ms        0.618 ms          1.18      
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10)             1.377 ms        1.308 ms          1.05      
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0)               5.740 ms        5.670 ms          1.01      
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10)              6.927 ms        6.775 ms          1.02      
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0)              1.118 ms        0.954 ms          1.17      
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10)             1.767 ms        1.643 ms          1.08      
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0)              79.927 ms       79.734 ms         1.00      
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10)            138.580 ms      137.715 ms         1.01      
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0)             10.246 ms       9.905 ms          1.03      
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10)            39.036 ms       38.296 ms         1.02      
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0)             98.311 ms       97.922 ms         1.00      
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10)           157.320 ms      155.373 ms         1.01      
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0)            13.169 ms       12.682 ms         1.04      
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10)           42.121 ms       41.160 ms         1.02      
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0)            153.704 ms      152.232 ms         1.01      
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10)           175.252 ms      174.557 ms         1.00      
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0)            20.325 ms       19.530 ms         1.04      
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10)           49.804 ms       48.300 ms         1.03      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0)          0.025 ms        0.014 ms          1.80      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10)         0.056 ms        0.047 ms          1.20      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0)         0.034 ms        0.013 ms          2.58      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10)        0.060 ms        0.042 ms          1.41      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0)         0.032 ms        0.018 ms          1.77      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10)        0.063 ms        0.050 ms          1.27      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0)        0.039 ms        0.015 ms          2.60      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10)       0.067 ms        0.042 ms          1.59      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0)         0.052 ms        0.030 ms          1.75      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10)        0.066 ms        0.047 ms          1.40      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0)        0.056 ms        0.020 ms          2.76      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10)       0.085 ms        0.048 ms          1.76      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0)         1.389 ms        1.375 ms          1.01      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10)        1.864 ms        1.989 ms          0.94      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0)        0.208 ms        0.186 ms          1.11      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10)       0.607 ms        0.618 ms          0.98      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0)        1.901 ms        1.819 ms          1.05      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10)       2.454 ms        2.422 ms          1.01      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0)       0.265 ms        0.241 ms          1.10      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10)      0.671 ms        0.663 ms          1.01      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0)        3.025 ms        2.956 ms          1.02      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10)       3.445 ms        3.256 ms          1.06      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0)       0.403 ms        0.364 ms          1.11      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10)      0.805 ms        0.783 ms          1.03      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0)       36.983 ms       36.766 ms         1.01      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10)      54.060 ms       59.627 ms         0.91      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0)      3.930 ms        3.819 ms          1.03      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10)     14.283 ms       13.964 ms         1.02      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0)      49.024 ms       48.956 ms         1.00      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10)     66.325 ms       71.767 ms         0.92      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0)     5.056 ms        4.918 ms          1.03      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10)    15.405 ms       15.012 ms         1.03      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0)      79.194 ms       79.118 ms         1.00      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10)     96.442 ms      102.042 ms         0.95      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0)     7.659 ms        7.415 ms          1.03      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10)    18.260 ms       17.829 ms         1.02      
mself commented 7 years ago

That's great! Thank you for integrating this. It makes sense that the perf improvement is only apparent when the number of points is small. When there are a large number of points, the time is spent evaluating the error of the model rather than generating the model. In my application, the number of points is always <= 200, so this improvement is quite significant.

mself commented 7 years ago

A much larger performance improvement can be made for LMedS in LMeDSPointSetRegistrator::run() by replacing

                std::sort(errf.ptr<int>(), errf.ptr<int>() + count);

                double median = count % 2 != 0 ?
                errf.at<float>(count/2) : (errf.at<float>(count/2-1) + errf.at<float>(count/2))*0.5;

with

                std::nth_element(errf.ptr<int>(), errf.ptr<int>() + count/2, errf.ptr<int>() + count);
                double median = errf.at<float>(count/2);

It reduces the cost of finding the median from O(n log n) to O(n) on average, so it has the most impact on large point sets. It makes LMedS run up to 5x faster for the largest perf test:

Geometric mean

                                          Name of Test                                               calib3d         calib3d         calib3d    
                                                                                                      posix           posix           posix     
                                                                                                       x64             x64             x64      
                                                                                                     86e6f89         86e6f89         86e6f89    
                                                                                                 20160814-152408 20160814-153654 20160814-153654
                                                                                                                                       vs       
                                                                                                                                     calib3d    
                                                                                                                                      posix     
                                                                                                                                       x64      
                                                                                                                                     86e6f89    
                                                                                                                                 20160814-152408
                                                                                                                                   (x-factor)   
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0)                                              0.025 ms        0.014 ms          1.73      
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10)                                             0.057 ms        0.045 ms          1.26      
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0)                                             0.026 ms        0.026 ms          1.00      
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10)                                            0.050 ms        0.050 ms          1.00      
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0)                                             0.031 ms        0.017 ms          1.86      
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10)                                            0.047 ms        0.031 ms          1.49      
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0)                                            0.031 ms        0.032 ms          0.98      
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10)                                           0.055 ms        0.056 ms          0.98      
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0)                                             0.054 ms        0.024 ms          2.22      
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10)                                            0.071 ms        0.039 ms          1.80      
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0)                                            0.046 ms        0.046 ms          1.00      
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10)                                           0.070 ms        0.070 ms          1.00      
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0)                                             2.557 ms        0.639 ms          4.00      
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10)                                            3.544 ms        1.694 ms          2.09      
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0)                                            0.437 ms        0.440 ms          0.99      
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10)                                           0.988 ms        0.990 ms          1.00      
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0)                                            3.116 ms        0.773 ms          4.03      
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10)                                           4.101 ms        1.808 ms          2.27      
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0)                                           0.553 ms        0.556 ms          0.99      
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10)                                          1.098 ms        1.120 ms          0.98      
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0)                                            4.861 ms        1.207 ms          4.03      
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10)                                           5.808 ms        2.210 ms          2.63      
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0)                                           0.842 ms        0.852 ms          0.99      
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10)                                          1.418 ms        1.436 ms          0.99      
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0)                                           68.418 ms       12.655 ms         5.41      
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10)                                         120.279 ms       63.391 ms         1.90      
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0)                                          8.614 ms        8.565 ms          1.01      
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10)                                         23.824 ms       23.092 ms         1.03      
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0)                                          87.441 ms       15.609 ms         5.60      
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10)                                        140.613 ms       67.021 ms         2.10      
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0)                                         10.770 ms       10.712 ms         1.01      
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10)                                        26.324 ms       25.426 ms         1.04      
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0)                                         132.352 ms       22.916 ms         5.78      
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10)                                        146.041 ms       37.414 ms         3.90      
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0)                                         15.795 ms       15.887 ms         0.99      
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10)                                        30.257 ms       31.005 ms         0.98      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0)                                       0.012 ms        0.009 ms          1.36      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10)                                      0.035 ms        0.032 ms          1.10      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0)                                      0.012 ms        0.012 ms          0.98      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10)                                     0.032 ms        0.032 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0)                                      0.015 ms        0.011 ms          1.43      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10)                                     0.038 ms        0.033 ms          1.15      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0)                                     0.014 ms        0.014 ms          0.99      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10)                                    0.034 ms        0.035 ms          0.98      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0)                                      0.024 ms        0.014 ms          1.67      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10)                                     0.036 ms        0.025 ms          1.43      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0)                                     0.019 ms        0.019 ms          0.99      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10)                                    0.039 ms        0.039 ms          1.01      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0)                                      1.255 ms        0.321 ms          3.91      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10)                                     1.750 ms        0.873 ms          2.01      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0)                                     0.182 ms        0.183 ms          0.99      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10)                                    0.533 ms        0.539 ms          0.99      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0)                                     1.621 ms        0.402 ms          4.04      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10)                                    2.141 ms        0.956 ms          2.24      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0)                                    0.229 ms        0.230 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10)                                   0.580 ms        0.587 ms          0.99      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0)                                     2.560 ms        0.633 ms          4.04      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10)                                    2.809 ms        0.895 ms          3.14      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0)                                    0.346 ms        0.337 ms          1.03      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10)                                   0.685 ms        0.685 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0)                                    31.660 ms       5.983 ms          5.29      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10)                                   46.766 ms       19.892 ms         2.35      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0)                                   3.758 ms        3.526 ms          1.07      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10)                                  9.754 ms        9.619 ms          1.01      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0)                                   43.937 ms       7.550 ms          5.82      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10)                                  56.864 ms       21.212 ms         2.68      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0)                                  4.469 ms        4.478 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10)                                 10.243 ms       10.479 ms         0.98      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0)                                   67.980 ms       12.035 ms         5.65      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10)                                  81.435 ms       25.899 ms         3.14      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0)                                  6.347 ms        6.354 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10)                                 12.225 ms       12.532 ms         0.98      

With the change, LMedS is never more than ~2x slower than RANSAC. For the 100 point tests, it's now faster than RANSAC.
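The std::sort to std::nth_element swap suggested above can be sketched outside OpenCV as follows. This is only an illustration: the function names medianBySort and medianByNthElement are made up for the example, and it works on a plain std::vector<float> rather than on the errf matrix. Note that the selection variant returns the element at index count/2, so for an even count it yields the upper middle value instead of the average of the two middle values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median via a full sort: O(n log n).
// For an even number of elements the two middle values are averaged,
// as in the original snippet.
float medianBySort(std::vector<float> v)
{
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return n % 2 != 0 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) * 0.5f;
}

// Median via selection: O(n) on average.
// Only the element at index n/2 is placed in its sorted position,
// so for an even n this is the upper middle value (no averaging).
float medianByNthElement(std::vector<float> v)
{
    assert(!v.empty());
    const std::size_t mid = v.size() / 2;
    std::nth_element(v.begin(), v.begin() + mid, v.end());
    return v[mid];
}
```

For an odd number of errors both functions agree. For an even number they can differ slightly, which only matters if the exact median value (rather than its use as a ranking score between candidate models) is important.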

mself commented 7 years ago

You can also get a 5-10% overall speedup by changing Affine2DEstimatorCallback::computeError() to use float instead of double for the intermediate results. The result is a float, in any case.

Note that HomographyEstimatorCallback::computeError() already uses float like this.

        float F0 = F[0], F1 = F[1], F2 = F[2], F3 = F[3], F4 = F[4], F5 = F[5];

        for(int i = 0; i < count; i++ )
        {
            const Point2f& f = from[i];
            const Point2f& t = to[i];

            float a = F0*f.x + F1*f.y + F2 - t.x;
            float b = F3*f.x + F4*f.y + F5 - t.y;

            errptr[i] = a*a + b*b;
        }

Is there a way that this could be vectorized with SSE? That could make a really significant difference.
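For reference, the float error loop above can be tried without OpenCV using a small stand-in. This is a sketch under assumptions: Pt is a hypothetical replacement for cv::Point2f, affineErrors is a made-up name, and the model is taken to be the six coefficients of a 2x3 affine matrix in row-major order, as in the snippet.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Pt { float x, y; };  // hypothetical stand-in for cv::Point2f

// Squared reprojection error of each point pair under a 2x3 affine model F
// (row-major: [F0 F1 F2; F3 F4 F5]). Intermediates are kept in float.
std::vector<float> affineErrors(const std::vector<Pt>& from,
                                const std::vector<Pt>& to,
                                const double F[6])
{
    assert(from.size() == to.size());

    // Convert the model once, outside the loop.
    const float F0 = (float)F[0], F1 = (float)F[1], F2 = (float)F[2];
    const float F3 = (float)F[3], F4 = (float)F[4], F5 = (float)F[5];

    std::vector<float> err(from.size());
    for (std::size_t i = 0; i < from.size(); i++)
    {
        const Pt& f = from[i];
        const Pt& t = to[i];

        const float a = F0 * f.x + F1 * f.y + F2 - t.x;  // residual in x
        const float b = F3 * f.x + F4 * f.y + F5 - t.y;  // residual in y

        err[i] = a * a + b * b;
    }
    return err;
}
```

With a pure translation by (2, 3), the point (1, 1) maps to (3, 4), so a pair with that target has zero error, while a target of (0, 0) gives residuals 3 and 4 and a squared error of 25.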

mself commented 7 years ago

I tried writing an SSE version of Affine2DEstimatorCallback::computeError(), since it seems to be the bottleneck for estimateAffine2D(). The SSE version increases the overall performance of estimateAffine2D() by 10-20% in most cases compared to the float version above. In some cases, it increased overall performance by 2x!

    void computeError( InputArray _m1, InputArray _m2, InputArray _model, OutputArray _err ) const
    {
        Mat m1 = _m1.getMat(), m2 = _m2.getMat(), model = _model.getMat();
        const Point2f* from = m1.ptr<Point2f>();
        const Point2f* to   = m2.ptr<Point2f>();
        const double* F = model.ptr<double>();

        int count = m1.checkVector(2);
        CV_Assert( count > 0 );

        _err.create(count, 1, CV_32F);
        Mat err = _err.getMat();
        float* errptr = err.ptr<float>();

        float F0 = F[0], F1 = F[1], F2 = F[2], F3 = F[3], F4 = F[4], F5 = F[5];

#if CV_SSE2
        if( checkHardwareSupport(CV_CPU_SSE2))
        {
            int i;

            // Load 4 copies of each model param into registers
            const __m128 mm_F0 = _mm_set1_ps(F0), mm_F1 = _mm_set1_ps(F1), mm_F2 = _mm_set1_ps(F2);
            const __m128 mm_F3 = _mm_set1_ps(F3), mm_F4 = _mm_set1_ps(F4), mm_F5 = _mm_set1_ps(F5);

            if ((( (intptr_t)from & 0xf ) == 0) && (( (intptr_t)to & 0xf ) == 0) && (( (intptr_t)errptr & 0xf ) == 0))
            {
                // Aligned case - use _mm_load_ps() and _mm_store_ps()
                for(i = 0; i < count - 3; i += 4 )
                {
                    // Load 4 'from' points into two registers
                    const __m128 mm_from_0 = _mm_load_ps(&from[i].x);
                    const __m128 mm_from_2 = _mm_load_ps(&from[i+2].x);

                    // Shuffle the x values into one register and the y values into another
                    const __m128 mm_fx = _mm_shuffle_ps(mm_from_0, mm_from_2, _MM_SHUFFLE(2, 0, 2, 0));
                    const __m128 mm_fy = _mm_shuffle_ps(mm_from_0, mm_from_2, _MM_SHUFFLE(3, 1, 3, 1));

                    // Repeat for the 'to' points
                    const __m128 mm_to_0 = _mm_load_ps(&to[i].x);
                    const __m128 mm_to_2 = _mm_load_ps(&to[i+2].x);
                    const __m128 mm_tx = _mm_shuffle_ps(mm_to_0, mm_to_2, _MM_SHUFFLE(2, 0, 2, 0));
                    const __m128 mm_ty = _mm_shuffle_ps(mm_to_0, mm_to_2, _MM_SHUFFLE(3, 1, 3, 1));

                    // Compute error for 4 points at a time
                    const __m128 mm_a = _mm_add_ps(_mm_add_ps(_mm_mul_ps(mm_F0, mm_fx),
                                                              _mm_mul_ps(mm_F1, mm_fy)),
                                                   _mm_sub_ps(mm_F2, mm_tx));

                    const __m128 mm_b = _mm_add_ps(_mm_add_ps(_mm_mul_ps(mm_F3, mm_fx),
                                                              _mm_mul_ps(mm_F4, mm_fy)),
                                                   _mm_sub_ps(mm_F5, mm_ty));

                    // Store 4 results
                    _mm_store_ps(&errptr[i], _mm_add_ps(_mm_mul_ps(mm_a, mm_a), _mm_mul_ps(mm_b, mm_b)));
                }
            }
            else
            {
                // Unaligned case - use _mm_loadu_ps() and _mm_storeu_ps()
                for(i = 0; i < count - 3; i += 4 )
                {
                    const __m128 mm_from01 = _mm_loadu_ps(&from[i].x);
                    const __m128 mm_from23 = _mm_loadu_ps(&from[i+2].x);
                    const __m128 mm_fx = _mm_shuffle_ps(mm_from01, mm_from23, _MM_SHUFFLE(2, 0, 2, 0));
                    const __m128 mm_fy = _mm_shuffle_ps(mm_from01, mm_from23, _MM_SHUFFLE(3, 1, 3, 1));

                    const __m128 mm_to01 = _mm_loadu_ps(&to[i].x);
                    const __m128 mm_to23 = _mm_loadu_ps(&to[i+2].x);
                    const __m128 mm_tx = _mm_shuffle_ps(mm_to01, mm_to23, _MM_SHUFFLE(2, 0, 2, 0));
                    const __m128 mm_ty = _mm_shuffle_ps(mm_to01, mm_to23, _MM_SHUFFLE(3, 1, 3, 1));

                    const __m128 mm_a = _mm_add_ps(_mm_add_ps(_mm_mul_ps(mm_F0, mm_fx),
                                                              _mm_mul_ps(mm_F1, mm_fy)),
                                                   _mm_sub_ps(mm_F2, mm_tx));

                    const __m128 mm_b = _mm_add_ps(_mm_add_ps(_mm_mul_ps(mm_F3, mm_fx),
                                                              _mm_mul_ps(mm_F4, mm_fy)),
                                                   _mm_sub_ps(mm_F5, mm_ty));

                    _mm_storeu_ps(&errptr[i], _mm_add_ps(_mm_mul_ps(mm_a, mm_a), _mm_mul_ps(mm_b, mm_b)));
                }
            }

            // Finish any remaining points
            for( ; i < count; i++ )
            {
                const Point2f& f = from[i];
                const Point2f& t = to[i];

                float a = F0*f.x + F1*f.y + F2 - t.x;
                float b = F3*f.x + F4*f.y + F5 - t.y;

                errptr[i] = a*a + b*b;
            }
        }
        else
#endif
        {
            for(int i = 0; i < count; i++ )
            {
                const Point2f& f = from[i];
                const Point2f& t = to[i];

                float a = F0*f.x + F1*f.y + F2 - t.x;
                float b = F3*f.x + F4*f.y + F5 - t.y;

                errptr[i] = a*a + b*b;
            }
        }
    }
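The step that is easiest to get wrong in the code above is the shuffle that splits interleaved x/y pairs into separate registers. Here is a minimal sketch of just that step. The function name deinterleave4 and the HAVE_SSE_SKETCH macro are made up for this example, and a scalar fallback is included so it also builds where SSE is not available.

```cpp
#include <cassert>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_SKETCH 1  // hypothetical macro, local to this sketch
#endif

// Split four interleaved points [x0 y0 x1 y1 x2 y2 x3 y3]
// into xs = [x0 x1 x2 x3] and ys = [y0 y1 y2 y3].
void deinterleave4(const float* pts, float* xs, float* ys)
{
    assert(pts && xs && ys);
#ifdef HAVE_SSE_SKETCH
    const __m128 p01 = _mm_loadu_ps(pts);      // x0 y0 x1 y1
    const __m128 p23 = _mm_loadu_ps(pts + 4);  // x2 y2 x3 y3

    // _mm_shuffle_ps takes the two low result lanes from the first operand
    // and the two high result lanes from the second operand.
    // _MM_SHUFFLE(2, 0, 2, 0) selects lanes 0 and 2 of each: the x values.
    // _MM_SHUFFLE(3, 1, 3, 1) selects lanes 1 and 3 of each: the y values.
    _mm_storeu_ps(xs, _mm_shuffle_ps(p01, p23, _MM_SHUFFLE(2, 0, 2, 0)));
    _mm_storeu_ps(ys, _mm_shuffle_ps(p01, p23, _MM_SHUFFLE(3, 1, 3, 1)));
#else
    // Scalar fallback with the same result.
    for (int i = 0; i < 4; i++)
    {
        xs[i] = pts[2 * i];
        ys[i] = pts[2 * i + 1];
    }
#endif
}
```

The sketch uses the unaligned load and store intrinsics throughout, so the pointers do not need to be 16-byte aligned.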

Here are the perf results:

Geometric mean

                           Name of Test                                calib3d         calib3d         calib3d         calib3d    
                                                                        posix           posix           posix           posix     
                                                                         x64             x64             x64             x64      
                                                                       86e6f89         86e6f89         86e6f89         86e6f89    
                                                                    20160814-nosse 20160815-005157 20160815-005157 20160815-005157
                                                                                                         vs              vs       
                                                                                                       calib3d         calib3d    
                                                                                                        posix           posix     
                                                                                                         x64             x64      
                                                                                                       86e6f89         86e6f89    
                                                                                                   20160814-nosse  20160814-nosse 
                                                                                                     (x-factor)        (score)    
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0)                 0.014 ms       0.013 ms          1.12           FASTER     
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10)                0.042 ms       0.041 ms          1.03                      
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0)                0.024 ms       0.019 ms          1.22           FASTER     
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10)               0.047 ms       0.044 ms          1.08           FASTER     
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0)                0.016 ms       0.014 ms          1.16           FASTER     
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10)               0.031 ms       0.029 ms          1.08           faster     
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0)               0.029 ms       0.023 ms          1.22           FASTER     
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10)              0.053 ms       0.048 ms          1.10           FASTER     
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0)                0.024 ms       0.020 ms          1.19           FASTER     
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10)               0.039 ms       0.036 ms          1.08           faster     
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0)               0.042 ms       0.034 ms          1.24           FASTER     
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10)              0.067 ms       0.058 ms          1.14           FASTER     
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0)                0.578 ms       0.519 ms          1.11           FASTER     
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10)               1.618 ms       1.552 ms          1.04                      
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0)               0.393 ms       0.197 ms          1.99           FASTER     
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10)              0.940 ms       0.746 ms          1.26           FASTER     
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0)               0.730 ms       0.626 ms          1.17           FASTER     
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10)              1.757 ms       1.674 ms          1.05                      
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0)              0.495 ms       0.245 ms          2.02           FASTER     
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10)             1.045 ms       0.793 ms          1.32           FASTER     
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0)               1.141 ms       0.978 ms          1.17           FASTER     
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10)              1.612 ms       1.478 ms          1.09           faster     
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0)              0.756 ms       0.364 ms          2.08           FASTER     
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10)             1.303 ms       0.916 ms          1.42           FASTER     
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0)             11.808 ms       10.525 ms         1.12           faster     
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10)            59.428 ms       58.376 ms         1.02                      
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0)             7.486 ms       3.913 ms          1.91           FASTER     
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10)           22.308 ms       18.235 ms         1.22           FASTER     
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0)            14.539 ms       13.032 ms         1.12           faster     
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10)           64.788 ms       64.195 ms         1.01                      
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0)            9.245 ms       4.714 ms          1.96           FASTER     
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10)          24.037 ms       20.133 ms         1.19           faster     
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0)            21.544 ms       19.227 ms         1.12           faster     
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10)           35.347 ms       33.207 ms         1.06           faster     
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0)           13.720 ms       6.850 ms          2.00           FASTER     
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10)          28.917 ms       21.459 ms         1.35           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0)          0.009 ms       0.008 ms          1.14           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10)         0.032 ms       0.031 ms          1.04                      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0)         0.011 ms       0.010 ms          1.16           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10)        0.031 ms       0.030 ms          1.05           faster     
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0)         0.011 ms       0.009 ms          1.16           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10)        0.033 ms       0.032 ms          1.05                      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0)        0.013 ms       0.011 ms          1.17           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10)       0.033 ms       0.031 ms          1.05           faster     
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0)         0.014 ms       0.012 ms          1.15           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10)        0.026 ms       0.025 ms          1.07           faster     
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0)        0.018 ms       0.015 ms          1.21           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10)       0.038 ms       0.035 ms          1.09           faster     
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0)         0.305 ms       0.260 ms          1.17           FASTER     
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10)        0.834 ms       0.804 ms          1.04                      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0)        0.167 ms       0.093 ms          1.80           FASTER     
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10)       0.514 ms       0.447 ms          1.15           FASTER     
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0)        0.385 ms       0.330 ms          1.17           FASTER     
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10)       0.914 ms       0.863 ms          1.06           faster     
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0)       0.207 ms       0.112 ms          1.85           FASTER     
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10)      0.557 ms       0.459 ms          1.21           FASTER     
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0)        0.611 ms       0.517 ms          1.18           FASTER     
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10)       0.857 ms       0.763 ms          1.12           FASTER     
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0)       0.300 ms       0.155 ms          1.94           FASTER     
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10)      0.654 ms       0.508 ms          1.29           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0)       5.817 ms       5.101 ms          1.14           faster     
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10)     19.439 ms       18.683 ms         1.04                      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0)      3.140 ms       1.789 ms          1.76           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10)     9.077 ms       7.911 ms          1.15           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0)      7.418 ms       6.648 ms          1.12           faster     
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10)    21.130 ms       19.977 ms         1.06                      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0)     3.923 ms       2.171 ms          1.81           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10)    9.810 ms       8.307 ms          1.18           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0)     11.445 ms       10.283 ms         1.11           faster     
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10)    25.540 ms       23.906 ms         1.07           faster     
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0)     5.537 ms       2.965 ms          1.87           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10)   11.431 ms       9.099 ms          1.26           FASTER     

I haven't written an SSE function before, so someone with more experience might be able to improve it. I didn't see much support for AVX in OpenCV, but that could increase the stride from 4 to 8 points per iteration.

hrnr commented 7 years ago

The LMEDS changes look great; sorting just to get the median is insane. This definitely belongs in another PR, and it might finally make LMEDS usable. I think nobody noticed only because there are no perf tests for findHomography and friends using LMEDS.
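
To illustrate the point about sorting: a median can be found by partial selection, which is linear on average, instead of a full sort. The sketch below uses `std::nth_element` from the C++ standard library. `medianError` is a hypothetical helper name, not a function from this PR or from OpenCV, and it assumes the residuals are already available as a non-empty vector of floats.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of the residuals without a full sort.
// The vector is taken by value because std::nth_element reorders its input.
float medianError(std::vector<float> errors)
{
    assert(!errors.empty());
    const std::size_t mid = errors.size() / 2;

    // After this call errors[mid] holds the value it would have in sorted
    // order, and every element before it is less than or equal to it.
    std::nth_element(errors.begin(), errors.begin() + mid, errors.end());
    const float upper = errors[mid];
    if (errors.size() % 2 == 1)
        return upper;

    // Even count: the lower middle value is the largest element of the
    // left partition, so one linear scan finds it.
    const float lower = *std::max_element(errors.begin(), errors.begin() + mid);
    return 0.5f * (lower + upper);
}
```

Taking the vector by value keeps the caller's data untouched at the cost of one copy; an in-place variant would avoid the copy if the caller does not care about element order.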

Computing the error in float seems OK: I tested it and it had no impact on precision, at least in my case. For the vectorized version, I think we should use universal intrinsics (CV_SIMD128) so that NEON is supported as well.

hrnr commented 7 years ago

I was playing with the vectorized version and I can't reproduce your results. This is what I got for your version:


Geometric mean

                           Name of Test                                 calib3d         calib3d         calib3d    
                                                                         posix           posix           posix     
                                                                          x64             x64             x64      
                                                                        86e6f89         18b2e2d         18b2e2d    
                                                                    20160815-165658 20160815-171110 20160815-171110
                                                                                                          vs       
                                                                                                        calib3d    
                                                                                                         posix     
                                                                                                          x64      
                                                                                                        86e6f89    
                                                                                                    20160815-165658
                                                                                                      (x-factor)   
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0)                 0.028 ms        0.029 ms          0.98      
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10)                0.066 ms        0.070 ms          0.94      
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0)                0.022 ms        0.023 ms          0.98      
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10)               0.055 ms        0.055 ms          1.00      
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0)                0.032 ms        0.036 ms          0.91      
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10)               0.054 ms        0.054 ms          1.00      
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0)               0.027 ms        0.028 ms          0.97      
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10)              0.059 ms        0.064 ms          0.92      
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0)                0.063 ms        0.064 ms          0.99      
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10)               0.081 ms        0.086 ms          0.94      
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0)               0.039 ms        0.040 ms          0.98      
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10)              0.071 ms        0.072 ms          0.98      
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0)                2.776 ms        2.786 ms          1.00      
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10)               3.940 ms        3.947 ms          1.00      
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0)               0.244 ms        0.284 ms          0.86      
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10)              0.911 ms        0.914 ms          1.00      
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0)               3.393 ms        3.400 ms          1.00      
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10)              4.557 ms        4.701 ms          0.97      
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0)              0.302 ms        0.351 ms          0.86      
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10)             0.966 ms        1.033 ms          0.94      
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0)               5.320 ms        5.320 ms          1.00      
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10)              5.918 ms        5.905 ms          1.00      
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0)              0.442 ms        0.492 ms          0.90      
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10)             1.099 ms        1.121 ms          0.98      
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0)              81.415 ms       81.188 ms         1.00      
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10)            137.124 ms      139.951 ms         0.98      
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0)             5.078 ms        5.587 ms          0.91      
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10)            33.010 ms       34.542 ms         0.96      
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0)             99.826 ms       99.712 ms         1.00      
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10)           155.981 ms      158.494 ms         0.98      
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0)            6.331 ms        6.866 ms          0.92      
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10)           34.323 ms       35.480 ms         0.97      
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0)            151.042 ms      150.811 ms         1.00      
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10)           173.142 ms      173.321 ms         1.00      
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0)            9.441 ms        10.335 ms         0.91      
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10)           37.718 ms       40.089 ms         0.94      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0)          0.013 ms        0.013 ms          1.05      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10)         0.045 ms        0.044 ms          1.03      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0)         0.012 ms        0.010 ms          1.11      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10)        0.038 ms        0.040 ms          0.95      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0)         0.016 ms        0.017 ms          0.97      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10)        0.048 ms        0.048 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0)        0.013 ms        0.012 ms          1.06      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10)       0.040 ms        0.040 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0)         0.027 ms        0.027 ms          0.97      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10)        0.044 ms        0.045 ms          0.98      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0)        0.017 ms        0.016 ms          1.01      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10)       0.044 ms        0.044 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0)         1.242 ms        1.249 ms          0.99      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10)        1.811 ms        1.890 ms          0.96      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0)        0.116 ms        0.125 ms          0.92      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10)       0.520 ms        0.495 ms          1.05      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0)        1.712 ms        1.704 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10)       2.265 ms        2.377 ms          0.95      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0)       0.139 ms        0.151 ms          0.92      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10)      0.544 ms        0.522 ms          1.04      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0)        2.764 ms        2.769 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10)       3.057 ms        3.093 ms          0.99      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0)       0.190 ms        0.210 ms          0.91      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10)      0.595 ms        0.578 ms          1.03      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0)       35.952 ms       35.912 ms         1.00      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10)      58.530 ms       58.241 ms         1.00      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0)      2.346 ms        2.528 ms          0.93      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10)     12.419 ms       12.831 ms         0.97      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0)      48.274 ms       48.219 ms         1.00      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10)     70.782 ms       70.575 ms         1.00      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0)     2.852 ms        3.058 ms          0.93      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10)    12.931 ms       13.398 ms         0.97      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0)      77.203 ms       77.853 ms         0.99      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10)     99.982 ms       99.386 ms         1.01      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0)     3.953 ms        4.266 ms          0.93      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10)    14.084 ms       14.528 ms         0.97      

which is not very convincing.
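
For reading the table above: the x-factor column appears to be the baseline time divided by the new time, so values below 1.00 mean the second build is slower (for example 0.244 ms / 0.284 ms ≈ 0.86). A minimal sketch of that arithmetic follows, with hypothetical helper names. The geometric-mean helper assumes that the "Geometric mean" heading refers to how repeated samples of one test are aggregated, which the report itself does not state.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Speed-up of the candidate build relative to the baseline build.
// Values above 1 mean the candidate is faster, values below 1 slower.
double xFactor(double baselineMs, double candidateMs)
{
    assert(candidateMs > 0.0);
    return baselineMs / candidateMs;
}

// Geometric mean of positive samples, computed through logarithms
// to avoid overflow when many samples are multiplied together.
double geometricMean(const std::vector<double>& samples)
{
    assert(!samples.empty());
    double logSum = 0.0;
    for (double s : samples) {
        assert(s > 0.0);
        logSum += std::log(s);
    }
    return std::exp(logSum / static_cast<double>(samples.size()));
}
```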

Perhaps my compiler vectorizes the original loop better than we humans can? I'm using gcc 6.1.1. Or could it be my machine?

I have implemented my own vectorized version, but it is also slower: 18b2e2de112f7780e604caa99b3ac6da685421ff.
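
For reference, the kind of plain loop under discussion can be sketched as follows. This is a reconstruction from the description in this thread (a 2x3 affine matrix applied to each source point, squared distance to the target point, arithmetic in float), not the exact code from the PR. `Point2f` is a local stand-in for `cv::Point2f`, and `affineSquaredErrors` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point2f { float x, y; };  // stand-in for cv::Point2f (assumption)

// Squared reprojection error of each point pair under the 2x3 affine
// matrix M, stored row-major as six doubles.
std::vector<float> affineSquaredErrors(const std::vector<Point2f>& from,
                                       const std::vector<Point2f>& to,
                                       const double M[6])
{
    assert(from.size() == to.size());

    // Convert the coefficients to float once, outside the loop.
    const float f0 = static_cast<float>(M[0]), f1 = static_cast<float>(M[1]),
                f2 = static_cast<float>(M[2]), f3 = static_cast<float>(M[3]),
                f4 = static_cast<float>(M[4]), f5 = static_cast<float>(M[5]);

    std::vector<float> err(from.size());
    for (std::size_t i = 0; i < from.size(); ++i) {
        const float dx = f0 * from[i].x + f1 * from[i].y + f2 - to[i].x;
        const float dy = f3 * from[i].x + f4 * from[i].y + f5 - to[i].y;
        err[i] = dx * dx + dy * dy;
    }
    return err;
}
```

Each iteration is independent of the others, which is the property that lets a compiler vectorize such a loop on its own when optimizations are enabled. Whether it actually does so depends on the compiler, its version, and the build flags.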

mself commented 7 years ago

I re-ran the perf test to double check and got the same results. I am running on a Mac with an Intel Core i5. I used gdb to disassemble the plain version and it was not using any vector instructions. Perhaps that's not enabled for me (or the compiler is choosing not to vectorize this loop for some reason).

So either the plain version is faster on your compiler so there is no difference for you, or the vector version isn't running properly for you. Is ENABLE_SSE2 set in CMake on your system? I tried putting a couple of printfs into the vector version, which verified that the aligned version was always being used for me.
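
Besides printfs, one way to see what the compiler was allowed to target is to check its predefined macros. The sketch below relies on `__AVX2__`, `__AVX__` and `__SSE2__` (GCC/Clang), `_M_X64` and `_M_IX86_FP` (MSVC), and `__ARM_NEON`. `compiledSimdBaseline` is a hypothetical helper, and how the `ENABLE_SSE2` CMake option maps onto these macros in a given OpenCV build is not verified here.

```cpp
#include <string>

// Hypothetical helper: reports the widest of the checked SIMD instruction
// sets that the compiler was allowed to target for this translation unit.
std::string compiledSimdBaseline()
{
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return "SSE2";
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    return "NEON";
#else
    return "none";
#endif
}
```

Printing the result once from the perf test binary would show whether the two machines were built with the same baseline. Note that SSE2 is part of the x86-64 baseline, so a 64-bit x86 build should never report "none".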

hrnr commented 7 years ago

Interesting. Do you also get some improvement with the 18b2e2de112f7780e604caa99b3ac6da685421ff version?

I will try to test on a different machine with a different compiler.

mself commented 7 years ago

I ran your SIMD version and got a similar speed boost to what I got from the SSE version (up to 2x faster!). I also made a tweak to it (see the line comments) that helped another ~10% on some tests. If that version is no slower for you, then perhaps we can include it for those users who are seeing the benefit. It would be good to understand why you don't see any boost.

hrnr commented 7 years ago

OK, I have also tested on a Xeon E7540 with the older gcc 4.9.3. Even the older gcc can vectorize the loop for me; I got roughly:

 175:   f2 41 0f 5a 5d 08       cvtsd2ss 0x8(%r13),%xmm3
 17b:   66 45 0f ef c0          pxor   %xmm8,%xmm8
 180:   f2 41 0f 5a 6d 10       cvtsd2ss 0x10(%r13),%xmm5
 186:   f2 41 0f 5a 75 18       cvtsd2ss 0x18(%r13),%xmm6
 18c:   f2 41 0f 5a 7d 20       cvtsd2ss 0x20(%r13),%xmm7
 192:   f2 45 0f 5a 45 28       cvtsd2ss 0x28(%r13),%xmm8
 198:   0f 8e 6f 02 00 00       jle    40d <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x40d>
 19e:   49 63 cc                movslq %r12d,%rcx
 1a1:   48 8d 04 cd 00 00 00    lea    0x0(,%rcx,8),%rax
 1a8:   00 
 1a9:   48 8d 34 8a             lea    (%rdx,%rcx,4),%rsi
 1ad:   48 8d 0c 03             lea    (%rbx,%rax,1),%rcx
 1b1:   48 39 f3                cmp    %rsi,%rbx
 1b4:   40 0f 93 c7             setae  %dil
 1b8:   48 39 ca                cmp    %rcx,%rdx
 1bb:   0f 93 c1                setae  %cl
 1be:   09 f9                   or     %edi,%ecx
 1c0:   48 8b bd 28 fe ff ff    mov    -0x1d8(%rbp),%rdi
 1c7:   48 39 f7                cmp    %rsi,%rdi
 1ca:   40 0f 93 c6             setae  %sil
 1ce:   48 01 f8                add    %rdi,%rax
 1d1:   48 39 c2                cmp    %rax,%rdx
 1d4:   0f 93 c0                setae  %al
 1d7:   09 f0                   or     %esi,%eax
 1d9:   84 c1                   test   %al,%cl
 1db:   0f 84 a7 08 00 00       je     a88 <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0xa88>
 1e1:   41 83 fc 03             cmp    $0x3,%r12d
 1e5:   0f 86 9d 08 00 00       jbe    a88 <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0xa88>
 1eb:   0f 28 d7                movaps %xmm7,%xmm2
 1ee:   41 0f 28 c8             movaps %xmm8,%xmm1
 1f2:   41 8d 74 24 fc          lea    -0x4(%r12),%esi
 1f7:   44 0f 28 f4             movaps %xmm4,%xmm14
 1fb:   0f c6 d2 00             shufps $0x0,%xmm2,%xmm2
 1ff:   4c 8b 95 28 fe ff ff    mov    -0x1d8(%rbp),%r10
 206:   0f c6 c9 00             shufps $0x0,%xmm1,%xmm1
 20a:   c1 ee 02                shr    $0x2,%esi
 20d:   44 0f 28 eb             movaps %xmm3,%xmm13
 211:   83 c6 01                add    $0x1,%esi
 214:   44 0f 28 e5             movaps %xmm5,%xmm12
 218:   8d 0c b5 00 00 00 00    lea    0x0(,%rsi,4),%ecx
 21f:   44 0f 28 de             movaps %xmm6,%xmm11
 223:   31 c0                   xor    %eax,%eax
 225:   45 0f c6 f6 00          shufps $0x0,%xmm14,%xmm14
 22a:   31 ff                   xor    %edi,%edi
 22c:   45 0f c6 ed 00          shufps $0x0,%xmm13,%xmm13
 231:   45 0f c6 e4 00          shufps $0x0,%xmm12,%xmm12
 236:   45 0f c6 db 00          shufps $0x0,%xmm11,%xmm11
 23b:   0f 29 95 f0 fd ff ff    movaps %xmm2,-0x210(%rbp)
 242:   0f 29 8d 00 fe ff ff    movaps %xmm1,-0x200(%rbp)
 249:   83 c7 01                add    $0x1,%edi
 24c:   0f 10 4c 43 10          movups 0x10(%rbx,%rax,2),%xmm1
 251:   0f 10 04 43             movups (%rbx,%rax,2),%xmm0
 255:   44 0f 28 c8             movaps %xmm0,%xmm9
 259:   0f c6 c1 dd             shufps $0xdd,%xmm1,%xmm0
 25d:   41 0f 10 14 42          movups (%r10,%rax,2),%xmm2
 262:   44 0f c6 c9 88          shufps $0x88,%xmm1,%xmm9
 267:   45 0f 10 54 42 10       movups 0x10(%r10,%rax,2),%xmm10
 26d:   0f 28 c8                movaps %xmm0,%xmm1
 270:   0f 59 85 f0 fd ff ff    mulps  -0x210(%rbp),%xmm0
 277:   45 0f 28 f9             movaps %xmm9,%xmm15
 27b:   45 0f 59 cb             mulps  %xmm11,%xmm9
 27f:   45 0f 59 fe             mulps  %xmm14,%xmm15
 283:   41 0f 59 cd             mulps  %xmm13,%xmm1
 287:   44 0f 58 c8             addps  %xmm0,%xmm9
 28b:   41 0f 58 cf             addps  %xmm15,%xmm1
 28f:   44 0f 28 fa             movaps %xmm2,%xmm15
 293:   44 0f 58 8d 00 fe ff    addps  -0x200(%rbp),%xmm9
 29a:   ff 
 29b:   41 0f c6 d2 dd          shufps $0xdd,%xmm10,%xmm2
 2a0:   45 0f c6 fa 88          shufps $0x88,%xmm10,%xmm15
 2a5:   41 0f 58 cc             addps  %xmm12,%xmm1
 2a9:   44 0f 5c ca             subps  %xmm2,%xmm9
 2ad:   41 0f 5c cf             subps  %xmm15,%xmm1
 2b1:   45 0f 59 c9             mulps  %xmm9,%xmm9
 2b5:   0f 59 c9                mulps  %xmm1,%xmm1
 2b8:   41 0f 58 c9             addps  %xmm9,%xmm1
 2bc:   0f 11 0c 02             movups %xmm1,(%rdx,%rax,1)
 2c0:   48 83 c0 10             add    $0x10,%rax
 2c4:   39 fe                   cmp    %edi,%esi
 2c6:   77 81                   ja     249 <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x249>
 2c8:   41 39 cc                cmp    %ecx,%r12d
 2cb:   0f 84 3c 01 00 00       je     40d <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x40d>
 2d1:   48 63 f1                movslq %ecx,%rsi
 2d4:   4c 8b 9d 28 fe ff ff    mov    -0x1d8(%rbp),%r11
 2db:   48 8d 04 f5 00 00 00    lea    0x0(,%rsi,8),%rax
 2e2:   00 
 2e3:   48 8d 3c 03             lea    (%rbx,%rax,1),%rdi
 2e7:   4c 01 d8                add    %r11,%rax
 2ea:   f3 0f 10 07             movss  (%rdi),%xmm0
 2ee:   f3 0f 10 4f 04          movss  0x4(%rdi),%xmm1
 2f3:   44 0f 28 c8             movaps %xmm0,%xmm9
 2f7:   0f 28 d1                movaps %xmm1,%xmm2
 2fa:   f3 0f 59 c6             mulss  %xmm6,%xmm0
 2fe:   f3 0f 59 cf             mulss  %xmm7,%xmm1
 302:   f3 44 0f 59 cc          mulss  %xmm4,%xmm9
 307:   f3 0f 59 d3             mulss  %xmm3,%xmm2
 30b:   f3 0f 58 c8             addss  %xmm0,%xmm1
 30f:   f3 41 0f 58 d1          addss  %xmm9,%xmm2
 314:   f3 41 0f 58 c8          addss  %xmm8,%xmm1
 319:   f3 0f 58 d5             addss  %xmm5,%xmm2
 31d:   0f 28 c1                movaps %xmm1,%xmm0
 320:   f3 0f 5c 10             subss  (%rax),%xmm2
 324:   f3 0f 5c 40 04          subss  0x4(%rax),%xmm0
 329:   f3 0f 59 d2             mulss  %xmm2,%xmm2
 32d:   f3 0f 59 c0             mulss  %xmm0,%xmm0
 331:   f3 0f 58 c2             addss  %xmm2,%xmm0
 335:   f3 0f 11 04 b2          movss  %xmm0,(%rdx,%rsi,4)
 33a:   8d 71 01                lea    0x1(%rcx),%esi
 33d:   41 39 f4                cmp    %esi,%r12d
 340:   0f 8e c7 00 00 00       jle    40d <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x40d>
 346:   48 63 f6                movslq %esi,%rsi
 349:   83 c1 02                add    $0x2,%ecx
 34c:   48 8d 04 f5 00 00 00    lea    0x0(,%rsi,8),%rax
 353:   00 
 354:   48 8d 3c 03             lea    (%rbx,%rax,1),%rdi
 358:   4c 01 d8                add    %r11,%rax
 35b:   41 39 cc                cmp    %ecx,%r12d
 35e:   f3 0f 10 07             movss  (%rdi),%xmm0
 362:   f3 0f 10 4f 04          movss  0x4(%rdi),%xmm1
 367:   44 0f 28 c8             movaps %xmm0,%xmm9
 36b:   0f 28 d1                movaps %xmm1,%xmm2
 36e:   f3 0f 59 c6             mulss  %xmm6,%xmm0
 372:   f3 0f 59 cf             mulss  %xmm7,%xmm1
 376:   f3 44 0f 59 cc          mulss  %xmm4,%xmm9
 37b:   f3 0f 59 d3             mulss  %xmm3,%xmm2
 37f:   f3 0f 58 c8             addss  %xmm0,%xmm1
 383:   f3 41 0f 58 d1          addss  %xmm9,%xmm2
 388:   f3 41 0f 58 c8          addss  %xmm8,%xmm1
 38d:   f3 0f 58 d5             addss  %xmm5,%xmm2
 391:   f3 0f 5c 48 04          subss  0x4(%rax),%xmm1
 396:   f3 0f 5c 10             subss  (%rax),%xmm2
 39a:   0f 28 c1                movaps %xmm1,%xmm0
 39d:   f3 0f 59 d2             mulss  %xmm2,%xmm2
 3a1:   f3 0f 59 c1             mulss  %xmm1,%xmm0
 3a5:   f3 0f 58 c2             addss  %xmm2,%xmm0
 3a9:   f3 0f 11 04 b2          movss  %xmm0,(%rdx,%rsi,4)
 3ae:   7e 5d                   jle    40d <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x40d>
 3b0:   48 63 c9                movslq %ecx,%rcx
 3b3:   48 8d 04 cd 00 00 00    lea    0x0(,%rcx,8),%rax
 3ba:   00 
 3bb:   48 01 c3                add    %rax,%rbx
 3be:   48 03 85 28 fe ff ff    add    -0x1d8(%rbp),%rax
 3c5:   f3 0f 10 0b             movss  (%rbx),%xmm1
 3c9:   f3 0f 10 43 04          movss  0x4(%rbx),%xmm0
 3ce:   f3 0f 59 e1             mulss  %xmm1,%xmm4
 3d2:   f3 0f 59 d8             mulss  %xmm0,%xmm3
 3d6:   f3 0f 59 f1             mulss  %xmm1,%xmm6
 3da:   f3 0f 59 f8             mulss  %xmm0,%xmm7
 3de:   f3 0f 58 dc             addss  %xmm4,%xmm3
 3e2:   f3 0f 58 fe             addss  %xmm6,%xmm7
 3e6:   f3 0f 58 eb             addss  %xmm3,%xmm5
 3ea:   f3 44 0f 58 c7          addss  %xmm7,%xmm8
 3ef:   f3 0f 5c 28             subss  (%rax),%xmm5
 3f3:   f3 44 0f 5c 40 04       subss  0x4(%rax),%xmm8
 3f9:   f3 0f 59 ed             mulss  %xmm5,%xmm5
 3fd:   f3 45 0f 59 c0          mulss  %xmm8,%xmm8
 402:   f3 44 0f 58 c5          addss  %xmm5,%xmm8
 407:   f3 44 0f 11 04 8a       movss  %xmm8,(%rdx,%rcx,4)

for our function. It uses only SSE, but IMHO it did quite a good job and unrolled the loop quite aggressively. Even we can't beat it:

Geometric mean

                           Name of Test                                 calib3d         calib3d         calib3d    
                                                                         posix           posix           posix     
                                                                          x64             x64             x64      
                                                                        18b2e2d         d56de49         d56de49    
                                                                    20160815-201729 20160815-204020 20160815-204020
                                                                                                          vs       
                                                                                                        calib3d    
                                                                                                         posix     
                                                                                                          x64      
                                                                                                        18b2e2d    
                                                                                                    20160815-201729
                                                                                                      (x-factor)   
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0)                 0.063 ms        0.064 ms          0.99      
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10)                0.136 ms        0.140 ms          0.97      
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0)                0.041 ms        0.042 ms          0.97      
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10)               0.106 ms        0.109 ms          0.98      
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0)                0.076 ms        0.077 ms          0.99      
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10)               0.116 ms        0.118 ms          0.98      
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0)               0.051 ms        0.052 ms          0.97      
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10)              0.115 ms        0.119 ms          0.97      
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0)                0.122 ms        0.121 ms          1.00      
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10)               0.161 ms        0.164 ms          0.98      
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0)               0.074 ms        0.077 ms          0.96      
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10)              0.139 ms        0.143 ms          0.97      
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0)                4.610 ms        4.644 ms          0.99      
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10)               6.933 ms        6.949 ms          1.00      
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0)               0.755 ms        0.787 ms          0.96      
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10)              2.192 ms        2.226 ms          0.98      
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0)               5.703 ms        5.583 ms          1.02      
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10)              7.971 ms        7.961 ms          1.00      
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0)              0.953 ms        0.987 ms          0.97      
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10)             2.373 ms        2.411 ms          0.98      
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0)               8.842 ms        8.779 ms          1.01      
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10)              9.989 ms        9.963 ms          1.00      
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0)              1.445 ms        1.495 ms          0.97      
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10)             2.852 ms        2.931 ms          0.97      
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0)             132.902 ms      131.835 ms         1.01      
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10)            196.302 ms      196.587 ms         1.00      
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0)             14.889 ms       15.431 ms         0.96      
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10)            45.945 ms       46.190 ms         0.99      
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0)            162.923 ms      162.030 ms         1.01      
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10)           227.114 ms      226.551 ms         1.00      
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0)            18.786 ms       19.563 ms         0.96      
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10)           49.712 ms       50.631 ms         0.98      
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0)            247.388 ms      245.191 ms         1.01      
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10)           270.449 ms      267.962 ms         1.01      
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0)            28.617 ms       29.760 ms         0.96      
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10)           59.735 ms       61.002 ms         0.98      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0)          0.034 ms        0.034 ms          0.99      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10)         0.095 ms        0.097 ms          0.98      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0)         0.019 ms        0.019 ms          0.98      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10)        0.073 ms        0.076 ms          0.96      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0)         0.043 ms        0.042 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10)        0.105 ms        0.107 ms          0.98      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0)        0.023 ms        0.023 ms          0.99      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10)       0.076 ms        0.079 ms          0.96      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0)         0.065 ms        0.065 ms          1.01      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10)        0.098 ms        0.098 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0)        0.031 ms        0.031 ms          0.99      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10)       0.085 ms        0.087 ms          0.98      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0)         2.050 ms        2.046 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10)        3.307 ms        3.242 ms          1.02      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0)        0.323 ms        0.337 ms          0.96      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10)       1.140 ms        1.143 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0)        2.794 ms        2.818 ms          0.99      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10)       4.080 ms        3.965 ms          1.03      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0)       0.404 ms        0.418 ms          0.97      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10)      1.203 ms        1.191 ms          1.01      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0)        4.566 ms        4.530 ms          1.01      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10)       5.180 ms        5.149 ms          1.01      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0)       0.578 ms        0.604 ms          0.96      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10)      1.383 ms        1.424 ms          0.97      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0)       58.952 ms       58.513 ms         1.01      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10)      84.726 ms       84.658 ms         1.00      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0)      6.283 ms        6.559 ms          0.96      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10)     19.711 ms       19.652 ms         1.00      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0)      79.084 ms       78.624 ms         1.01      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10)    104.451 ms      104.073 ms         1.00      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0)     7.823 ms        8.225 ms          0.95      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10)    20.907 ms       21.467 ms         0.97      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0)     126.398 ms      125.661 ms         1.01      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10)    152.708 ms      151.957 ms         1.00      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0)     11.366 ms       11.774 ms         0.97      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10)    24.299 ms       25.473 ms         0.95      

But on the other hand we are at least not slower.

I'm not against including this, if you can confirm the speedup. I think we can call it optimal if we match the speed of gcc :) It can also run on NEON, and compilers didn't use to be that good there, but I don't have an ARM device to test it on.

BTW what compiler are you using?

hrnr commented 7 years ago

I have also tested on my laptop with gcc 6.1.1 and a Core i5-2520M, and we are ~15% slower than gcc in the cases where error computation matters the most (RANSAC without refining):

 Geometric mean

                           Name of Test                                 calib3d         calib3d         calib3d    
                                                                         posix           posix           posix     
                                                                          x64             x64             x64      
                                                                        150daa2         150daa2         150daa2    
                                                                    20160815-214220 20160815-213642 20160815-213642
                                                                                                          vs       
                                                                                                        calib3d    
                                                                                                         posix     
                                                                                                          x64      
                                                                                                        150daa2    
                                                                                                    20160815-214220
                                                                                                      (x-factor)   
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0)                 0.027 ms        0.026 ms          1.02      
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10)                0.066 ms        0.068 ms          0.97      
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0)                0.022 ms        0.022 ms          1.01      
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10)               0.055 ms        0.056 ms          0.98      
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0)                0.033 ms        0.032 ms          1.02      
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10)               0.054 ms        0.055 ms          0.98      
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0)               0.026 ms        0.026 ms          0.99      
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10)              0.059 ms        0.060 ms          0.98      
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0)                0.057 ms        0.060 ms          0.96      
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10)               0.080 ms        0.084 ms          0.96      
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0)               0.037 ms        0.039 ms          0.95      
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10)              0.070 ms        0.073 ms          0.97      
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0)                2.786 ms        2.712 ms          1.03      
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10)               3.920 ms        3.888 ms          1.01      
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0)               0.242 ms        0.298 ms          0.81      
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10)              0.916 ms        0.942 ms          0.97      
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0)               3.403 ms        3.313 ms          1.03      
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10)              4.518 ms        4.474 ms          1.01      
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0)              0.300 ms        0.364 ms          0.83      
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10)             0.966 ms        1.006 ms          0.96      
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0)               5.325 ms        5.215 ms          1.02      
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10)              5.917 ms        5.796 ms          1.02      
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0)              0.443 ms        0.540 ms          0.82      
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10)             1.111 ms        1.188 ms          0.94      
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0)              81.191 ms       78.670 ms         1.03      
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10)            137.046 ms      146.944 ms         0.93      
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0)             5.141 ms        6.478 ms          0.79      
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10)            33.305 ms       37.155 ms         0.90      
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0)            100.009 ms      101.746 ms         0.98      
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10)           155.763 ms      155.185 ms         1.00      
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0)            6.407 ms        7.555 ms          0.85      
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10)           34.567 ms       36.710 ms         0.94      
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0)            150.923 ms      146.320 ms         1.03      
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10)           172.952 ms      169.110 ms         1.02      
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0)            9.502 ms        11.193 ms         0.85      
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10)           37.766 ms       40.142 ms         0.94      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0)          0.013 ms        0.013 ms          0.98      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10)         0.045 ms        0.048 ms          0.93      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0)         0.010 ms        0.010 ms          1.01      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10)        0.040 ms        0.040 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0)         0.017 ms        0.016 ms          1.06      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10)        0.050 ms        0.054 ms          0.91      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0)        0.012 ms        0.012 ms          0.97      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10)       0.041 ms        0.043 ms          0.95      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0)         0.028 ms        0.028 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10)        0.045 ms        0.046 ms          0.98      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0)        0.016 ms        0.016 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10)       0.045 ms        0.047 ms          0.96      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0)         1.284 ms        1.286 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10)        1.950 ms        1.885 ms          1.03      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0)        0.124 ms        0.145 ms          0.86      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10)       0.527 ms        0.551 ms          0.96      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0)        1.692 ms        1.753 ms          0.97      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10)       2.284 ms        2.349 ms          0.97      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0)       0.148 ms        0.173 ms          0.86      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10)      0.551 ms        0.601 ms          0.92      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0)        2.751 ms        2.822 ms          0.97      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10)       3.041 ms        3.162 ms          0.96      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0)       0.190 ms        0.229 ms          0.83      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10)      0.603 ms        0.656 ms          0.92      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0)       35.884 ms       35.357 ms         1.01      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10)      58.332 ms       57.782 ms         1.01      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0)      2.384 ms        2.718 ms          0.88      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10)     12.464 ms       12.585 ms         0.99      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0)      48.371 ms       48.015 ms         1.01      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10)     70.765 ms       72.106 ms         0.98      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0)     2.879 ms        3.361 ms          0.86      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10)    13.251 ms       13.377 ms         0.99      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0)      77.060 ms       76.753 ms         1.00      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10)     99.427 ms       97.301 ms         1.02      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0)     4.529 ms        4.601 ms          0.98      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10)    14.308 ms       14.395 ms         0.99      

So I'm not sure whether we should include the vectorized version or not. I have rebased all changes onto 150daa2dc57a258ba61a01e12901518b6b4d98e8.

I think I will wait for the opinions of @prclibo and @alalek on this.

mself commented 7 years ago

I agree. We should not include this if it is slower in any common case.

My tests were with LLVM 7.3.0 (clang-703.0.29). I will investigate to see if there is some reason why vectorization isn't enabled in the OpenCV build settings. I think there is also a debug mode that will tell you why a loop wasn't vectorized.

Others with more experience with vectorization may be able to shed light on the best approach (possibly do nothing and leave it to the compiler).

mself commented 7 years ago

BTW, I have a few other performance ideas:

    • Combining computeError() into findInliers() could avoid iterating over all of the points a second time and eliminate the need to write to and then read from the err array. But it reduces the separation between the functions, which adds complexity. This would help RANSAC (for large numbers of points), but not LMedS. I think it could be worth measuring to see if it is significant.
    • LMedS currently estimates the number of iterations needed using a fixed 45% estimate for the number of outliers. On the other hand, RANSAC uses the actual percentage of outliers found and will lower the number of iterations needed as better models are found. When the actual outlier percentage is much lower than 45%, I think LMedS is doing more iterations than needed. It seems like LMedS could be changed to use a similar adaptive approach to reduce the number of iterations when the outlier percentage is low. The tradeoff is that you will make more calls to findInliers() (currently it only calls it once at the end). It would be great to get input from someone more familiar with LMedS to make sure there isn't a flaw with this idea.
    • If the LM solver is quite slow but just using the best model estimated from 2 points is too inaccurate, then we could add a final least squares step that calls runKernel() with all of the points (like findHomography() does). Or, adding this step could reduce the number of steps that LM takes to converge.
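To make the iteration-count tradeoff in the second idea concrete: the standard RANSAC bound says that with outlier ratio e, sample size s, and target confidence p, roughly log(1 - p) / log(1 - (1 - e)^s) random samples are needed to draw at least one outlier-free sample. The sketch below is a standalone illustration of that formula, not OpenCV's implementation. `requiredIterations` and its parameters are hypothetical names, and the sample size of 3 point pairs used in the note that follows is an assumption of the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper (not an OpenCV function): number of random samples needed so
// that, with probability `confidence`, at least one sample of `sampleSize` points
// contains no outliers, given an assumed `outlierRatio`.
int requiredIterations(double confidence, double outlierRatio, int sampleSize, int maxIters)
{
    assert(sampleSize > 0 && maxIters > 0);
    confidence = std::min(std::max(confidence, 0.0), 1.0);
    outlierRatio = std::min(std::max(outlierRatio, 0.0), 1.0);

    // Probability that one random sample contains at least one outlier.
    const double failOne = 1.0 - std::pow(1.0 - outlierRatio, sampleSize);
    // Allowed probability that every sample drawn is contaminated.
    const double failAll = 1.0 - confidence;

    if (failOne <= 0.0)
        return 1;            // no outliers at all: the first sample is already clean
    if (failOne >= 1.0 || failAll <= 0.0)
        return maxIters;     // bound is infinite, fall back to the cap

    const double n = std::log(failAll) / std::log(failOne);
    if (n >= maxIters)
        return maxIters;
    return std::max(1, static_cast<int>(std::ceil(n)));
}
```

With a confidence of 0.99 and 3-point samples, a fixed 45% outlier estimate gives 26 iterations, while a measured 10% outlier ratio gives 4. That gap is the kind of saving an adaptive update could bring when the data is clean, at the cost of the extra inlier counting mentioned above.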

prclibo commented 7 years ago

hey @hrnr @mself thanks for the improvement. the discussion's insightful:)

hrnr commented 7 years ago

@prclibo What do you think about the manually vectorized version of computeError()? Should I include it?

prclibo commented 7 years ago

@hrnr Honestly, I do not know much about SSE optimization =(. My personal opinion: it is good to have optimized and fast code. But if we do not fully understand the optimization mechanism, it is also fine to keep the implementation as simple as it is.

mself commented 7 years ago

It looks like clang isn't able to vectorize loops like the one in Affine2DEstimatorCallback::computeError(). I tried the following simple loop, which has a similar interleaved access pattern.

typedef struct {
    float x;
    float y;
} point;

void bar (const point *a, const point *b, float *c, int n)
{
    for (int i = 0; i < n; i++) {
        c[i] = (a[i].x * b[i].x) + (a[i].y * b[i].y);
    }
}

When compiled with -Rpass-analysis=loop-vectorize, I get the remarks:

test.cpp:51:22: remark: the cost-model indicates that vectorization is not beneficial
      [-Rpass-analysis=loop-vectorize]
        c[i] = (a[i].x * b[i].x) + (a[i].y * b[i].y);
                     ^
test.cpp:51:22: remark: the cost-model indicates that interleaving is not beneficial
      [-Rpass-analysis=loop-vectorize]

I also tried using a #pragma to force clang to vectorize the loop, but the results were very poor. It generated lots of non-vectorized instructions along with a few vectorized ones and a lot of shuffle instructions.

So one option would be to include the manually vectorized code but only enable it for clang. Another option would be for me to switch to GCC when compiling OpenCV :-)
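For readers unfamiliar with the interleaved access pattern, here is a minimal scalar sketch of what "deinterleaving" means for the loop above. It processes points in blocks of 4, first splitting the x and y fields into separate small arrays (the job a 2-channel deinterleaving load does in one step) and then doing the arithmetic on contiguous data. The names `Point2`, `dotInterleaved`, and `dotDeinterleaved` are hypothetical, and nothing here claims that a given compiler will vectorize the second form; it only shows the data rearrangement.

```cpp
#include <cassert>

struct Point2 { float x; float y; };  // same layout as the `point` struct above

// Scalar reference: the same computation as bar() above.
void dotInterleaved(const Point2* a, const Point2* b, float* c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i].x * b[i].x + a[i].y * b[i].y;
}

// Same result, computed in blocks of 4 points with x and y split apart first.
void dotDeinterleaved(const Point2* a, const Point2* b, float* c, int n)
{
    assert(n >= 0);
    const int lanes = 4;
    int i = 0;
    for (; i + lanes <= n; i += lanes)
    {
        float ax[lanes], ay[lanes], bx[lanes], by[lanes];
        // Deinterleave: gather all x values and all y values of the block.
        for (int k = 0; k < lanes; k++)
        {
            ax[k] = a[i + k].x;  ay[k] = a[i + k].y;
            bx[k] = b[i + k].x;  by[k] = b[i + k].y;
        }
        // Arithmetic on contiguous lanes.
        for (int k = 0; k < lanes; k++)
            c[i + k] = ax[k] * bx[k] + ay[k] * by[k];
    }
    // Scalar tail for the remaining n % 4 points.
    for (; i < n; i++)
        c[i] = a[i].x * b[i].x + a[i].y * b[i].y;
}
```

Both functions perform the same operations per point in the same order, so they should agree element by element; the tail loop handles point counts that are not a multiple of 4.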

mself commented 7 years ago

I'd like to include a vectorized version for clang, but I wasn't happy that the universal intrinsics were missing the 2-channel v_load_deinterleave(), which resulted in slower code. So I added the 2-channel float version for SSE and NEON (67d632c9464bbdfa072fe4963193a60f90c3ab48)!

I then updated the vectorized version to use it (ef663f4fb35dbe023f69496ee03fdd13ff2e2b5e), and I now get slightly better performance numbers (and the vectorized code is much simpler). The generated code looks the same as what you showed from gcc 4.9.3, so I am hopeful that this version will be no slower than any of the auto-vectorized versions.

Here are the perf results:

Geometric mean

                           Name of Test                                 calib3d        calib3d        calib3d         calib3d    
                                                                         posix          posix          posix           posix     
                                                                          x64            x64            x64             x64      
                                                                        5cde391        5cde391        5cde391         5cde391    
                                                                    20160817-nosimd 20160817-simd  20160817-simd   20160817-simd 
                                                                                                        vs              vs       
                                                                                                      calib3d         calib3d    
                                                                                                       posix           posix     
                                                                                                        x64             x64      
                                                                                                      5cde391         5cde391    
                                                                                                  20160817-nosimd 20160817-nosimd
                                                                                                    (x-factor)        (score)    
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0)                 0.023 ms       0.022 ms         1.06           faster     
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10)                0.053 ms       0.054 ms         0.98                      
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0)                0.024 ms       0.020 ms         1.19           FASTER     
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10)               0.049 ms       0.047 ms         1.04           faster     
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0)                0.029 ms       0.027 ms         1.07           faster     
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10)               0.045 ms       0.046 ms         0.98                      
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0)               0.030 ms       0.025 ms         1.19           FASTER     
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10)              0.054 ms       0.053 ms         1.01                      
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0)                0.052 ms       0.049 ms         1.04                      
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10)               0.068 ms       0.069 ms         0.99                      
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0)               0.043 ms       0.036 ms         1.22           FASTER     
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10)              0.069 ms       0.061 ms         1.12           faster     
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0)                2.369 ms       2.292 ms         1.03                      
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10)               3.329 ms       3.266 ms         1.02                      
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0)               0.399 ms       0.200 ms         1.99           FASTER     
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10)              0.976 ms       0.761 ms         1.28           FASTER     
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0)               2.882 ms       2.778 ms         1.04                      
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10)              3.843 ms       3.741 ms         1.03                      
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0)              0.499 ms       0.246 ms         2.02           FASTER     
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10)             1.063 ms       0.799 ms         1.33           FASTER     
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0)               4.419 ms       4.326 ms         1.02                      
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10)              4.916 ms       4.733 ms         1.04                      
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0)              0.755 ms       0.368 ms         2.05           FASTER     
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10)             1.299 ms       0.912 ms         1.42           FASTER     
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0)              69.528 ms      67.853 ms        1.02                      
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10)            119.832 ms     116.765 ms        1.03                      
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0)             7.546 ms       3.920 ms         1.92           FASTER     
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10)            21.852 ms      19.590 ms        1.12           faster     
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0)             85.240 ms      85.687 ms        0.99                      
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10)           138.463 ms     136.450 ms        1.01                      
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0)            9.314 ms       4.879 ms         1.91           FASTER     
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10)           23.353 ms      20.709 ms        1.13           faster     
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0)            128.292 ms     131.003 ms        0.98                      
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10)           142.334 ms     148.350 ms        0.96                      
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0)            13.972 ms      6.886 ms         2.03           FASTER     
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10)           28.126 ms      22.698 ms        1.24           faster     
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0)          0.012 ms       0.011 ms         1.08           faster     
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10)         0.034 ms       0.036 ms         0.96                      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0)         0.011 ms       0.010 ms         1.14           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10)        0.031 ms       0.031 ms         1.00                      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0)         0.015 ms       0.014 ms         1.04                      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10)        0.039 ms       0.039 ms         1.01                      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0)        0.014 ms       0.012 ms         1.18           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10)       0.034 ms       0.033 ms         1.01                      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0)         0.023 ms       0.021 ms         1.09           faster     
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10)        0.035 ms       0.035 ms         1.01                      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0)        0.019 ms       0.015 ms         1.20           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10)       0.038 ms       0.037 ms         1.05           faster     
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0)         1.097 ms       1.081 ms         1.02                      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10)        1.586 ms       1.566 ms         1.01                      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0)        0.168 ms       0.098 ms         1.71           FASTER     
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10)       0.523 ms       0.450 ms         1.16           faster     
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0)        1.468 ms       1.453 ms         1.01                      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10)       1.966 ms       1.918 ms         1.02                      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0)       0.211 ms       0.113 ms         1.87           FASTER     
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10)      0.561 ms       0.468 ms         1.20           FASTER     
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0)        2.286 ms       2.314 ms         0.99                      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10)       2.551 ms       2.508 ms         1.02                      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0)       0.301 ms       0.156 ms         1.92           FASTER     
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10)      0.654 ms       0.510 ms         1.28           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0)       29.840 ms      30.123 ms        0.99                      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10)      43.798 ms      44.995 ms        0.97                      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0)      3.211 ms       1.844 ms         1.74           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10)     9.677 ms       8.013 ms         1.21           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0)      40.414 ms      40.970 ms        0.99                      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10)     54.711 ms      56.088 ms        0.98                      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0)     3.990 ms       2.239 ms         1.78           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10)    10.130 ms      8.520 ms         1.19           faster     
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0)      64.602 ms      65.833 ms        0.98                      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10)     80.721 ms      78.250 ms        1.03                      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0)     5.635 ms       2.982 ms         1.89           FASTER     
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10)    11.751 ms      9.224 ms         1.27           FASTER     

hrnr commented 7 years ago

Thanks for the 2-channel deinterleave. It was on my TODO list. Looks nice.

I will test your version with 2-channel deinterleave on GCC 6.

I took a look at what GCC 6 produces, and it is not too different from the GCC 4.9 version. It uses the same vectorization approach. However, the instruction ordering is different: GCC 6 prefers to put movaps before the movss memory loads in each iteration. The math is the same, but GCC 6 was able to save one movaps, which was blocking the second subss in some cases, by handling temporary results better. A beautiful job from the compiler!
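As a reading aid for the listing below: the scalar tail (the movss / mulss / addss / subss runs) appears to apply two multiply-adds per point, subtract the target coordinate, and sum the squares. A minimal sketch of that arithmetic, assuming a 2x3 affine matrix stored row-major in six floats, could look like this. `Pt` and `affineSquaredError` are hypothetical names for illustration, not the actual OpenCV source.

```cpp
#include <cassert>

struct Pt { float x; float y; };

// Hypothetical standalone version of the per-point error.
// M holds the affine coefficients row-major: [M0 M1 M2; M3 M4 M5].
void affineSquaredError(const Pt* from, const Pt* to, const float M[6], float* err, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
    {
        // Transform the source point and subtract the target point.
        const float dx = M[0] * from[i].x + M[1] * from[i].y + M[2] - to[i].x;
        const float dy = M[3] * from[i].x + M[4] * from[i].y + M[5] - to[i].y;
        // Squared Euclidean distance between transformed and target point.
        err[i] = dx * dx + dy * dy;
    }
}
```

The vectorized loop in the listing does the same work on four points at a time, using shufps with the 0x88 and 0xdd masks to separate the x and y values after the loads.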

 1d7:   0f 86 03 07 00 00       jbe    8e0 <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x8e0>
 1dd:   0f 28 cf                movaps %xmm7,%xmm1
 1e0:   41 8d 74 24 fc          lea    -0x4(%r12),%esi
 1e5:   4c 8b 95 38 fe ff ff    mov    -0x1c8(%rbp),%r10
 1ec:   44 0f 28 f4             movaps %xmm4,%xmm14
 1f0:   31 c0                   xor    %eax,%eax
 1f2:   0f c6 c9 00             shufps $0x0,%xmm1,%xmm1
 1f6:   c1 ee 02                shr    $0x2,%esi
 1f9:   44 0f 28 eb             movaps %xmm3,%xmm13
 1fd:   83 c6 01                add    $0x1,%esi
 200:   44 0f 28 e5             movaps %xmm5,%xmm12
 204:   8d 14 b5 00 00 00 00    lea    0x0(,%rsi,4),%edx
 20b:   0f 29 8d 00 fe ff ff    movaps %xmm1,-0x200(%rbp)
 212:   41 0f 28 c8             movaps %xmm8,%xmm1
 216:   31 ff                   xor    %edi,%edi
 218:   44 0f 28 de             movaps %xmm6,%xmm11
 21c:   0f c6 c9 00             shufps $0x0,%xmm1,%xmm1
 220:   45 0f c6 f6 00          shufps $0x0,%xmm14,%xmm14
 225:   45 0f c6 ed 00          shufps $0x0,%xmm13,%xmm13
 22a:   45 0f c6 e4 00          shufps $0x0,%xmm12,%xmm12
 22f:   45 0f c6 db 00          shufps $0x0,%xmm11,%xmm11
 234:   0f 29 8d 20 fe ff ff    movaps %xmm1,-0x1e0(%rbp)
 23b:   45 0f 28 fe             movaps %xmm14,%xmm15
 23f:   83 c7 01                add    $0x1,%edi
 242:   0f 10 04 43             movups (%rbx,%rax,2),%xmm0
 246:   0f 10 4c 43 10          movups 0x10(%rbx,%rax,2),%xmm1
 24b:   44 0f 28 c8             movaps %xmm0,%xmm9
 24f:   0f c6 c1 dd             shufps $0xdd,%xmm1,%xmm0
 253:   41 0f 10 14 42          movups (%r10,%rax,2),%xmm2
 258:   44 0f c6 c9 88          shufps $0x88,%xmm1,%xmm9
 25d:   41 0f 28 cd             movaps %xmm13,%xmm1
 261:   45 0f 10 54 42 10       movups 0x10(%r10,%rax,2),%xmm10
 267:   0f 59 c8                mulps  %xmm0,%xmm1
 26a:   0f 59 85 00 fe ff ff    mulps  -0x200(%rbp),%xmm0
 271:   45 0f 59 f9             mulps  %xmm9,%xmm15
 275:   45 0f 59 cb             mulps  %xmm11,%xmm9
 279:   41 0f 58 cf             addps  %xmm15,%xmm1
 27d:   44 0f 28 fa             movaps %xmm2,%xmm15
 281:   41 0f 58 c1             addps  %xmm9,%xmm0
 285:   41 0f c6 d2 dd          shufps $0xdd,%xmm10,%xmm2
 28a:   45 0f c6 fa 88          shufps $0x88,%xmm10,%xmm15
 28f:   41 0f 58 cc             addps  %xmm12,%xmm1
 293:   0f 58 85 20 fe ff ff    addps  -0x1e0(%rbp),%xmm0
 29a:   41 0f 5c cf             subps  %xmm15,%xmm1
 29e:   0f 5c c2                subps  %xmm2,%xmm0
 2a1:   0f 59 c9                mulps  %xmm1,%xmm1
 2a4:   0f 59 c0                mulps  %xmm0,%xmm0
 2a7:   0f 58 c8                addps  %xmm0,%xmm1
 2aa:   0f 11 0c 01             movups %xmm1,(%rcx,%rax,1)
 2ae:   48 83 c0 10             add    $0x10,%rax
 2b2:   39 f7                   cmp    %esi,%edi
 2b4:   72 85                   jb     23b <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x23b>
 2b6:   44 39 e2                cmp    %r12d,%edx
 2b9:   0f 84 2e 01 00 00       je     3ed <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x3ed>
 2bf:   48 63 f2                movslq %edx,%rsi
 2c2:   0f 28 d3                movaps %xmm3,%xmm2
 2c5:   48 8d 04 f5 00 00 00    lea    0x0(,%rsi,8),%rax
 2cc:   00 
 2cd:   44 0f 28 cc             movaps %xmm4,%xmm9
 2d1:   48 8d 3c 03             lea    (%rbx,%rax,1),%rdi
 2d5:   4c 01 d0                add    %r10,%rax
 2d8:   f3 0f 10 0f             movss  (%rdi),%xmm1
 2dc:   f3 0f 10 47 04          movss  0x4(%rdi),%xmm0
 2e1:   f3 44 0f 59 c9          mulss  %xmm1,%xmm9
 2e6:   f3 0f 59 d0             mulss  %xmm0,%xmm2
 2ea:   f3 0f 59 ce             mulss  %xmm6,%xmm1
 2ee:   f3 0f 59 c7             mulss  %xmm7,%xmm0
 2f2:   f3 41 0f 58 d1          addss  %xmm9,%xmm2
 2f7:   f3 0f 58 c1             addss  %xmm1,%xmm0
 2fb:   f3 0f 58 d5             addss  %xmm5,%xmm2
 2ff:   f3 41 0f 58 c0          addss  %xmm8,%xmm0
 304:   f3 0f 5c 10             subss  (%rax),%xmm2
 308:   f3 0f 5c 40 04          subss  0x4(%rax),%xmm0
 30d:   8d 42 01                lea    0x1(%rdx),%eax
 310:   41 39 c4                cmp    %eax,%r12d
 313:   f3 0f 59 d2             mulss  %xmm2,%xmm2
 317:   f3 0f 59 c0             mulss  %xmm0,%xmm0
 31b:   f3 0f 58 c2             addss  %xmm2,%xmm0
 31f:   f3 0f 11 04 b1          movss  %xmm0,(%rcx,%rsi,4)
 324:   0f 8e c3 00 00 00       jle    3ed <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x3ed>
 32a:   48 98                   cltq   
 32c:   0f 28 d3                movaps %xmm3,%xmm2
 32f:   48 8d 34 c5 00 00 00    lea    0x0(,%rax,8),%rsi
 336:   00 
 337:   44 0f 28 cc             movaps %xmm4,%xmm9
 33b:   83 c2 02                add    $0x2,%edx
 33e:   48 8d 3c 33             lea    (%rbx,%rsi,1),%rdi
 342:   4c 01 d6                add    %r10,%rsi
 345:   44 39 e2                cmp    %r12d,%edx
 348:   f3 0f 10 07             movss  (%rdi),%xmm0
 34c:   f3 0f 10 4f 04          movss  0x4(%rdi),%xmm1
 351:   f3 44 0f 59 c8          mulss  %xmm0,%xmm9
 356:   f3 0f 59 d1             mulss  %xmm1,%xmm2
 35a:   f3 0f 59 c6             mulss  %xmm6,%xmm0
 35e:   f3 0f 59 cf             mulss  %xmm7,%xmm1
 362:   f3 41 0f 58 d1          addss  %xmm9,%xmm2
 367:   f3 0f 58 c1             addss  %xmm1,%xmm0
 36b:   f3 0f 58 d5             addss  %xmm5,%xmm2
 36f:   f3 41 0f 58 c0          addss  %xmm8,%xmm0
 374:   f3 0f 5c 16             subss  (%rsi),%xmm2
 378:   f3 0f 5c 46 04          subss  0x4(%rsi),%xmm0
 37d:   f3 0f 59 d2             mulss  %xmm2,%xmm2
 381:   f3 0f 59 c0             mulss  %xmm0,%xmm0
 385:   f3 0f 58 c2             addss  %xmm2,%xmm0
 389:   f3 0f 11 04 81          movss  %xmm0,(%rcx,%rax,4)
 38e:   7d 5d                   jge    3ed <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x3ed>
 390:   48 63 d2                movslq %edx,%rdx
 393:   48 8d 04 d5 00 00 00    lea    0x0(,%rdx,8),%rax
 39a:   00 
 39b:   48 01 c3                add    %rax,%rbx
 39e:   48 03 85 38 fe ff ff    add    -0x1c8(%rbp),%rax
 3a5:   f3 0f 10 0b             movss  (%rbx),%xmm1
 3a9:   f3 0f 10 43 04          movss  0x4(%rbx),%xmm0
 3ae:   f3 0f 59 e1             mulss  %xmm1,%xmm4
 3b2:   f3 0f 59 d8             mulss  %xmm0,%xmm3
 3b6:   f3 0f 59 f1             mulss  %xmm1,%xmm6
 3ba:   f3 0f 59 c7             mulss  %xmm7,%xmm0
 3be:   f3 0f 58 dc             addss  %xmm4,%xmm3
 3c2:   f3 0f 58 c6             addss  %xmm6,%xmm0
 3c6:   f3 0f 58 eb             addss  %xmm3,%xmm5
 3ca:   f3 44 0f 58 c0          addss  %xmm0,%xmm8
 3cf:   f3 0f 5c 28             subss  (%rax),%xmm5
 3d3:   f3 44 0f 5c 40 04       subss  0x4(%rax),%xmm8
 3d9:   f3 0f 59 ed             mulss  %xmm5,%xmm5
 3dd:   f3 45 0f 59 c0          mulss  %xmm8,%xmm8
 3e2:   f3 44 0f 58 c5          addss  %xmm5,%xmm8
 3e7:   f3 44 0f 11 04 91       movss  %xmm8,(%rcx,%rdx,4)
 3ed:   48 8b 45 a8             mov    -0x58(%rbp),%rax
 3f1:   48 85 c0                test   %rax,%rax
 3f4:   74 13                   je     409 <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x409>
hrnr commented 7 years ago

I have tested your 2-channel version. GCC just went crazy with:

../modules/core/include/opencv2/core/hal/intrin_sse.hpp: In function ‘void cv::v_store_interleave(float*, const cv::v_float32x4&, const cv::v_float32x4&)’:
../modules/core/include/opencv2/core/hal/intrin_sse.hpp:1547:15: warning: unused variable ‘mask_lo’ [-Wunused-variable]
     const int mask_lo = _MM_SHUFFLE(2, 0, 2, 0), mask_hi = _MM_SHUFFLE(3, 1, 3, 1);
               ^~~~~~~
../modules/core/include/opencv2/core/hal/intrin_sse.hpp:1547:50: warning: unused variable ‘mask_hi’ [-Wunused-variable]
     const int mask_lo = _MM_SHUFFLE(2, 0, 2, 0), mask_hi = _MM_SHUFFLE(3, 1, 3, 1);
                                                  ^~~~~~~

I don't think it's safe to use variables as the control for _mm_shuffle_ps, since it actually generates shufps. The control operand of shufps (imm8) must be an immediate, and that's why GCC reports unused variables.
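
To illustrate the point about the shuffle control, here is a minimal sketch (not the code from intrin_sse.hpp) that deinterleaves four x,y pairs with the control written as a constant expression directly at the call site. The names `deinterleave4` and `SKETCH_HAS_SSE` are made up for this example. The two masks are the ones quoted in the warning above; they evaluate to 0x88 and 0xdd, the same immediates as the shufps pair in the disassembly.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Hypothetical helper: split 4 interleaved points (x0,y0,...,x3,y3)
// into xs[0..3] and ys[0..3].
void deinterleave4(const float* xy, float* xs, float* ys)
{
    assert(xy && xs && ys);
#if SKETCH_HAS_SSE
    const __m128 lo = _mm_loadu_ps(xy);      // x0 y0 x1 y1
    const __m128 hi = _mm_loadu_ps(xy + 4);  // x2 y2 x3 y3
    // The control is a literal constant expression because shufps
    // encodes it as an 8-bit immediate in the instruction itself.
    _mm_storeu_ps(xs, _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0)));  // x0 x1 x2 x3
    _mm_storeu_ps(ys, _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1)));  // y0 y1 y2 y3
#else
    // Scalar fallback for targets without SSE.
    for (std::size_t i = 0; i < 4; ++i) {
        xs[i] = xy[2 * i];
        ys[i] = xy[2 * i + 1];
    }
#endif
}
```

Calling `deinterleave4` on `{1, 10, 2, 20, 3, 30, 4, 40}` should leave `1 2 3 4` in `xs` and `10 20 30 40` in `ys`.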

I have run the tests on this version and it is the fastest manually vectorized version, about 5% faster than the previous version. I think it is a good job and the code is perfectly readable.

But GCC 6.1 is still better by about 10%.

Geometric mean (calib3d, posix x64, ef663f4)

                           Name of Test                              auto vec  vectorized vectorized
                                                                       only                   vs
                                                                                         auto vec only
                                                                                          (x-factor)
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0)               0.029 ms   0.031 ms     0.94   
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10)              0.072 ms   0.074 ms     0.97   
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0)              0.021 ms   0.021 ms     0.98   
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10)             0.058 ms   0.059 ms     0.99   
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0)              0.036 ms   0.036 ms     0.98   
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10)             0.058 ms   0.061 ms     0.96   
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0)             0.027 ms   0.026 ms     1.04   
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10)            0.060 ms   0.064 ms     0.93   
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0)              0.060 ms   0.063 ms     0.95   
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10)             0.091 ms   0.088 ms     1.04   
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0)             0.039 ms   0.037 ms     1.04   
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10)            0.075 ms   0.076 ms     0.99   
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0)              2.781 ms   2.709 ms     1.03   
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10)             4.120 ms   3.870 ms     1.06   
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0)             0.257 ms   0.276 ms     0.93   
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10)            0.952 ms   0.924 ms     1.03   
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0)             3.568 ms   3.383 ms     1.05   
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10)            4.792 ms   4.729 ms     1.01   
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0)            0.298 ms   0.342 ms     0.87   
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10)           1.034 ms   0.982 ms     1.05   
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0)             5.309 ms   5.224 ms     1.02   
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10)            6.017 ms   6.117 ms     0.98   
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0)            0.445 ms   0.478 ms     0.93   
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10)           1.112 ms   1.145 ms     0.97   
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0)           81.476 ms  80.553 ms     1.01   
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10)          138.876 ms 139.970 ms    0.99   
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0)           4.907 ms   5.358 ms     0.92   
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10)         33.169 ms  33.396 ms     0.99   
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0)          102.326 ms 97.172 ms     1.05   
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10)         160.907 ms 152.469 ms    1.06   
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0)          6.150 ms   6.692 ms     0.92   
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10)        35.220 ms  34.744 ms     1.01   
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0)          151.706 ms 146.966 ms    1.03   
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10)         176.685 ms 168.195 ms    1.05   
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0)          9.078 ms  10.040 ms     0.90   
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10)        37.006 ms  38.464 ms     0.96   
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0)        0.013 ms   0.013 ms     0.99   
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10)       0.047 ms   0.048 ms     0.97   
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0)       0.010 ms   0.010 ms     1.00   
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10)      0.040 ms   0.041 ms     0.99   
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0)       0.017 ms   0.017 ms     0.97   
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10)      0.052 ms   0.053 ms     0.98   
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0)      0.012 ms   0.012 ms     1.02   
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10)     0.042 ms   0.043 ms     0.98   
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0)       0.028 ms   0.027 ms     1.04   
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10)      0.045 ms   0.046 ms     0.98   
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0)      0.016 ms   0.015 ms     1.03   
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10)     0.046 ms   0.047 ms     1.00   
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0)       1.319 ms   1.253 ms     1.05   
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10)      1.982 ms   1.973 ms     1.00   
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0)      0.120 ms   0.130 ms     0.92   
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10)     0.521 ms   0.533 ms     0.98   
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0)      1.762 ms   1.745 ms     1.01   
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10)     2.465 ms   2.439 ms     1.01   
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0)     0.138 ms   0.156 ms     0.89   
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10)    0.555 ms   0.529 ms     1.05   
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0)      2.747 ms   2.781 ms     0.99   
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10)     3.158 ms   3.201 ms     0.99   
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0)     0.201 ms   0.215 ms     0.94   
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10)    0.574 ms   0.585 ms     0.98   
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0)    35.946 ms  34.784 ms     1.03   
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10)   57.799 ms  57.446 ms     1.01   
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0)    2.298 ms   2.523 ms     0.91   
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10)  12.577 ms  13.177 ms     0.95   
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0)   48.473 ms  46.815 ms     1.04   
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10)  72.757 ms  69.004 ms     1.05   
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0)   2.756 ms   2.987 ms     0.92   
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 13.268 ms  13.391 ms     0.99   
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0)   77.844 ms  74.696 ms     1.04   
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10)  101.374 ms 97.356 ms     1.04   
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0)   3.846 ms   4.175 ms     0.92   
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 13.957 ms  14.556 ms     0.96   

I have also tested the latest clang (3.8.1) and I can confirm your results. For clang the manually vectorized version is faster.


Geometric mean (calib3d, posix x64, d56de49)

                           Name of Test                             20160817-103112 20160817-103452 20160817-103452
                                                                                                           vs
                                                                                                    20160817-103112
                                                                                                      (x-factor)
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0)                 0.068 ms        0.066 ms          1.03      
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10)                0.137 ms        0.133 ms          1.03      
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0)                0.042 ms        0.036 ms          1.14      
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10)               0.102 ms        0.097 ms          1.06      
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0)                0.081 ms        0.078 ms          1.04      
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10)               0.116 ms        0.112 ms          1.03      
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0)               0.052 ms        0.044 ms          1.17      
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10)              0.111 ms        0.109 ms          1.02      
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0)                0.129 ms        0.125 ms          1.03      
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10)               0.165 ms        0.161 ms          1.03      
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0)               0.075 ms        0.065 ms          1.16      
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10)              0.136 ms        0.127 ms          1.07      
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0)                4.917 ms        4.792 ms          1.03      
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10)               7.409 ms        7.218 ms          1.03      
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0)               0.861 ms        0.516 ms          1.67      
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10)              2.310 ms        2.001 ms          1.15      
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0)               5.992 ms        5.865 ms          1.02      
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10)              8.510 ms        8.316 ms          1.02      
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0)              1.081 ms        0.639 ms          1.69      
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10)             2.527 ms        2.096 ms          1.21      
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0)               9.402 ms        9.220 ms          1.02      
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10)              10.667 ms       10.431 ms         1.02      
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0)              1.648 ms        0.961 ms          1.72      
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10)             3.068 ms        2.368 ms          1.30      
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0)             141.065 ms      139.235 ms         1.01      
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10)            213.813 ms      211.086 ms         1.01      
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0)             17.388 ms       10.649 ms         1.63      
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10)            51.688 ms       45.062 ms         1.15      
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0)            173.158 ms      170.267 ms         1.02      
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10)           245.782 ms      242.255 ms         1.01      
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0)            21.924 ms       13.371 ms         1.64      
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10)           56.390 ms       47.730 ms         1.18      
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0)            261.662 ms      257.326 ms         1.02      
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10)           287.768 ms      283.550 ms         1.01      
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0)            33.363 ms       19.956 ms         1.67      
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10)           67.788 ms       54.432 ms         1.25      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0)          0.034 ms        0.033 ms          1.02      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10)         0.091 ms        0.091 ms          1.01      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0)         0.021 ms        0.018 ms          1.13      
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10)        0.069 ms        0.066 ms          1.04      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0)         0.044 ms        0.043 ms          1.04      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10)        0.102 ms        0.100 ms          1.01      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0)        0.025 ms        0.022 ms          1.14      
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10)       0.073 ms        0.069 ms          1.05      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0)         0.067 ms        0.065 ms          1.03      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10)        0.096 ms        0.093 ms          1.03      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0)        0.034 ms        0.028 ms          1.22      
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10)       0.082 ms        0.077 ms          1.08      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0)         2.208 ms        2.111 ms          1.05      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10)        3.359 ms        3.336 ms          1.01      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0)        0.365 ms        0.236 ms          1.55      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10)       1.179 ms        1.025 ms          1.15      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0)        3.027 ms        2.927 ms          1.03      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10)       4.153 ms        4.137 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0)       0.455 ms        0.286 ms          1.59      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10)      1.238 ms        1.058 ms          1.17      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0)        4.884 ms        4.714 ms          1.04      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10)       5.431 ms        5.414 ms          1.00      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0)       0.665 ms        0.402 ms          1.65      
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10)      1.447 ms        1.189 ms          1.22      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0)       62.665 ms       61.630 ms         1.02      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10)      92.328 ms       91.188 ms         1.01      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0)      7.406 ms        4.887 ms          1.52      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10)     23.342 ms       20.811 ms         1.12      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0)      83.958 ms       82.595 ms         1.02      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10)    113.672 ms      112.147 ms         1.01      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0)     9.215 ms        5.966 ms          1.54      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10)    25.183 ms       21.715 ms         1.16      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0)     134.009 ms      131.432 ms         1.02      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10)    163.457 ms      161.649 ms         1.01      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0)     13.343 ms       8.393 ms          1.59      
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10)    29.310 ms       24.297 ms         1.21      
mself commented 7 years ago

Good catch on not using vars for mask_lo and mask_hi. I copied that idea from some other code in OpenCV, so that might be broken with GCC, too. I guess that clang handles it ok?

One issue on the 2-channel support is that for SSE I only added it for floats. For NEON it's trivial to do it for all types, but for SSE it seems to be quite complicated for some of the integer types (8x32, in particular). So I just did it for float only rather than put in code I wasn't sure about. I doubt there is much need for 2-channel integer support, but I don't like that it is supported for NEON but not for SSE.

Do you know if there is any support for AVX in OpenCV? That could be 2x faster since it has 8-wide float registers (and also 3-operand instructions that eliminate the need for copy instructions, like the movaps you mentioned above). There is also AVX512, but I think that is not available on the most common Intel processors.

mself commented 7 years ago

If GCC supports AVX by default (it does) and GCC is auto-vectorizing the computeError() function for you, then why isn't it generating an AVX version instead of an SSE one? Are there some flags that need to be set to enable AVX in OpenCV? Figuring this out could be a big win...

mself commented 7 years ago

If I use clang -mavx, I do get AVX auto-vectorization. It does a very poor job on the deinterleaving, but GCC will probably do better. But then the binaries can only run on CPUs with AVX. What we want is runtime variants, but it wasn't clear to me how to do that in a cross-compiler way. GCC has __attribute__((target("avx"))), and clang says it supports that, too, but I got an error when I tried to use it. I looked through existing OpenCV code and didn't see any similar uses.

It would be pretty cool to come up with a simple way in OpenCV to build functions that use AVX when present but also run on non-AVX CPUs. That could speed up a lot of things besides stitching. Or is it better to just build two entire sets of libraries (one with AVX and one without) and have apps dynamically link the right one?

mself commented 7 years ago

I did a little thinking, and my conclusion is that manual vectorization is probably not a good use of effort in OpenCV these days (possibly with some exceptions). Even though clang has limitations, it's sure to get better soon and GCC is already performing better than the manually vectorized code.

More interesting would be to figure out how to selectively enable AVX. It's at the turning point in adoption where most recent computers support it, but not enough to make it a hard requirement. SSE is already required by default with most compilers, but not yet AVX (maybe in a couple of years?).

It seems like you need to do something like this:

if (check_avx_at_runtime()) {
    #pragma enable(avx)     // needs to work across compilers
    loop goes here          // will autovectorize with AVX
    #pragma disable(avx)
} else {
    same loop goes here    // will autovectorize with SSE or NEON, as appropriate for CPU
}

The bummer is that you have two copies of the code to maintain. Could we create a macro that hides this, but isn't too awkward to use for long stretches of code?

This could then be easily added to important vectorizable code segments anywhere in OpenCV.
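
For illustration only, here is one way the pattern above might be sketched with the target attribute mentioned earlier, assuming a GCC-compatible compiler on x86 (other compilers and architectures just get the baseline loop). Nothing here is OpenCV API: `scaleAdd`, `scaleAddAvx`, `scaleAddBaseline` and the `SKETCH_*` macros are hypothetical names, and the loop itself is a placeholder. The loop body is written once in a macro and instantiated twice, so only one copy has to be maintained, at the cost of writing the code inside a macro.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: __attribute__((target)) and __builtin_cpu_supports are
// only used with GCC-compatible compilers targeting x86.
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define SKETCH_HAS_AVX_DISPATCH 1
#else
#define SKETCH_HAS_AVX_DISPATCH 0
#endif

// The loop is written once; each function that expands it is compiled
// (and possibly auto-vectorized) for its own instruction set.
#define SKETCH_SCALE_ADD_LOOP(dst, src, scale, n)   \
    do {                                            \
        for (std::size_t i = 0; i < (n); ++i)       \
            (dst)[i] += (scale) * (src)[i];         \
    } while (0)

static void scaleAddBaseline(float* dst, const float* src, float scale, std::size_t n)
{
    SKETCH_SCALE_ADD_LOOP(dst, src, scale, n);
}

#if SKETCH_HAS_AVX_DISPATCH
// Same source, but the compiler may use AVX instructions in this function.
__attribute__((target("avx")))
static void scaleAddAvx(float* dst, const float* src, float scale, std::size_t n)
{
    SKETCH_SCALE_ADD_LOOP(dst, src, scale, n);
}
#endif

// Public entry point: picks the variant at runtime.
void scaleAdd(float* dst, const float* src, float scale, std::size_t n)
{
    assert(dst && src);
#if SKETCH_HAS_AVX_DISPATCH
    if (__builtin_cpu_supports("avx")) {
        scaleAddAvx(dst, src, scale, n);
        return;
    }
#endif
    scaleAddBaseline(dst, src, scale, n);
}
```

Callers only see `scaleAdd`; whether the AVX variant is actually vectorized still depends on the compiler and optimization flags.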

hrnr commented 7 years ago

1: I think the codegen is ok with GCC. I was just explaining why it generates the warning. For other parts of OpenCV I don't know, but I got these warnings only for your code. But I could have missed something, as there were lots of compile units with these warnings.

Yes, the 2-channel version is probably most useful for deinterleaving Point2f. I think when you open a PR with these changes, the OpenCV gurus will help you with that, as this is a cool core feature. :) Also nice that you fixed the documentation.

There is support for AVX; see the cmake flags ENABLE_AVX and ENABLE_AVX2. There is also AVX code in OpenCV, which is guarded by CV_AVX, for example in modules/imgproc/src/accum.cpp. OpenCV, however, generally goes with SSE2; I think that's the primarily supported platform. AVX512 is currently MIC only. In my experience AVX will not be 2 times faster.

2: I have been running my tests with AVX disabled to get comparable results. Flags: see above.

3: That's why there are runtime checks for SSE support everywhere, even though the code is guarded by macros. Distributions need to build for everyone, so they turn all these features off. OpenCV has SSE enabled by default, but then does all these runtime checks so that people can build OpenCV with SSE and safely ship that version as the general x86 version for everyone, and people can benefit from SSE (now almost everybody has these extensions). I'm not sure if this is working, as distributions still turn these off (for example Arch Linux).

4: Yes, that is similar to the pattern for SSE. I don't think, however, that it will be beneficial for most users. Distributions will turn it off anyway, and I'm guessing people concerned about performance are building their own versions of OpenCV tailored for their architecture (or they should).

What seems interesting to me would be the option to disable manually vectorized code and use auto-vectorization, while of course building with all vector extensions enabled. This would probably need to be done on a per-function basis, but I'm sure that for some functions auto-vectorization would be faster than the current manually vectorized code, especially when it could use newer instructions, as most of OpenCV is vectorized with SSE2.

To sum up: I'm going to change computeError() to work with floats, which makes it faster in all cases. But when this gets merged (I hope soon, as GSoC is finishing right now) and the 2-channel deinterleave gets merged, feel free to add the vectorized version. I think the latest iteration is very nice and it will probably bring a speedup for most users these days.

BTW: We got pretty hardcore in optimizing one function, but are you sure the rest of the RANSAC code etc. is optimal? Do you have some logs from a profiler? I somehow can't believe the real computation is the bottleneck (which would mean the function is optimal). (Also #7101 might be nice for tuning OpenCV.)
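
For readers following along, this is roughly the kind of loop being discussed: a plain float version of the affine reprojection error that a compiler is free to auto-vectorize. It is a simplified sketch, not the actual Affine2DEstimatorCallback::computeError() code. `Pt2f` is a stand-in for cv::Point2f, and `computeErrorSketch` and the row-major 2x3 layout of `M` are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Pt2f { float x, y; };  // stand-in for cv::Point2f (interleaved x,y)

// M holds a 2x3 affine matrix in row-major order: [a b c; d e f].
// err[i] is the squared distance between M * from[i] and to[i].
void computeErrorSketch(const std::vector<Pt2f>& from,
                        const std::vector<Pt2f>& to,
                        const float M[6],
                        std::vector<float>& err)
{
    assert(from.size() == to.size());
    err.resize(from.size());
    for (std::size_t i = 0; i < from.size(); ++i) {
        // Everything stays in float, so no float/double conversions
        // are needed inside the loop.
        const float dx = M[0] * from[i].x + M[1] * from[i].y + M[2] - to[i].x;
        const float dy = M[3] * from[i].x + M[4] * from[i].y + M[5] - to[i].y;
        err[i] = dx * dx + dy * dy;
    }
}
```

With the identity matrix and `to == from` every error is 0; with a pure translation of (1, 2) every error is 1 + 4 = 5.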

mself commented 7 years ago

Thanks for the summary. I will create a PR for the 2-channel support. Should I also create one for the LMedS nth_element() change, or did you include that in your changes?

In terms of tuning, for RANSAC with a large number of points I think that > 50% of the time is spent in computeError(). It is the only part of the algorithm that scales with n. When I added the SSE version, the overall speed went up 2x for large n compared to no vectorization. But I haven't profiled it, which I should. There is also a variant of RANSAC that only evaluates each model with a subset of the points. That could be faster when n is very large.

In your application, how many points do you typically have? In my application (video stabilization) the number of points is only ~200, and 25-50% of those get rejected as outliers due to camera motion between frames (you want the stabilization to lock onto the distant points rather than nearby features that have apparent motion).

Also, what level of accuracy do you need? Do you use the LM refinement or just go with the best model computed with 2 points? The LM part could probably be optimized, as well.

hrnr commented 7 years ago

Thanks for the valuable discussion. The LMedS change is not included.

I have between 200 and 500 points from feature matching. I use LM with the estimate* functions and then I run my own LM to optimize the whole system of transformations again. For me the estimate* functions are fast enough; there are slower things in my process.

mself commented 7 years ago

OK, I'll make a PR for the LMedS change, too. For me, finding and tracking the features is quite a lot slower than the motion estimation. This just seemed like a fun optimization problem to work on where I could learn more about OpenCV while also learning from someone else working in the same area. Good luck with finishing up your GSOC project!

mself commented 7 years ago

The 2-channel support for universal intrinsics is PR #7182. And the LMedS optimization is PR #7183.
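
The LMedS change itself is not shown in this thread. As a rough illustration of why nth_element() is attractive there, assuming the idea is to find the median residual without fully sorting the error array, a minimal sketch could look like this. `medianByNthElement` is a hypothetical helper, and taking the upper median for even sizes is a choice made for the example, not necessarily what OpenCV does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the median of values (the upper median when the size is even).
// The vector is taken by value because std::nth_element reorders it.
float medianByNthElement(std::vector<float> values)
{
    assert(!values.empty());
    const std::size_t mid = values.size() / 2;
    // Partial selection: average O(n), versus O(n log n) for a full sort.
    std::nth_element(values.begin(), values.begin() + mid, values.end());
    return values[mid];
}
```

Since LMedS only needs the median of the errors for each candidate model, a selection like this avoids ordering the rest of the array.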

alalek commented 7 years ago

There is a merge conflict in modules/calib3d/src/precomp.hpp. Could you resolve it? (or enable the "Allow edits from maintainers" option for this PR)

hrnr commented 7 years ago

rebased on the current master. I have also squashed some of the fixup commits and minor changes and reworded some commits with typos.

Let me know if there are some other issues. Thanks for reviewing.

alalek commented 7 years ago

:+1:

alalek commented 7 years ago

@hrnr Could you please resolve the conflict again? #7443 was merged first =( There are two commits affected (you may use "git merge" or squash your commits into one before rebasing).

hrnr commented 7 years ago

rebased again.