Closed hrnr closed 7 years ago
ORB is apparently thread-unsafe when running with OpenCL.
I have restarted the build once. Last time it passed 1 OCL stitching test; this time it went through 2 of them. There seems to be a race condition.
Is ORB expected to be thread-unsafe with OCL, or should this be fixed? I can disable my parallel feature finding changes with OCL, but I don't know whether this is an actual bug in ORB.
I have disabled parallel feature finding when running with OpenCL. There is not much benefit, because it needs to wait for the device.
This solves the issues with ORB for me, but I'm still not sure whether this should be opened as a bug.
rebase to catch latest changes in master (especially #6962)
FYI: The build is not failing; there is only a warning that the patch to opencv_extra is too big (~1MB). Is that a problem? I'd like to add more images.
I believe this is not a problem in this case. When the PR is ready, please squash all commits into one to keep the patch changes clean.
Jiri, I have some comments/suggestions that might be easier to discuss offline. Can you shoot me an email at dmz@mself.com to connect? I did some work to create variants of findHomography() that estimate constrained transformations that have 3 or 4 degrees of freedom rather than the 8 DOF of a full homography. These correspond to (3D) rotations only, and rotations plus uniform scaling. This is similar to your 4 DOF variant. I'd be interested in adding a 3 DOF version that only allows a (2D) rotation plus translation, for example. --Matthew
OK. I think I have solved the issue with coincident points. haveCollinearPoints in fact also checks coincidence, because when two points are coincident, the check will essentially be 0 <= FLT_EPSILON * (abs(dx2)+abs(dy2)), so it should report coincidence correctly.
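To illustrate why a cross-product collinearity test also catches coincident points, here is a minimal sketch. The `Pt` struct and the `nearlyCollinear` helper are hypothetical stand-ins, and the exact form of the tolerance is an assumption modeled on the inequality quoted above, not a copy of the library code.

```cpp
#include <cfloat>
#include <cmath>

// Hypothetical stand-in for a 2D point type (no OpenCV dependency).
struct Pt { double x, y; };

// Returns true when p, q and r are (nearly) collinear.
// The tolerance form is an assumption for illustration.
bool nearlyCollinear(const Pt& p, const Pt& q, const Pt& r)
{
    const double dx1 = q.x - p.x, dy1 = q.y - p.y;
    const double dx2 = r.x - p.x, dy2 = r.y - p.y;
    // The 2D cross product is zero for collinear direction vectors.
    const double cross = dx2 * dy1 - dy2 * dx1;
    const double tol = FLT_EPSILON * (std::fabs(dx1) + std::fabs(dy1) +
                                      std::fabs(dx2) + std::fabs(dy2));
    // If p == q, then dx1 == dy1 == 0, so cross == 0 and the test reduces to
    // 0 <= FLT_EPSILON * (|dx2| + |dy2|), which always holds.
    return std::fabs(cross) <= tol;
}
```

With this form, a subset containing a duplicated point is rejected the same way as a degenerate collinear subset, so no separate coincidence test is needed.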
I have reworked checkSubset for estimateAffine*, so that there is no duplicate code. After all, the functions should just be more robust.
I have experimented with solving the system in the affinePartial callback analytically. In my experiments it surprisingly runs slower than the SVD version.
SVD:
calib3d_posix_x64_5074d25_20160811-122223.xml
Columns: Name of Test; Number of collected samples; Number of outliers; Min; Median; Geometric mean; Mean; Standard deviation
EstimateAffine2D::EstimateAffine::(100, 0.9) 38 3 0.15 ms 0.15 ms 0.15 ms 0.15 ms 0.00 ms
EstimateAffine2D::EstimateAffine::(100, 0.95) 38 3 0.17 ms 0.18 ms 0.18 ms 0.18 ms 0.00 ms
EstimateAffine2D::EstimateAffine::(100, 0.99) 36 2 0.25 ms 0.25 ms 0.25 ms 0.25 ms 0.01 ms
EstimateAffine2D::EstimateAffine::(100000, 0.9) 10 0 31.49 ms 31.63 ms 31.65 ms 31.65 ms 0.15 ms
EstimateAffine2D::EstimateAffine::(100000, 0.95) 10 0 34.32 ms 34.40 ms 34.47 ms 34.47 ms 0.25 ms
EstimateAffine2D::EstimateAffine::(100000, 0.99) 10 0 41.12 ms 41.34 ms 41.63 ms 41.64 ms 0.81 ms
EstimateAffine2D::EstimateAffine::(5000, 0.9) 13 1 1.23 ms 1.23 ms 1.24 ms 1.24 ms 0.01 ms
EstimateAffine2D::EstimateAffine::(5000, 0.95) 13 1 1.39 ms 1.39 ms 1.40 ms 1.40 ms 0.01 ms
EstimateAffine2D::EstimateAffine::(5000, 0.99) 38 3 1.77 ms 1.78 ms 1.81 ms 1.81 ms 0.05 ms
EstimateAffinePartial2D::EstimateAffine::(100, 0.9) 100 8 0.06 ms 0.06 ms 0.06 ms 0.06 ms 0.00 ms
EstimateAffinePartial2D::EstimateAffine::(100, 0.95) 75 6 0.07 ms 0.07 ms 0.07 ms 0.07 ms 0.00 ms
EstimateAffinePartial2D::EstimateAffine::(100, 0.99) 63 5 0.08 ms 0.08 ms 0.09 ms 0.09 ms 0.00 ms
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9) 10 0 17.02 ms 17.07 ms 17.10 ms 17.10 ms 0.08 ms
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95) 10 0 17.59 ms 17.72 ms 17.73 ms 17.73 ms 0.12 ms
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99) 10 0 20.04 ms 20.10 ms 20.19 ms 20.19 ms 0.22 ms
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9) 13 1 0.50 ms 0.50 ms 0.50 ms 0.50 ms 0.01 ms
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95) 13 1 0.58 ms 0.58 ms 0.58 ms 0.58 ms 0.01 ms
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99) 13 1 0.71 ms 0.71 ms 0.72 ms 0.72 ms 0.02 ms
analytic:
calib3d_posix_x64_5074d25_20160811-161048.xml
Columns: Name of Test; Number of collected samples; Number of outliers; Min; Median; Geometric mean; Mean; Standard deviation
EstimateAffine2D::EstimateAffine::(100, 0.9) 63 5 0.15 ms 0.15 ms 0.15 ms 0.15 ms 0.00 ms
EstimateAffine2D::EstimateAffine::(100, 0.95) 38 3 0.18 ms 0.18 ms 0.18 ms 0.18 ms 0.00 ms
EstimateAffine2D::EstimateAffine::(100, 0.99) 33 2 0.25 ms 0.25 ms 0.26 ms 0.26 ms 0.01 ms
EstimateAffine2D::EstimateAffine::(100000, 0.9) 10 0 31.36 ms 31.43 ms 31.56 ms 31.56 ms 0.32 ms
EstimateAffine2D::EstimateAffine::(100000, 0.95) 10 0 34.16 ms 34.27 ms 34.39 ms 34.39 ms 0.35 ms
EstimateAffine2D::EstimateAffine::(100000, 0.99) 10 0 41.20 ms 41.35 ms 42.09 ms 42.11 ms 1.15 ms
EstimateAffine2D::EstimateAffine::(5000, 0.9) 10 0 1.21 ms 1.22 ms 1.23 ms 1.24 ms 0.03 ms
EstimateAffine2D::EstimateAffine::(5000, 0.95) 10 0 1.36 ms 1.37 ms 1.38 ms 1.38 ms 0.03 ms
EstimateAffine2D::EstimateAffine::(5000, 0.99) 13 1 1.76 ms 1.77 ms 1.79 ms 1.79 ms 0.05 ms
EstimateAffinePartial2D::EstimateAffine::(100, 0.9) 42 3 0.05 ms 0.05 ms 0.05 ms 0.05 ms 0.00 ms
EstimateAffinePartial2D::EstimateAffine::(100, 0.95) 25 2 0.06 ms 0.06 ms 0.06 ms 0.06 ms 0.00 ms
EstimateAffinePartial2D::EstimateAffine::(100, 0.99) 42 3 0.07 ms 0.07 ms 0.07 ms 0.07 ms 0.00 ms
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9) 10 0 17.07 ms 17.15 ms 17.19 ms 17.20 ms 0.15 ms
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95) 10 0 17.71 ms 17.84 ms 17.86 ms 17.86 ms 0.12 ms
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99) 10 0 20.26 ms 20.34 ms 20.51 ms 20.51 ms 0.46 ms
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9) 13 1 0.49 ms 0.49 ms 0.50 ms 0.50 ms 0.01 ms
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95) 11 0 0.57 ms 0.57 ms 0.58 ms 0.58 ms 0.02 ms
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99) 19 1 0.69 ms 0.70 ms 0.71 ms 0.71 ms 0.02 ms
You can find the code at 57bf2d46cd7b2911cd518a2bdbe745cb43f95da8
That's surprising! If you're up for more experiments, I realized that you can solve the entire kernel analytically without even a matrix multiply. This should be even faster. I can't see how SVD could be faster than this!
double x1 = from[0].x;
double y1 = from[0].y;
double x2 = from[1].x;
double y2 = from[1].y;
double X1 = to[0].x;
double Y1 = to[0].y;
double X2 = to[1].x;
double Y2 = to[1].y;
double d = 1./((x1-x2)*(x1-x2) + (y1-y2)*(y1-y2));
Xdata[0] = d * ( (X1-X2)*(x1-x2) + (Y1-Y2)*(y1-y2) );
Xdata[1] = d * ( (Y1-Y2)*(x1-x2) - (X1-X2)*(y1-y2) );
Xdata[2] = d * ( (Y1-Y2)*(x1*y2 - x2*y1) - (X1*y2 - X2*y1)*(y1-y2) - (X1*x2 - X2*x1)*(x1-x2) );
Xdata[3] = d * (-(X1-X2)*(x1*y2 - x2*y1) - (Y1*x2 - Y2*x1)*(x1-x2) - (Y1*y2 - Y2*y1)*(y1-y2) );
The compiler should be able to optimize all of the common subexpressions and there are no function calls.
Yep, I was also surprised. I think that the kernel is not a bottleneck for the function. But I will try your version; that seems even better.
I will optimize copying inliers, which is currently quite inefficient; that could also speed things up.
I have updated the perf test to use the new API. It seems that a lot of time is spent in Levenberg-Marquardt refining. RANSAC is only about 1/2 or even 1/3 of the runtime. LMEDS takes much longer, so a faster kernel makes more sense here. Here are the current results with SVD-based kernels:
calib3d_posix_x64_d9a138e_20160812-152109.xml
Columns: Name of Test; Number of collected samples; Number of outliers; Min; Median; Geometric mean; Mean; Standard deviation
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0) 13 1 0.07 ms 0.07 ms 0.07 ms 0.07 ms 0.00 ms
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10) 73 5 0.10 ms 0.10 ms 0.10 ms 0.10 ms 0.00 ms
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0) 38 3 0.11 ms 0.11 ms 0.11 ms 0.11 ms 0.00 ms
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10) 13 1 0.15 ms 0.16 ms 0.16 ms 0.16 ms 0.00 ms
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0) 50 4 0.07 ms 0.08 ms 0.08 ms 0.08 ms 0.00 ms
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10) 63 5 0.10 ms 0.10 ms 0.10 ms 0.10 ms 0.00 ms
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0) 63 5 0.14 ms 0.14 ms 0.14 ms 0.14 ms 0.00 ms
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10) 25 2 0.17 ms 0.17 ms 0.17 ms 0.17 ms 0.00 ms
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0) 25 2 0.12 ms 0.12 ms 0.13 ms 0.13 ms 0.00 ms
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10) 25 2 0.16 ms 0.16 ms 0.16 ms 0.16 ms 0.00 ms
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0) 13 1 0.21 ms 0.21 ms 0.21 ms 0.21 ms 0.01 ms
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10) 35 2 0.24 ms 0.24 ms 0.25 ms 0.25 ms 0.01 ms
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 10 0 79.69 ms 79.82 ms 79.93 ms 79.93 ms 0.36 ms
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 10 0 137.52 ms 138.79 ms 138.58 ms 138.58 ms 0.54 ms
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 10 0 10.13 ms 10.20 ms 10.25 ms 10.25 ms 0.17 ms
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 10 0 38.84 ms 39.01 ms 39.04 ms 39.04 ms 0.17 ms
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 10 0 97.95 ms 98.12 ms 98.31 ms 98.31 ms 0.72 ms
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 10 0 156.45 ms 157.31 ms 157.32 ms 157.32 ms 0.81 ms
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 10 0 12.92 ms 13.03 ms 13.17 ms 13.17 ms 0.33 ms
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 10 0 41.73 ms 41.84 ms 42.12 ms 42.12 ms 0.44 ms
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 10 0 152.33 ms 152.64 ms 153.70 ms 153.73 ms 2.95 ms
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 10 0 174.49 ms 174.91 ms 175.25 ms 175.25 ms 1.00 ms
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 10 0 20.06 ms 20.22 ms 20.33 ms 20.33 ms 0.29 ms
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 10 0 48.78 ms 49.49 ms 49.80 ms 49.82 ms 1.47 ms
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 10 0 2.98 ms 2.99 ms 3.00 ms 3.00 ms 0.02 ms
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 10 0 4.25 ms 4.37 ms 4.36 ms 4.36 ms 0.08 ms
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 10 0 0.56 ms 0.56 ms 0.57 ms 0.57 ms 0.02 ms
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 13 1 1.21 ms 1.21 ms 1.22 ms 1.22 ms 0.01 ms
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 10 0 3.65 ms 3.67 ms 3.68 ms 3.68 ms 0.03 ms
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 10 0 4.82 ms 4.86 ms 4.86 ms 4.86 ms 0.03 ms
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 10 0 0.72 ms 0.72 ms 0.73 ms 0.73 ms 0.02 ms
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 13 1 1.36 ms 1.37 ms 1.38 ms 1.38 ms 0.01 ms
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 10 0 5.71 ms 5.72 ms 5.74 ms 5.74 ms 0.03 ms
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 10 0 6.88 ms 6.92 ms 6.93 ms 6.93 ms 0.06 ms
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 10 0 1.11 ms 1.11 ms 1.12 ms 1.12 ms 0.02 ms
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 13 1 1.75 ms 1.76 ms 1.77 ms 1.77 ms 0.01 ms
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0) 13 1 0.02 ms 0.02 ms 0.03 ms 0.03 ms 0.00 ms
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10) 100 8 0.05 ms 0.06 ms 0.06 ms 0.06 ms 0.00 ms
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0) 13 1 0.03 ms 0.03 ms 0.03 ms 0.03 ms 0.00 ms
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10) 75 6 0.06 ms 0.06 ms 0.06 ms 0.06 ms 0.00 ms
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0) 25 2 0.03 ms 0.03 ms 0.03 ms 0.03 ms 0.00 ms
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10) 88 7 0.06 ms 0.06 ms 0.06 ms 0.06 ms 0.00 ms
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0) 63 5 0.04 ms 0.04 ms 0.04 ms 0.04 ms 0.00 ms
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10) 63 5 0.07 ms 0.07 ms 0.07 ms 0.07 ms 0.00 ms
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0) 13 1 0.05 ms 0.05 ms 0.05 ms 0.05 ms 0.00 ms
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10) 100 8 0.06 ms 0.07 ms 0.07 ms 0.07 ms 0.00 ms
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0) 38 3 0.05 ms 0.06 ms 0.06 ms 0.06 ms 0.00 ms
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10) 50 4 0.08 ms 0.08 ms 0.08 ms 0.08 ms 0.00 ms
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 10 0 36.71 ms 36.79 ms 36.98 ms 36.99 ms 0.44 ms
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 10 0 53.86 ms 53.94 ms 54.06 ms 54.06 ms 0.28 ms
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 13 1 3.89 ms 3.92 ms 3.93 ms 3.93 ms 0.04 ms
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 10 0 14.11 ms 14.25 ms 14.28 ms 14.28 ms 0.16 ms
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 10 0 48.91 ms 48.98 ms 49.02 ms 49.02 ms 0.15 ms
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 10 0 66.05 ms 66.24 ms 66.33 ms 66.33 ms 0.32 ms
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 13 1 5.02 ms 5.05 ms 5.06 ms 5.06 ms 0.03 ms
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 10 0 15.26 ms 15.36 ms 15.40 ms 15.41 ms 0.16 ms
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 10 0 79.02 ms 79.15 ms 79.19 ms 79.19 ms 0.18 ms
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 10 0 96.13 ms 96.40 ms 96.44 ms 96.44 ms 0.28 ms
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 10 0 7.54 ms 7.59 ms 7.66 ms 7.66 ms 0.22 ms
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 10 0 17.79 ms 18.22 ms 18.26 ms 18.27 ms 0.51 ms
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 10 0 1.37 ms 1.38 ms 1.39 ms 1.39 ms 0.02 ms
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 10 0 1.84 ms 1.85 ms 1.86 ms 1.86 ms 0.05 ms
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 13 1 0.20 ms 0.20 ms 0.21 ms 0.21 ms 0.01 ms
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 25 2 0.60 ms 0.60 ms 0.61 ms 0.61 ms 0.01 ms
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 10 0 1.83 ms 1.90 ms 1.90 ms 1.90 ms 0.03 ms
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 10 0 2.40 ms 2.45 ms 2.45 ms 2.45 ms 0.04 ms
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 13 1 0.26 ms 0.26 ms 0.27 ms 0.27 ms 0.01 ms
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 14 1 0.66 ms 0.66 ms 0.67 ms 0.67 ms 0.02 ms
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 10 0 2.96 ms 3.01 ms 3.02 ms 3.03 ms 0.05 ms
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 10 0 3.35 ms 3.47 ms 3.45 ms 3.45 ms 0.06 ms
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 10 0 0.39 ms 0.40 ms 0.40 ms 0.40 ms 0.01 ms
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 13 1 0.79 ms 0.80 ms 0.80 ms 0.80 ms 0.02 ms
OK, cool. Perhaps the reason that findHomography() has the final runKernel() call that passes all of the consensus points is to improve the starting estimate for LM so that it takes fewer iterations. I noted that it took ~3 with the runKernel() call and ~5 without it. The results appeared to be virtually identical, so it may be about performance rather than accuracy. Or maybe it is about stability, since findHomography() is a lot less numerically stable.
Now that you've updated the APIs, it would be interesting to compare the performance of the analytic runKernel() with niters = 0 to remove the LM part.
Yes, I will definitely try that. I think it could speed up LMEDS, as it takes much more time than RANSAC.
I'm not so sure. I think that LMedS makes roughly the same number of calls to runKernel() as RANSAC does. I think LMedS is slower because calculating the median error is slower than calculating the average error. I did notice that LMeDSPointSetRegistrator::run() calls std::sort() rather than std::nth_element(), which could be much faster for large point sets since it only does a partial sort instead of a full sort. It's order n rather than n log n. For 64 points it could be ~5x faster (although probably less in practice). Probably worth measuring.
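The difference between the two approaches can be sketched on a plain std::vector. The helper names below are hypothetical, and both functions assume a non-empty input. Note one behavioral difference: the sort-based version averages the two middle values for an even count, while the selection-based version returns the upper of the two.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Median via a full sort: O(n log n).
// Averages the two middle values when the count is even.
float medianBySort(std::vector<float> v)
{
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return n % 2 != 0 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) * 0.5f;
}

// Upper median via partial selection: O(n) on average.
// Only the element at position n/2 is guaranteed to be in sorted position.
float upperMedianByNthElement(std::vector<float> v)
{
    const std::size_t mid = v.size() / 2;
    std::nth_element(v.begin(), v.begin() + mid, v.end());
    return v[mid];
}
```

For odd counts the two functions agree. For even counts they can differ slightly, which is usually acceptable when the median is only used as a robust score for ranking candidate models.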
Bottom line, while it was nice to make runKernel() a lot faster, it doesn't appear to be a major factor in the performance of EstimateAffinePartial2D().
I have tested the analytical version of the kernels. In line with the previous test, the kernels do not seem to be the bottleneck for the functions, but the analytical version is slightly faster. The analytical version is based on the suggestions of @mself (thank you), with some typos fixed. I have tuned the kernels so that more can be stored in registers and to avoid copying the model.
I think we can include this. I have extended the tests to make sure it is still correct and added some extensive comments to explain what is happening in the kernels.
Commits to come.
Geometric mean
Columns: Name of Test; calib3d posix x64 d9a138e 20160812-152109; calib3d posix x64 2539bf1 20160814-114732; calib3d posix x64 2539bf1 20160814-114732 vs calib3d posix x64 d9a138e 20160812-152109 (x-factor)
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0) 0.068 ms 0.032 ms 2.10
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10) 0.104 ms 0.074 ms 1.40
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0) 0.110 ms 0.029 ms 3.84
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10) 0.156 ms 0.068 ms 2.30
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0) 0.076 ms 0.040 ms 1.89
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10) 0.100 ms 0.059 ms 1.70
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0) 0.139 ms 0.036 ms 3.89
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10) 0.174 ms 0.072 ms 2.40
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0) 0.126 ms 0.069 ms 1.83
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10) 0.160 ms 0.092 ms 1.73
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0) 0.212 ms 0.052 ms 4.05
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10) 0.247 ms 0.083 ms 2.96
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 2.998 ms 2.957 ms 1.01
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 4.360 ms 4.072 ms 1.07
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 0.569 ms 0.485 ms 1.17
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 1.217 ms 1.179 ms 1.03
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 3.681 ms 3.634 ms 1.01
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 4.855 ms 4.742 ms 1.02
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 0.727 ms 0.618 ms 1.18
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 1.377 ms 1.308 ms 1.05
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 5.740 ms 5.670 ms 1.01
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 6.927 ms 6.775 ms 1.02
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 1.118 ms 0.954 ms 1.17
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 1.767 ms 1.643 ms 1.08
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 79.927 ms 79.734 ms 1.00
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 138.580 ms 137.715 ms 1.01
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 10.246 ms 9.905 ms 1.03
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 39.036 ms 38.296 ms 1.02
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 98.311 ms 97.922 ms 1.00
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 157.320 ms 155.373 ms 1.01
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 13.169 ms 12.682 ms 1.04
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 42.121 ms 41.160 ms 1.02
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 153.704 ms 152.232 ms 1.01
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 175.252 ms 174.557 ms 1.00
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 20.325 ms 19.530 ms 1.04
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 49.804 ms 48.300 ms 1.03
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0) 0.025 ms 0.014 ms 1.80
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10) 0.056 ms 0.047 ms 1.20
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0) 0.034 ms 0.013 ms 2.58
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10) 0.060 ms 0.042 ms 1.41
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0) 0.032 ms 0.018 ms 1.77
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10) 0.063 ms 0.050 ms 1.27
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0) 0.039 ms 0.015 ms 2.60
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10) 0.067 ms 0.042 ms 1.59
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0) 0.052 ms 0.030 ms 1.75
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10) 0.066 ms 0.047 ms 1.40
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0) 0.056 ms 0.020 ms 2.76
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10) 0.085 ms 0.048 ms 1.76
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 1.389 ms 1.375 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 1.864 ms 1.989 ms 0.94
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 0.208 ms 0.186 ms 1.11
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 0.607 ms 0.618 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 1.901 ms 1.819 ms 1.05
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 2.454 ms 2.422 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 0.265 ms 0.241 ms 1.10
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 0.671 ms 0.663 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 3.025 ms 2.956 ms 1.02
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 3.445 ms 3.256 ms 1.06
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 0.403 ms 0.364 ms 1.11
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 0.805 ms 0.783 ms 1.03
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 36.983 ms 36.766 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 54.060 ms 59.627 ms 0.91
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 3.930 ms 3.819 ms 1.03
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 14.283 ms 13.964 ms 1.02
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 49.024 ms 48.956 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 66.325 ms 71.767 ms 0.92
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 5.056 ms 4.918 ms 1.03
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 15.405 ms 15.012 ms 1.03
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 79.194 ms 79.118 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 96.442 ms 102.042 ms 0.95
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 7.659 ms 7.415 ms 1.03
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 18.260 ms 17.829 ms 1.02
That's great! Thank you for integrating this. It makes sense that the perf improvement is only apparent when the number of points is small. When there are a large number of points, the time is spent evaluating the error of the model rather than generating the model. In my application, the number of points is always <= 200, so this improvement is quite significant.
A much larger performance improvement can be made for LMedS in LMeDSPointSetRegistrator::run() by replacing
std::sort(errf.ptr<int>(), errf.ptr<int>() + count);
double median = count % 2 != 0 ?
errf.at<float>(count/2) : (errf.at<float>(count/2-1) + errf.at<float>(count/2))*0.5;
with
std::nth_element(errf.ptr<int>(), errf.ptr<int>() + count/2, errf.ptr<int>() + count);
double median = errf.at<float>(count/2);
It reduces the run time from n log n to n, so it has the most impact on large point sets. It makes LMedS run up to 5x faster for the largest perf test:
Geometric mean
Columns: Name of Test; calib3d posix x64 86e6f89 20160814-152408; calib3d posix x64 86e6f89 20160814-153654; calib3d posix x64 86e6f89 20160814-153654 vs calib3d posix x64 86e6f89 20160814-152408 (x-factor)
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0) 0.025 ms 0.014 ms 1.73
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10) 0.057 ms 0.045 ms 1.26
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0) 0.026 ms 0.026 ms 1.00
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10) 0.050 ms 0.050 ms 1.00
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0) 0.031 ms 0.017 ms 1.86
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10) 0.047 ms 0.031 ms 1.49
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0) 0.031 ms 0.032 ms 0.98
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10) 0.055 ms 0.056 ms 0.98
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0) 0.054 ms 0.024 ms 2.22
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10) 0.071 ms 0.039 ms 1.80
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0) 0.046 ms 0.046 ms 1.00
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10) 0.070 ms 0.070 ms 1.00
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 2.557 ms 0.639 ms 4.00
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 3.544 ms 1.694 ms 2.09
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 0.437 ms 0.440 ms 0.99
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 0.988 ms 0.990 ms 1.00
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 3.116 ms 0.773 ms 4.03
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 4.101 ms 1.808 ms 2.27
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 0.553 ms 0.556 ms 0.99
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 1.098 ms 1.120 ms 0.98
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 4.861 ms 1.207 ms 4.03
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 5.808 ms 2.210 ms 2.63
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 0.842 ms 0.852 ms 0.99
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 1.418 ms 1.436 ms 0.99
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 68.418 ms 12.655 ms 5.41
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 120.279 ms 63.391 ms 1.90
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 8.614 ms 8.565 ms 1.01
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 23.824 ms 23.092 ms 1.03
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 87.441 ms 15.609 ms 5.60
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 140.613 ms 67.021 ms 2.10
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 10.770 ms 10.712 ms 1.01
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 26.324 ms 25.426 ms 1.04
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 132.352 ms 22.916 ms 5.78
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 146.041 ms 37.414 ms 3.90
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 15.795 ms 15.887 ms 0.99
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 30.257 ms 31.005 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0) 0.012 ms 0.009 ms 1.36
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10) 0.035 ms 0.032 ms 1.10
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0) 0.012 ms 0.012 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10) 0.032 ms 0.032 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0) 0.015 ms 0.011 ms 1.43
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10) 0.038 ms 0.033 ms 1.15
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0) 0.014 ms 0.014 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10) 0.034 ms 0.035 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0) 0.024 ms 0.014 ms 1.67
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10) 0.036 ms 0.025 ms 1.43
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0) 0.019 ms 0.019 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10) 0.039 ms 0.039 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 1.255 ms 0.321 ms 3.91
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 1.750 ms 0.873 ms 2.01
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 0.182 ms 0.183 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 0.533 ms 0.539 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 1.621 ms 0.402 ms 4.04
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 2.141 ms 0.956 ms 2.24
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 0.229 ms 0.230 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 0.580 ms 0.587 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 2.560 ms 0.633 ms 4.04
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 2.809 ms 0.895 ms 3.14
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 0.346 ms 0.337 ms 1.03
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 0.685 ms 0.685 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 31.660 ms 5.983 ms 5.29
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 46.766 ms 19.892 ms 2.35
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 3.758 ms 3.526 ms 1.07
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 9.754 ms 9.619 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 43.937 ms 7.550 ms 5.82
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 56.864 ms 21.212 ms 2.68
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 4.469 ms 4.478 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 10.243 ms 10.479 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 67.980 ms 12.035 ms 5.65
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 81.435 ms 25.899 ms 3.14
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 6.347 ms 6.354 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 12.225 ms 12.532 ms 0.98
With the change, LMedS is never more than ~2x slower than RANSAC. For the 100 point tests, it's now faster than RANSAC.
You can also get a 5-10% overall speedup by changing Affine2DEstimatorCallback::computeError() to use float instead of double for the intermediate results. The result is a float, in any case.
Note that HomographyEstimatorCallback::computeError() already uses float like this.
float F0 = F[0], F1 = F[1], F2 = F[2], F3 = F[3], F4 = F[4], F5 = F[5];
for(int i = 0; i < count; i++ )
{
    const Point2f& f = from[i];
    const Point2f& t = to[i];
    float a = F0*f.x + F1*f.y + F2 - t.x;
    float b = F3*f.x + F4*f.y + F5 - t.y;
    errptr[i] = a*a + b*b;
}
Is there a way that this could be vectorized with SSE? That could make a really significant difference.
I tried writing an SSE version of Affine2DEstimatorCallback::computeError(), since it seems to be the bottleneck for estimateAffine2D(). The SSE version increases the overall performance of estimateAffine2D() by 10-20% in most cases compared to the float version above. In some cases, it increased overall performance by 2x!
void computeError( InputArray _m1, InputArray _m2, InputArray _model, OutputArray _err ) const
{
    Mat m1 = _m1.getMat(), m2 = _m2.getMat(), model = _model.getMat();
    const Point2f* from = m1.ptr<Point2f>();
    const Point2f* to = m2.ptr<Point2f>();
    const double* F = model.ptr<double>();
    int count = m1.checkVector(2);
    CV_Assert( count > 0 );
    _err.create(count, 1, CV_32F);
    Mat err = _err.getMat();
    float* errptr = err.ptr<float>();
    float F0 = F[0], F1 = F[1], F2 = F[2], F3 = F[3], F4 = F[4], F5 = F[5];
#if CV_SSE2
    if( checkHardwareSupport(CV_CPU_SSE2))
    {
        int i;
        // Load 4 copies of each model param into registers
        const __m128 mm_F0 = _mm_set1_ps(F0), mm_F1 = _mm_set1_ps(F1), mm_F2 = _mm_set1_ps(F2);
        const __m128 mm_F3 = _mm_set1_ps(F3), mm_F4 = _mm_set1_ps(F4), mm_F5 = _mm_set1_ps(F5);
        if ((( (intptr_t)from & 0xf ) == 0) && (( (intptr_t)to & 0xf ) == 0) && (( (intptr_t)errptr & 0xf ) == 0))
        {
            // Aligned case - use _mm_load_ps() and _mm_store_ps()
            for(i = 0; i < count - 3; i += 4 )
            {
                // Load 4 'from' points into two registers
                const __m128 mm_from_0 = _mm_load_ps(&from[i].x);
                const __m128 mm_from_2 = _mm_load_ps(&from[i+2].x);
                // Shuffle the x values into one register and the y values into another
                const __m128 mm_fx = _mm_shuffle_ps(mm_from_0, mm_from_2, _MM_SHUFFLE(2, 0, 2, 0));
                const __m128 mm_fy = _mm_shuffle_ps(mm_from_0, mm_from_2, _MM_SHUFFLE(3, 1, 3, 1));
                // Repeat for the 'to' points
                const __m128 mm_to_0 = _mm_load_ps(&to[i].x);
                const __m128 mm_to_2 = _mm_load_ps(&to[i+2].x);
                const __m128 mm_tx = _mm_shuffle_ps(mm_to_0, mm_to_2, _MM_SHUFFLE(2, 0, 2, 0));
                const __m128 mm_ty = _mm_shuffle_ps(mm_to_0, mm_to_2, _MM_SHUFFLE(3, 1, 3, 1));
                // Compute error for 4 points at a time
                const __m128 mm_a = _mm_add_ps(_mm_add_ps(_mm_mul_ps(mm_F0, mm_fx),
                                                          _mm_mul_ps(mm_F1, mm_fy)),
                                               _mm_sub_ps(mm_F2, mm_tx));
                const __m128 mm_b = _mm_add_ps(_mm_add_ps(_mm_mul_ps(mm_F3, mm_fx),
                                                          _mm_mul_ps(mm_F4, mm_fy)),
                                               _mm_sub_ps(mm_F5, mm_ty));
                // Store 4 results
                _mm_store_ps(&errptr[i], _mm_add_ps(_mm_mul_ps(mm_a, mm_a), _mm_mul_ps(mm_b, mm_b)));
            }
        }
        else
        {
            // Unaligned case - use _mm_loadu_ps() and _mm_storeu_ps()
for(i = 0; i < count - 3; i += 4 )
{
const __m128 mm_from01 = _mm_loadu_ps(&from[i].x);
const __m128 mm_from23 = _mm_loadu_ps(&from[i+2].x);
const __m128 mm_fx = _mm_shuffle_ps(mm_from01, mm_from23, _MM_SHUFFLE(2, 0, 2, 0));
const __m128 mm_fy = _mm_shuffle_ps(mm_from01, mm_from23, _MM_SHUFFLE(3, 1, 3, 1));
const __m128 mm_to01 = _mm_loadu_ps(&to[i].x);
const __m128 mm_to23 = _mm_loadu_ps(&to[i+2].x);
const __m128 mm_tx = _mm_shuffle_ps(mm_to01, mm_to23, _MM_SHUFFLE(2, 0, 2, 0));
const __m128 mm_ty = _mm_shuffle_ps(mm_to01, mm_to23, _MM_SHUFFLE(3, 1, 3, 1));
const __m128 mm_a = _mm_add_ps(_mm_add_ps(_mm_mul_ps(mm_F0, mm_fx),
_mm_mul_ps(mm_F1, mm_fy)),
_mm_sub_ps(mm_F2, mm_tx));
const __m128 mm_b = _mm_add_ps(_mm_add_ps(_mm_mul_ps(mm_F3, mm_fx),
_mm_mul_ps(mm_F4, mm_fy)),
_mm_sub_ps(mm_F5, mm_ty));
_mm_storeu_ps(&errptr[i], _mm_add_ps(_mm_mul_ps(mm_a, mm_a), _mm_mul_ps(mm_b, mm_b)));
}
}
// Finish any remaining points
for( ; i < count; i++ )
{
const Point2f& f = from[i];
const Point2f& t = to[i];
float a = F0*f.x + F1*f.y + F2 - t.x;
float b = F3*f.x + F4*f.y + F5 - t.y;
errptr[i] = a*a + b*b;
}
}
else
#endif
{
for(int i = 0; i < count; i++ )
{
const Point2f& f = from[i];
const Point2f& t = to[i];
float a = F0*f.x + F1*f.y + F2 - t.x;
float b = F3*f.x + F4*f.y + F5 - t.y;
errptr[i] = a*a + b*b;
}
}
}
Here are the perf results:
Geometric mean (all runs: calib3d, posix, x64, 86e6f89)
Name of Test    20160814-nosse    20160815-005157    x-factor (20160815-005157 vs 20160814-nosse)    score (20160815-005157 vs 20160814-nosse)
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0) 0.014 ms 0.013 ms 1.12 FASTER
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10) 0.042 ms 0.041 ms 1.03
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0) 0.024 ms 0.019 ms 1.22 FASTER
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10) 0.047 ms 0.044 ms 1.08 FASTER
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0) 0.016 ms 0.014 ms 1.16 FASTER
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10) 0.031 ms 0.029 ms 1.08 faster
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0) 0.029 ms 0.023 ms 1.22 FASTER
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10) 0.053 ms 0.048 ms 1.10 FASTER
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0) 0.024 ms 0.020 ms 1.19 FASTER
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10) 0.039 ms 0.036 ms 1.08 faster
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0) 0.042 ms 0.034 ms 1.24 FASTER
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10) 0.067 ms 0.058 ms 1.14 FASTER
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 0.578 ms 0.519 ms 1.11 FASTER
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 1.618 ms 1.552 ms 1.04
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 0.393 ms 0.197 ms 1.99 FASTER
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 0.940 ms 0.746 ms 1.26 FASTER
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 0.730 ms 0.626 ms 1.17 FASTER
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 1.757 ms 1.674 ms 1.05
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 0.495 ms 0.245 ms 2.02 FASTER
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 1.045 ms 0.793 ms 1.32 FASTER
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 1.141 ms 0.978 ms 1.17 FASTER
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 1.612 ms 1.478 ms 1.09 faster
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 0.756 ms 0.364 ms 2.08 FASTER
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 1.303 ms 0.916 ms 1.42 FASTER
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 11.808 ms 10.525 ms 1.12 faster
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 59.428 ms 58.376 ms 1.02
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 7.486 ms 3.913 ms 1.91 FASTER
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 22.308 ms 18.235 ms 1.22 FASTER
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 14.539 ms 13.032 ms 1.12 faster
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 64.788 ms 64.195 ms 1.01
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 9.245 ms 4.714 ms 1.96 FASTER
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 24.037 ms 20.133 ms 1.19 faster
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 21.544 ms 19.227 ms 1.12 faster
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 35.347 ms 33.207 ms 1.06 faster
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 13.720 ms 6.850 ms 2.00 FASTER
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 28.917 ms 21.459 ms 1.35 FASTER
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0) 0.009 ms 0.008 ms 1.14 FASTER
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10) 0.032 ms 0.031 ms 1.04
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0) 0.011 ms 0.010 ms 1.16 FASTER
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10) 0.031 ms 0.030 ms 1.05 faster
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0) 0.011 ms 0.009 ms 1.16 FASTER
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10) 0.033 ms 0.032 ms 1.05
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0) 0.013 ms 0.011 ms 1.17 FASTER
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10) 0.033 ms 0.031 ms 1.05 faster
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0) 0.014 ms 0.012 ms 1.15 FASTER
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10) 0.026 ms 0.025 ms 1.07 faster
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0) 0.018 ms 0.015 ms 1.21 FASTER
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10) 0.038 ms 0.035 ms 1.09 faster
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 0.305 ms 0.260 ms 1.17 FASTER
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 0.834 ms 0.804 ms 1.04
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 0.167 ms 0.093 ms 1.80 FASTER
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 0.514 ms 0.447 ms 1.15 FASTER
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 0.385 ms 0.330 ms 1.17 FASTER
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 0.914 ms 0.863 ms 1.06 faster
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 0.207 ms 0.112 ms 1.85 FASTER
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 0.557 ms 0.459 ms 1.21 FASTER
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 0.611 ms 0.517 ms 1.18 FASTER
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 0.857 ms 0.763 ms 1.12 FASTER
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 0.300 ms 0.155 ms 1.94 FASTER
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 0.654 ms 0.508 ms 1.29 FASTER
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 5.817 ms 5.101 ms 1.14 faster
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 19.439 ms 18.683 ms 1.04
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 3.140 ms 1.789 ms 1.76 FASTER
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 9.077 ms 7.911 ms 1.15 FASTER
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 7.418 ms 6.648 ms 1.12 faster
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 21.130 ms 19.977 ms 1.06
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 3.923 ms 2.171 ms 1.81 FASTER
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 9.810 ms 8.307 ms 1.18 FASTER
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 11.445 ms 10.283 ms 1.11 faster
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 25.540 ms 23.906 ms 1.07 faster
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 5.537 ms 2.965 ms 1.87 FASTER
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 11.431 ms 9.099 ms 1.26 FASTER
I haven't written an SSE function before, so someone with more experience might be able to improve it. I didn't see much support for AVX in OpenCV, but that could increase the stride from 4 to 8 points per iteration.
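For reference, here is a rough sketch of what an 8-points-per-iteration AVX variant could look like. It has not been run against the perf suite and rests on several assumptions: `Pt` stands in for `cv::Point2f`, `affineErrors` and `deinterleave8` are hypothetical free functions rather than the real callback, and the AVX path is only compiled when the compiler defines `__AVX__` (otherwise only the scalar loop runs). The extra `_mm256_permute2f128_ps` step is needed because `_mm256_shuffle_ps` shuffles within each 128-bit half separately.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Stand-in for cv::Point2f so the sketch builds without OpenCV (assumption).
struct Pt { float x, y; };
static_assert(sizeof(Pt) == 2 * sizeof(float), "points must be packed x,y pairs");

#if defined(__AVX__)
// Load 8 interleaved points (x0 y0 ... x7 y7) and split them into xs and ys.
static inline void deinterleave8(const float* p, __m256& xs, __m256& ys)
{
    const __m256 a = _mm256_loadu_ps(p);      // x0 y0 x1 y1 | x2 y2 x3 y3
    const __m256 b = _mm256_loadu_ps(p + 8);  // x4 y4 x5 y5 | x6 y6 x7 y7
    // Regroup the 128-bit halves so the in-lane shuffle keeps point order.
    const __m256 lo = _mm256_permute2f128_ps(a, b, 0x20); // x0 y0 x1 y1 | x4 y4 x5 y5
    const __m256 hi = _mm256_permute2f128_ps(a, b, 0x31); // x2 y2 x3 y3 | x6 y6 x7 y7
    xs = _mm256_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0)); // x0 x1 x2 x3 | x4 x5 x6 x7
    ys = _mm256_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1)); // y0 y1 y2 y3 | y4 y5 y6 y7
}
#endif

// Squared residual of the 2x3 affine model F (row-major) for each point pair.
// Hypothetical helper, not an OpenCV function.
std::vector<float> affineErrors(const std::vector<Pt>& from,
                                const std::vector<Pt>& to,
                                const float F[6])
{
    assert(from.size() == to.size());
    const std::size_t count = from.size();
    std::vector<float> err(count);
    std::size_t i = 0;
#if defined(__AVX__)
    // Broadcast each model parameter into all 8 lanes.
    const __m256 F0 = _mm256_set1_ps(F[0]), F1 = _mm256_set1_ps(F[1]), F2 = _mm256_set1_ps(F[2]);
    const __m256 F3 = _mm256_set1_ps(F[3]), F4 = _mm256_set1_ps(F[4]), F5 = _mm256_set1_ps(F[5]);
    for (; i + 8 <= count; i += 8)
    {
        __m256 fx, fy, tx, ty;
        deinterleave8(&from[i].x, fx, fy);
        deinterleave8(&to[i].x, tx, ty);
        const __m256 a = _mm256_sub_ps(_mm256_add_ps(_mm256_add_ps(
            _mm256_mul_ps(F0, fx), _mm256_mul_ps(F1, fy)), F2), tx);
        const __m256 b = _mm256_sub_ps(_mm256_add_ps(_mm256_add_ps(
            _mm256_mul_ps(F3, fx), _mm256_mul_ps(F4, fy)), F5), ty);
        _mm256_storeu_ps(&err[i], _mm256_add_ps(_mm256_mul_ps(a, a), _mm256_mul_ps(b, b)));
    }
#endif
    // Scalar loop: handles the tail, or everything when AVX is not enabled.
    for (; i < count; ++i)
    {
        const float a = F[0] * from[i].x + F[1] * from[i].y + F[2] - to[i].x;
        const float b = F[3] * from[i].x + F[4] * from[i].y + F[5] - to[i].y;
        err[i] = a * a + b * b;
    }
    return err;
}
```

With gcc or clang, building with `-mavx` exercises the vector path; without it only the scalar loop is compiled. Unaligned loads and stores are used throughout to keep the sketch short.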
The LMEDS changes look great; sorting just for the median is insane. This is definitely for another PR. This might finally make it usable. I think nobody noticed just because there are no perf tests for `findHomography` and friends using LMEDS.
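To illustrate the point about sorting: a median only needs a selection, which `std::nth_element` does in linear time on average. This is a minimal sketch with a hypothetical helper name (`medianOf`); I am not claiming it is what the LMedS change does internally.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: median of the residuals without a full sort.
// The vector is taken by value because std::nth_element reorders it.
float medianOf(std::vector<float> v)
{
    assert(!v.empty());
    const std::size_t mid = v.size() / 2;
    // After this call v[mid] is the element a full sort would put there,
    // and everything before it is <= v[mid].
    std::nth_element(v.begin(), v.begin() + mid, v.end());
    const float upper = v[mid];
    if (v.size() % 2 == 1)
        return upper;
    // Even count: the lower middle value is the largest of the left partition.
    const float lower = *std::max_element(v.begin(), v.begin() + mid);
    return 0.5f * (lower + upper);
}
```

For an even number of residuals this averages the two middle values; an implementation may pick a different convention.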
Error computing using float seems OK: I ran a test and it didn't have any impact on precision, at least in my case. For the vectorized version, I think we should use universal intrinsics (`CV_SIMD128`) to also support NEON.
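A quick way to sanity-check the precision claim is to compute the same residual once with double intermediates and once with float intermediates, then compare. The sketch below assumes a row-major 2x3 model stored as 6 doubles; `errDouble`, `errFloat` and `relDiff` are hypothetical names, not OpenCV functions.

```cpp
#include <cassert>
#include <cmath>

// Squared residual with double intermediates, rounded to float only at the end.
float errDouble(const double F[6], float fx, float fy, float tx, float ty)
{
    const double a = F[0] * fx + F[1] * fy + F[2] - tx;
    const double b = F[3] * fx + F[4] * fy + F[5] - ty;
    return static_cast<float>(a * a + b * b);
}

// Same residual with the model converted to float up front.
float errFloat(const double F[6], float fx, float fy, float tx, float ty)
{
    const float F0 = static_cast<float>(F[0]), F1 = static_cast<float>(F[1]);
    const float F2 = static_cast<float>(F[2]), F3 = static_cast<float>(F[3]);
    const float F4 = static_cast<float>(F[4]), F5 = static_cast<float>(F[5]);
    const float a = F0 * fx + F1 * fy + F2 - tx;
    const float b = F3 * fx + F4 * fy + F5 - ty;
    return a * a + b * b;
}

// Relative difference between the two variants (0 when both are 0).
double relDiff(float x, float y)
{
    assert(std::isfinite(x) && std::isfinite(y));
    const double dx = x, dy = y;
    const double scale = std::fmax(std::fabs(dx), std::fabs(dy));
    return scale == 0.0 ? 0.0 : std::fabs(dx - dy) / scale;
}
```

The gap between the two grows when the residual is tiny compared to the coordinates, because the subtraction cancels most of the significant digits, so it is worth checking with point magnitudes typical for the use case.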
I was playing with the vectorized version and I can't reproduce your results. This is what I got for your version:
Geometric mean (all runs: calib3d, posix, x64)
Name of Test    86e6f89 20160815-165658    18b2e2d 20160815-171110    x-factor (18b2e2d 20160815-171110 vs 86e6f89 20160815-165658)
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0) 0.028 ms 0.029 ms 0.98
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10) 0.066 ms 0.070 ms 0.94
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0) 0.022 ms 0.023 ms 0.98
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10) 0.055 ms 0.055 ms 1.00
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0) 0.032 ms 0.036 ms 0.91
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10) 0.054 ms 0.054 ms 1.00
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0) 0.027 ms 0.028 ms 0.97
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10) 0.059 ms 0.064 ms 0.92
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0) 0.063 ms 0.064 ms 0.99
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10) 0.081 ms 0.086 ms 0.94
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0) 0.039 ms 0.040 ms 0.98
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10) 0.071 ms 0.072 ms 0.98
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 2.776 ms 2.786 ms 1.00
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 3.940 ms 3.947 ms 1.00
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 0.244 ms 0.284 ms 0.86
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 0.911 ms 0.914 ms 1.00
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 3.393 ms 3.400 ms 1.00
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 4.557 ms 4.701 ms 0.97
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 0.302 ms 0.351 ms 0.86
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 0.966 ms 1.033 ms 0.94
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 5.320 ms 5.320 ms 1.00
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 5.918 ms 5.905 ms 1.00
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 0.442 ms 0.492 ms 0.90
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 1.099 ms 1.121 ms 0.98
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 81.415 ms 81.188 ms 1.00
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 137.124 ms 139.951 ms 0.98
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 5.078 ms 5.587 ms 0.91
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 33.010 ms 34.542 ms 0.96
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 99.826 ms 99.712 ms 1.00
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 155.981 ms 158.494 ms 0.98
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 6.331 ms 6.866 ms 0.92
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 34.323 ms 35.480 ms 0.97
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 151.042 ms 150.811 ms 1.00
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 173.142 ms 173.321 ms 1.00
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 9.441 ms 10.335 ms 0.91
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 37.718 ms 40.089 ms 0.94
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0) 0.013 ms 0.013 ms 1.05
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10) 0.045 ms 0.044 ms 1.03
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0) 0.012 ms 0.010 ms 1.11
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10) 0.038 ms 0.040 ms 0.95
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0) 0.016 ms 0.017 ms 0.97
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10) 0.048 ms 0.048 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0) 0.013 ms 0.012 ms 1.06
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10) 0.040 ms 0.040 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0) 0.027 ms 0.027 ms 0.97
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10) 0.044 ms 0.045 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0) 0.017 ms 0.016 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10) 0.044 ms 0.044 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 1.242 ms 1.249 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 1.811 ms 1.890 ms 0.96
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 0.116 ms 0.125 ms 0.92
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 0.520 ms 0.495 ms 1.05
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 1.712 ms 1.704 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 2.265 ms 2.377 ms 0.95
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 0.139 ms 0.151 ms 0.92
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 0.544 ms 0.522 ms 1.04
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 2.764 ms 2.769 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 3.057 ms 3.093 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 0.190 ms 0.210 ms 0.91
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 0.595 ms 0.578 ms 1.03
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 35.952 ms 35.912 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 58.530 ms 58.241 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 2.346 ms 2.528 ms 0.93
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 12.419 ms 12.831 ms 0.97
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 48.274 ms 48.219 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 70.782 ms 70.575 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 2.852 ms 3.058 ms 0.93
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 12.931 ms 13.398 ms 0.97
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 77.203 ms 77.853 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 99.982 ms 99.386 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 3.953 ms 4.266 ms 0.93
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 14.084 ms 14.528 ms 0.97
which is not very convincing.
Maybe my compiler can vectorize the original loop better than we humans can? I'm using gcc 6.1.1. Or could this be my machine?
I have also implemented my own vectorized version (18b2e2de112f7780e604caa99b3ac6da685421ff), but that is slower as well.
I re-ran the perf test to double check and got the same results. I am running on a Mac with an Intel Core i5. I used gdb to disassemble the plain version and it was not using any vector instructions. Perhaps that's not enabled for me (or the compiler is choosing not to vectorize this loop for some reason).
So either the plain version is faster on your compiler so there is no difference for you, or the vector version isn't running properly for you. Is ENABLE_SSE2 set in CMake on your system? I tried putting a couple of printfs into the vector version, which verified that the aligned version was always being used for me.
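For anyone who wants to check which branch their buffers would take without adding printfs to the function itself, the alignment test can be pulled into a tiny helper. `isAligned16` is a hypothetical name; it mirrors the `& 0xf` test in the SSE version above.

```cpp
#include <cassert>
#include <cstdint>

// True when p sits on a 16-byte boundary, which _mm_load_ps() and
// _mm_store_ps() require. Hypothetical helper, not part of OpenCV.
inline bool isAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 0xf) == 0;
}
```

Calling it on the `from`, `to` and `errptr` pointers tells you whether the aligned loop or the `_mm_loadu_ps()` loop would run.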
Interesting. Can you also get some improvement with the 18b2e2de112f7780e604caa99b3ac6da685421ff version?
I will try to test on a different machine with a different compiler.
I ran your SIMD version and got a similar speed boost to what I got from the SSE version (up to 2x faster!). I also made a tweak to it (see the line comments) that helped another ~10% on some tests. If that version is no slower for you, then perhaps we can include it for those users who are seeing the benefit. It would be good to understand why you don't see any boost.
OK, I have also tested on a Xeon E7540 with the older gcc 4.9.3. Even the older gcc can vectorize the loop for me. I got roughly:
175: f2 41 0f 5a 5d 08 cvtsd2ss 0x8(%r13),%xmm3
17b: 66 45 0f ef c0 pxor %xmm8,%xmm8
180: f2 41 0f 5a 6d 10 cvtsd2ss 0x10(%r13),%xmm5
186: f2 41 0f 5a 75 18 cvtsd2ss 0x18(%r13),%xmm6
18c: f2 41 0f 5a 7d 20 cvtsd2ss 0x20(%r13),%xmm7
192: f2 45 0f 5a 45 28 cvtsd2ss 0x28(%r13),%xmm8
198: 0f 8e 6f 02 00 00 jle 40d <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x40d>
19e: 49 63 cc movslq %r12d,%rcx
1a1: 48 8d 04 cd 00 00 00 lea 0x0(,%rcx,8),%rax
1a8: 00
1a9: 48 8d 34 8a lea (%rdx,%rcx,4),%rsi
1ad: 48 8d 0c 03 lea (%rbx,%rax,1),%rcx
1b1: 48 39 f3 cmp %rsi,%rbx
1b4: 40 0f 93 c7 setae %dil
1b8: 48 39 ca cmp %rcx,%rdx
1bb: 0f 93 c1 setae %cl
1be: 09 f9 or %edi,%ecx
1c0: 48 8b bd 28 fe ff ff mov -0x1d8(%rbp),%rdi
1c7: 48 39 f7 cmp %rsi,%rdi
1ca: 40 0f 93 c6 setae %sil
1ce: 48 01 f8 add %rdi,%rax
1d1: 48 39 c2 cmp %rax,%rdx
1d4: 0f 93 c0 setae %al
1d7: 09 f0 or %esi,%eax
1d9: 84 c1 test %al,%cl
1db: 0f 84 a7 08 00 00 je a88 <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0xa88>
1e1: 41 83 fc 03 cmp $0x3,%r12d
1e5: 0f 86 9d 08 00 00 jbe a88 <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0xa88>
1eb: 0f 28 d7 movaps %xmm7,%xmm2
1ee: 41 0f 28 c8 movaps %xmm8,%xmm1
1f2: 41 8d 74 24 fc lea -0x4(%r12),%esi
1f7: 44 0f 28 f4 movaps %xmm4,%xmm14
1fb: 0f c6 d2 00 shufps $0x0,%xmm2,%xmm2
1ff: 4c 8b 95 28 fe ff ff mov -0x1d8(%rbp),%r10
206: 0f c6 c9 00 shufps $0x0,%xmm1,%xmm1
20a: c1 ee 02 shr $0x2,%esi
20d: 44 0f 28 eb movaps %xmm3,%xmm13
211: 83 c6 01 add $0x1,%esi
214: 44 0f 28 e5 movaps %xmm5,%xmm12
218: 8d 0c b5 00 00 00 00 lea 0x0(,%rsi,4),%ecx
21f: 44 0f 28 de movaps %xmm6,%xmm11
223: 31 c0 xor %eax,%eax
225: 45 0f c6 f6 00 shufps $0x0,%xmm14,%xmm14
22a: 31 ff xor %edi,%edi
22c: 45 0f c6 ed 00 shufps $0x0,%xmm13,%xmm13
231: 45 0f c6 e4 00 shufps $0x0,%xmm12,%xmm12
236: 45 0f c6 db 00 shufps $0x0,%xmm11,%xmm11
23b: 0f 29 95 f0 fd ff ff movaps %xmm2,-0x210(%rbp)
242: 0f 29 8d 00 fe ff ff movaps %xmm1,-0x200(%rbp)
249: 83 c7 01 add $0x1,%edi
24c: 0f 10 4c 43 10 movups 0x10(%rbx,%rax,2),%xmm1
251: 0f 10 04 43 movups (%rbx,%rax,2),%xmm0
255: 44 0f 28 c8 movaps %xmm0,%xmm9
259: 0f c6 c1 dd shufps $0xdd,%xmm1,%xmm0
25d: 41 0f 10 14 42 movups (%r10,%rax,2),%xmm2
262: 44 0f c6 c9 88 shufps $0x88,%xmm1,%xmm9
267: 45 0f 10 54 42 10 movups 0x10(%r10,%rax,2),%xmm10
26d: 0f 28 c8 movaps %xmm0,%xmm1
270: 0f 59 85 f0 fd ff ff mulps -0x210(%rbp),%xmm0
277: 45 0f 28 f9 movaps %xmm9,%xmm15
27b: 45 0f 59 cb mulps %xmm11,%xmm9
27f: 45 0f 59 fe mulps %xmm14,%xmm15
283: 41 0f 59 cd mulps %xmm13,%xmm1
287: 44 0f 58 c8 addps %xmm0,%xmm9
28b: 41 0f 58 cf addps %xmm15,%xmm1
28f: 44 0f 28 fa movaps %xmm2,%xmm15
293: 44 0f 58 8d 00 fe ff addps -0x200(%rbp),%xmm9
29a: ff
29b: 41 0f c6 d2 dd shufps $0xdd,%xmm10,%xmm2
2a0: 45 0f c6 fa 88 shufps $0x88,%xmm10,%xmm15
2a5: 41 0f 58 cc addps %xmm12,%xmm1
2a9: 44 0f 5c ca subps %xmm2,%xmm9
2ad: 41 0f 5c cf subps %xmm15,%xmm1
2b1: 45 0f 59 c9 mulps %xmm9,%xmm9
2b5: 0f 59 c9 mulps %xmm1,%xmm1
2b8: 41 0f 58 c9 addps %xmm9,%xmm1
2bc: 0f 11 0c 02 movups %xmm1,(%rdx,%rax,1)
2c0: 48 83 c0 10 add $0x10,%rax
2c4: 39 fe cmp %edi,%esi
2c6: 77 81 ja 249 <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x249>
2c8: 41 39 cc cmp %ecx,%r12d
2cb: 0f 84 3c 01 00 00 je 40d <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x40d>
2d1: 48 63 f1 movslq %ecx,%rsi
2d4: 4c 8b 9d 28 fe ff ff mov -0x1d8(%rbp),%r11
2db: 48 8d 04 f5 00 00 00 lea 0x0(,%rsi,8),%rax
2e2: 00
2e3: 48 8d 3c 03 lea (%rbx,%rax,1),%rdi
2e7: 4c 01 d8 add %r11,%rax
2ea: f3 0f 10 07 movss (%rdi),%xmm0
2ee: f3 0f 10 4f 04 movss 0x4(%rdi),%xmm1
2f3: 44 0f 28 c8 movaps %xmm0,%xmm9
2f7: 0f 28 d1 movaps %xmm1,%xmm2
2fa: f3 0f 59 c6 mulss %xmm6,%xmm0
2fe: f3 0f 59 cf mulss %xmm7,%xmm1
302: f3 44 0f 59 cc mulss %xmm4,%xmm9
307: f3 0f 59 d3 mulss %xmm3,%xmm2
30b: f3 0f 58 c8 addss %xmm0,%xmm1
30f: f3 41 0f 58 d1 addss %xmm9,%xmm2
314: f3 41 0f 58 c8 addss %xmm8,%xmm1
319: f3 0f 58 d5 addss %xmm5,%xmm2
31d: 0f 28 c1 movaps %xmm1,%xmm0
320: f3 0f 5c 10 subss (%rax),%xmm2
324: f3 0f 5c 40 04 subss 0x4(%rax),%xmm0
329: f3 0f 59 d2 mulss %xmm2,%xmm2
32d: f3 0f 59 c0 mulss %xmm0,%xmm0
331: f3 0f 58 c2 addss %xmm2,%xmm0
335: f3 0f 11 04 b2 movss %xmm0,(%rdx,%rsi,4)
33a: 8d 71 01 lea 0x1(%rcx),%esi
33d: 41 39 f4 cmp %esi,%r12d
340: 0f 8e c7 00 00 00 jle 40d <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x40d>
346: 48 63 f6 movslq %esi,%rsi
349: 83 c1 02 add $0x2,%ecx
34c: 48 8d 04 f5 00 00 00 lea 0x0(,%rsi,8),%rax
353: 00
354: 48 8d 3c 03 lea (%rbx,%rax,1),%rdi
358: 4c 01 d8 add %r11,%rax
35b: 41 39 cc cmp %ecx,%r12d
35e: f3 0f 10 07 movss (%rdi),%xmm0
362: f3 0f 10 4f 04 movss 0x4(%rdi),%xmm1
367: 44 0f 28 c8 movaps %xmm0,%xmm9
36b: 0f 28 d1 movaps %xmm1,%xmm2
36e: f3 0f 59 c6 mulss %xmm6,%xmm0
372: f3 0f 59 cf mulss %xmm7,%xmm1
376: f3 44 0f 59 cc mulss %xmm4,%xmm9
37b: f3 0f 59 d3 mulss %xmm3,%xmm2
37f: f3 0f 58 c8 addss %xmm0,%xmm1
383: f3 41 0f 58 d1 addss %xmm9,%xmm2
388: f3 41 0f 58 c8 addss %xmm8,%xmm1
38d: f3 0f 58 d5 addss %xmm5,%xmm2
391: f3 0f 5c 48 04 subss 0x4(%rax),%xmm1
396: f3 0f 5c 10 subss (%rax),%xmm2
39a: 0f 28 c1 movaps %xmm1,%xmm0
39d: f3 0f 59 d2 mulss %xmm2,%xmm2
3a1: f3 0f 59 c1 mulss %xmm1,%xmm0
3a5: f3 0f 58 c2 addss %xmm2,%xmm0
3a9: f3 0f 11 04 b2 movss %xmm0,(%rdx,%rsi,4)
3ae: 7e 5d jle 40d <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x40d>
3b0: 48 63 c9 movslq %ecx,%rcx
3b3: 48 8d 04 cd 00 00 00 lea 0x0(,%rcx,8),%rax
3ba: 00
3bb: 48 01 c3 add %rax,%rbx
3be: 48 03 85 28 fe ff ff add -0x1d8(%rbp),%rax
3c5: f3 0f 10 0b movss (%rbx),%xmm1
3c9: f3 0f 10 43 04 movss 0x4(%rbx),%xmm0
3ce: f3 0f 59 e1 mulss %xmm1,%xmm4
3d2: f3 0f 59 d8 mulss %xmm0,%xmm3
3d6: f3 0f 59 f1 mulss %xmm1,%xmm6
3da: f3 0f 59 f8 mulss %xmm0,%xmm7
3de: f3 0f 58 dc addss %xmm4,%xmm3
3e2: f3 0f 58 fe addss %xmm6,%xmm7
3e6: f3 0f 58 eb addss %xmm3,%xmm5
3ea: f3 44 0f 58 c7 addss %xmm7,%xmm8
3ef: f3 0f 5c 28 subss (%rax),%xmm5
3f3: f3 44 0f 5c 40 04 subss 0x4(%rax),%xmm8
3f9: f3 0f 59 ed mulss %xmm5,%xmm5
3fd: f3 45 0f 59 c0 mulss %xmm8,%xmm8
402: f3 44 0f 58 c5 addss %xmm5,%xmm8
407: f3 44 0f 11 04 8a movss %xmm8,(%rdx,%rcx,4)
for our function. It uses only SSE, but IMHO it did quite a good job and unrolled the loop quite aggressively. Even we can't beat it:
Geometric mean (all runs: calib3d, posix, x64)
Name of Test    18b2e2d 20160815-201729    d56de49 20160815-204020    x-factor (d56de49 20160815-204020 vs 18b2e2d 20160815-201729)
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0) 0.063 ms 0.064 ms 0.99
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10) 0.136 ms 0.140 ms 0.97
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0) 0.041 ms 0.042 ms 0.97
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10) 0.106 ms 0.109 ms 0.98
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0) 0.076 ms 0.077 ms 0.99
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10) 0.116 ms 0.118 ms 0.98
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0) 0.051 ms 0.052 ms 0.97
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10) 0.115 ms 0.119 ms 0.97
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0) 0.122 ms 0.121 ms 1.00
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10) 0.161 ms 0.164 ms 0.98
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0) 0.074 ms 0.077 ms 0.96
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10) 0.139 ms 0.143 ms 0.97
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 4.610 ms 4.644 ms 0.99
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 6.933 ms 6.949 ms 1.00
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 0.755 ms 0.787 ms 0.96
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 2.192 ms 2.226 ms 0.98
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 5.703 ms 5.583 ms 1.02
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 7.971 ms 7.961 ms 1.00
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 0.953 ms 0.987 ms 0.97
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 2.373 ms 2.411 ms 0.98
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 8.842 ms 8.779 ms 1.01
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 9.989 ms 9.963 ms 1.00
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 1.445 ms 1.495 ms 0.97
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 2.852 ms 2.931 ms 0.97
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 132.902 ms 131.835 ms 1.01
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 196.302 ms 196.587 ms 1.00
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 14.889 ms 15.431 ms 0.96
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 45.945 ms 46.190 ms 0.99
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 162.923 ms 162.030 ms 1.01
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 227.114 ms 226.551 ms 1.00
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 18.786 ms 19.563 ms 0.96
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 49.712 ms 50.631 ms 0.98
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 247.388 ms 245.191 ms 1.01
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 270.449 ms 267.962 ms 1.01
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 28.617 ms 29.760 ms 0.96
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 59.735 ms 61.002 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0) 0.034 ms 0.034 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10) 0.095 ms 0.097 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0) 0.019 ms 0.019 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10) 0.073 ms 0.076 ms 0.96
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0) 0.043 ms 0.042 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10) 0.105 ms 0.107 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0) 0.023 ms 0.023 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10) 0.076 ms 0.079 ms 0.96
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0) 0.065 ms 0.065 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10) 0.098 ms 0.098 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0) 0.031 ms 0.031 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10) 0.085 ms 0.087 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 2.050 ms 2.046 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 3.307 ms 3.242 ms 1.02
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 0.323 ms 0.337 ms 0.96
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 1.140 ms 1.143 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 2.794 ms 2.818 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 4.080 ms 3.965 ms 1.03
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 0.404 ms 0.418 ms 0.97
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 1.203 ms 1.191 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 4.566 ms 4.530 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 5.180 ms 5.149 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 0.578 ms 0.604 ms 0.96
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 1.383 ms 1.424 ms 0.97
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 58.952 ms 58.513 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 84.726 ms 84.658 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 6.283 ms 6.559 ms 0.96
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 19.711 ms 19.652 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 79.084 ms 78.624 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 104.451 ms 104.073 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 7.823 ms 8.225 ms 0.95
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 20.907 ms 21.467 ms 0.97
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 126.398 ms 125.661 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 152.708 ms 151.957 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 11.366 ms 11.774 ms 0.97
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 24.299 ms 25.473 ms 0.95
But on the other hand we are at least not slower.
I'm not against including this, if you can confirm the speedup. I think we can call it optimal if we match the speed of gcc :) It can also run on NEON, where compilers didn't use to be that good, but I don't have an ARM to test it.
BTW what compiler are you using?
I have also tested on my laptop with gcc 6.1.1 on a Core i5-2520M, and we are ~15% slower than gcc in the cases where error computation matters most (RANSAC without refinement):
Geometric mean
Name of Test | calib3d posix x64 150daa2 20160815-214220 | calib3d posix x64 150daa2 20160815-213642 | calib3d posix x64 150daa2 20160815-213642 vs calib3d posix x64 150daa2 20160815-214220 (x-factor)
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0) 0.027 ms 0.026 ms 1.02
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10) 0.066 ms 0.068 ms 0.97
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0) 0.022 ms 0.022 ms 1.01
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10) 0.055 ms 0.056 ms 0.98
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0) 0.033 ms 0.032 ms 1.02
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10) 0.054 ms 0.055 ms 0.98
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0) 0.026 ms 0.026 ms 0.99
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10) 0.059 ms 0.060 ms 0.98
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0) 0.057 ms 0.060 ms 0.96
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10) 0.080 ms 0.084 ms 0.96
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0) 0.037 ms 0.039 ms 0.95
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10) 0.070 ms 0.073 ms 0.97
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 2.786 ms 2.712 ms 1.03
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 3.920 ms 3.888 ms 1.01
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 0.242 ms 0.298 ms 0.81
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 0.916 ms 0.942 ms 0.97
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 3.403 ms 3.313 ms 1.03
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 4.518 ms 4.474 ms 1.01
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 0.300 ms 0.364 ms 0.83
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 0.966 ms 1.006 ms 0.96
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 5.325 ms 5.215 ms 1.02
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 5.917 ms 5.796 ms 1.02
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 0.443 ms 0.540 ms 0.82
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 1.111 ms 1.188 ms 0.94
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 81.191 ms 78.670 ms 1.03
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 137.046 ms 146.944 ms 0.93
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 5.141 ms 6.478 ms 0.79
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 33.305 ms 37.155 ms 0.90
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 100.009 ms 101.746 ms 0.98
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 155.763 ms 155.185 ms 1.00
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 6.407 ms 7.555 ms 0.85
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 34.567 ms 36.710 ms 0.94
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 150.923 ms 146.320 ms 1.03
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 172.952 ms 169.110 ms 1.02
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 9.502 ms 11.193 ms 0.85
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 37.766 ms 40.142 ms 0.94
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0) 0.013 ms 0.013 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10) 0.045 ms 0.048 ms 0.93
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0) 0.010 ms 0.010 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10) 0.040 ms 0.040 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0) 0.017 ms 0.016 ms 1.06
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10) 0.050 ms 0.054 ms 0.91
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0) 0.012 ms 0.012 ms 0.97
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10) 0.041 ms 0.043 ms 0.95
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0) 0.028 ms 0.028 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10) 0.045 ms 0.046 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0) 0.016 ms 0.016 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10) 0.045 ms 0.047 ms 0.96
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 1.284 ms 1.286 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 1.950 ms 1.885 ms 1.03
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 0.124 ms 0.145 ms 0.86
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 0.527 ms 0.551 ms 0.96
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 1.692 ms 1.753 ms 0.97
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 2.284 ms 2.349 ms 0.97
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 0.148 ms 0.173 ms 0.86
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 0.551 ms 0.601 ms 0.92
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 2.751 ms 2.822 ms 0.97
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 3.041 ms 3.162 ms 0.96
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 0.190 ms 0.229 ms 0.83
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 0.603 ms 0.656 ms 0.92
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 35.884 ms 35.357 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 58.332 ms 57.782 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 2.384 ms 2.718 ms 0.88
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 12.464 ms 12.585 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 48.371 ms 48.015 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 70.765 ms 72.106 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 2.879 ms 3.361 ms 0.86
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 13.251 ms 13.377 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 77.060 ms 76.753 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 99.427 ms 97.301 ms 1.02
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 4.529 ms 4.601 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 14.308 ms 14.395 ms 0.99
So I'm not sure whether we should include the vectorized version or not. I have rebased all changes into 150daa2dc57a258ba61a01e12901518b6b4d98e8.
I think I will wait for the opinions of @prclibo and @alalek on this.
I agree. We should not include this if it is slower in any common case.
My tests were with LLVM 7.3.0 (clang-703.0.29). I will investigate to see if there is some reason why vectorization isn't enabled in the OpenCV build settings. I think there is also a debug mode that will tell you why a loop wasn't vectorized.
Others with more experience with vectorization may be able to shed light on the best approach (possibly do nothing and leave it to the compiler).
BTW, I have a few other performance ideas:
Merging computeError() into findInliers() could avoid iterating over all of the points a second time and eliminate the need to write to and then read from the err array. But it reduces the separation between the functions, which adds complexity. This would help RANSAC (for large numbers of points), but not LMedS. I think it could be worth measuring to see if it is significant.
findInliers() (currently it only calls it once at the end). It would be great to get input from someone more familiar with LMedS to make sure there isn't a flaw with this idea.
runKernel() with all of the points (like findHomography() does). Or, adding this step could reduce the number of steps that LM takes to converge.
Hey @hrnr @mself, thanks for the improvement. The discussion is insightful :)
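For reference, the idea of folding computeError() into findInliers() could look roughly like the sketch below. It is not the OpenCV implementation: `Affine2x3`, `sqError`, `countInliersTwoPass` and `countInliersFused` are hypothetical names, points are assumed to be stored interleaved as x0, y0, x1, y1, ..., and the threshold is assumed to be compared against the squared error.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 2x3 affine model, row-major: [a0 a1 a2; a3 a4 a5].
struct Affine2x3 { float a[6]; };

// Squared reprojection error of point i (points interleaved as x0,y0,x1,y1,...).
inline float sqError(const Affine2x3& M, const float* from, const float* to, std::size_t i) {
    const float x = from[2 * i], y = from[2 * i + 1];
    const float dx = M.a[0] * x + M.a[1] * y + M.a[2] - to[2 * i];
    const float dy = M.a[3] * x + M.a[4] * y + M.a[5] - to[2 * i + 1];
    return dx * dx + dy * dy;
}

// Two passes: fill a temporary err[] array, then threshold it.
int countInliersTwoPass(const Affine2x3& M, const std::vector<float>& from,
                        const std::vector<float>& to, float thresh,
                        std::vector<unsigned char>& mask) {
    assert(from.size() == to.size() && from.size() % 2 == 0);
    const std::size_t n = from.size() / 2;
    std::vector<float> err(n);
    for (std::size_t i = 0; i < n; ++i)
        err[i] = sqError(M, from.data(), to.data(), i);

    const float t = thresh * thresh;
    mask.assign(n, 0);
    int count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        mask[i] = err[i] <= t ? 1 : 0;
        count += mask[i];
    }
    return count;
}

// One pass: the error is thresholded as soon as it is computed,
// so the err[] array is never written or re-read.
int countInliersFused(const Affine2x3& M, const std::vector<float>& from,
                      const std::vector<float>& to, float thresh,
                      std::vector<unsigned char>& mask) {
    assert(from.size() == to.size() && from.size() % 2 == 0);
    const std::size_t n = from.size() / 2;
    const float t = thresh * thresh;
    mask.assign(n, 0);
    int count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        mask[i] = sqError(M, from.data(), to.data(), i) <= t ? 1 : 0;
        count += mask[i];
    }
    return count;
}
```

The fused variant trades the clean callback boundary for one fewer pass over memory, which is the complexity cost mentioned above. Whether it is measurably faster would still need the perf tests.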
@prclibo What do you think about the manually vectorized version of computeError()? Should I include it?
@hrnr Honestly, I do not know much about SSE optimization =(. My personal opinion: it is good to have optimized, fast code. But if the optimization mechanism is not fully understood, it is also fine to keep the implementation as simple as it is.
It looks like clang isn't able to vectorize loops like the one in Affine2DEstimatorCallback::computeError(). I tried the following simple loop, which has a similar interleaved access pattern.
typedef struct {
    float x;
    float y;
} point;

void bar (const point *a, const point *b, float *c, int n)
{
    for (int i = 0; i < n; i++) {
        c[i] = (a[i].x * b[i].x) + (a[i].y * b[i].y);
    }
}
When compiled with -Rpass-analysis=loop-vectorize, I get the remarks:
test.cpp:51:22: remark: the cost-model indicates that vectorization is not beneficial
[-Rpass-analysis=loop-vectorize]
c[i] = (a[i].x * b[i].x) + (a[i].y * b[i].y);
^
test.cpp:51:22: remark: the cost-model indicates that interleaving is not beneficial
[-Rpass-analysis=loop-vectorize]
I also tried using a #pragma to force clang to vectorize the loop, but the results were very poor. It generated lots of non-vectorized instructions along with a few vectorized ones and a lot of shuffle instructions.
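The comment does not say which pragma was used. As an illustration only, one pragma that clang documents for this purpose is `#pragma clang loop vectorize(enable)`; the sketch below applies it to the same loop. The function name `bar_forced` is hypothetical, and the pragma is guarded so other compilers simply build the plain loop.

```cpp
#include <cassert>

struct point {
    float x;
    float y;
};

// Same interleaved dot-product loop as above, with a vectorization hint.
// The hint overrides clang's cost model; it does not guarantee good code.
void bar_forced(const point* a, const point* b, float* c, int n)
{
    assert(n >= 0);
#if defined(__clang__)
#pragma clang loop vectorize(enable) interleave(enable)
#endif
    for (int i = 0; i < n; i++) {
        c[i] = (a[i].x * b[i].x) + (a[i].y * b[i].y);
    }
}
```

Compiling this with -Rpass=loop-vectorize should show whether clang actually vectorized the loop, and the generated assembly shows how many shuffles the interleaved loads cost.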
So one option would be to include the manually vectorized code but only enable it for clang. Another option would be for me to switch to GCC when compiling OpenCV :-)
I'd like to include a vectorized version for clang, but I wasn't happy that the universal intrinsics were missing the 2-channel v_load_deinterleave(), which was resulting in slower code. So I added the 2-channel float version for SSE and NEON (67d632c9464bbdfa072fe4963193a60f90c3ab48)!
I then updated the vectorized version to use it (ef663f4fb35dbe023f69496ee03fdd13ff2e2b5e), and I get slightly better performance numbers now (and the vectorized code is much simpler). The generated code looks the same as what you showed from gcc 4.9.3, so I am hopeful that this version will be no slower than any of the auto-vectorized versions.
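To make the 2-channel deinterleave concrete, here is a minimal sketch of the error loop written with raw SSE intrinsics. It is not the code from the commits above and does not use OpenCV's universal intrinsics. The names `affineErrorScalar` and `affineErrors` are hypothetical, the model is assumed to be a row-major 2x3 matrix, and points are assumed to be interleaved as x0, y0, x1, y1, .... On targets without SSE only the scalar loop is compiled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

// Scalar reference: squared error of one point pair under the affine map M.
inline float affineErrorScalar(const float* M, const float* from, const float* to) {
    const float dx = M[0] * from[0] + M[1] * from[1] + M[2] - to[0];
    const float dy = M[3] * from[0] + M[4] * from[1] + M[5] - to[1];
    return dx * dx + dy * dy;
}

// Computes err[i] for n interleaved point pairs, four points per SIMD step.
void affineErrors(const float* M, const float* from, const float* to,
                  float* err, std::size_t n) {
    assert(M && from && to && err);
    std::size_t i = 0;
#if SKETCH_HAVE_SSE
    const __m128 m0 = _mm_set1_ps(M[0]), m1 = _mm_set1_ps(M[1]), m2 = _mm_set1_ps(M[2]);
    const __m128 m3 = _mm_set1_ps(M[3]), m4 = _mm_set1_ps(M[4]), m5 = _mm_set1_ps(M[5]);
    for (; i + 4 <= n; i += 4) {
        // Load 4 points (8 floats) and split them into an x vector and a y vector.
        const __m128 f01 = _mm_loadu_ps(from + 2 * i);      // x0 y0 x1 y1
        const __m128 f23 = _mm_loadu_ps(from + 2 * i + 4);  // x2 y2 x3 y3
        const __m128 fx = _mm_shuffle_ps(f01, f23, _MM_SHUFFLE(2, 0, 2, 0));
        const __m128 fy = _mm_shuffle_ps(f01, f23, _MM_SHUFFLE(3, 1, 3, 1));
        const __m128 t01 = _mm_loadu_ps(to + 2 * i);
        const __m128 t23 = _mm_loadu_ps(to + 2 * i + 4);
        const __m128 tx = _mm_shuffle_ps(t01, t23, _MM_SHUFFLE(2, 0, 2, 0));
        const __m128 ty = _mm_shuffle_ps(t01, t23, _MM_SHUFFLE(3, 1, 3, 1));

        const __m128 dx = _mm_sub_ps(
            _mm_add_ps(_mm_add_ps(_mm_mul_ps(m0, fx), _mm_mul_ps(m1, fy)), m2), tx);
        const __m128 dy = _mm_sub_ps(
            _mm_add_ps(_mm_add_ps(_mm_mul_ps(m3, fx), _mm_mul_ps(m4, fy)), m5), ty);
        _mm_storeu_ps(err + i, _mm_add_ps(_mm_mul_ps(dx, dx), _mm_mul_ps(dy, dy)));
    }
#endif
    // Scalar tail (and the whole loop when SSE is unavailable).
    for (; i < n; ++i)
        err[i] = affineErrorScalar(M, from + 2 * i, to + 2 * i);
}
```

Two unaligned loads plus two shuffles per input array is the entire cost of the deinterleave here, which is why a dedicated 2-channel load keeps the vectorized loop short.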
Here are the perf results:
Geometric mean
Name of Test | calib3d posix x64 5cde391 20160817-nosimd | calib3d posix x64 5cde391 20160817-simd | calib3d posix x64 5cde391 20160817-simd vs calib3d posix x64 5cde391 20160817-nosimd (x-factor) | calib3d posix x64 5cde391 20160817-simd vs calib3d posix x64 5cde391 20160817-nosimd (score)
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0) 0.023 ms 0.022 ms 1.06 faster
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10) 0.053 ms 0.054 ms 0.98
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0) 0.024 ms 0.020 ms 1.19 FASTER
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10) 0.049 ms 0.047 ms 1.04 faster
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0) 0.029 ms 0.027 ms 1.07 faster
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10) 0.045 ms 0.046 ms 0.98
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0) 0.030 ms 0.025 ms 1.19 FASTER
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10) 0.054 ms 0.053 ms 1.01
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0) 0.052 ms 0.049 ms 1.04
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10) 0.068 ms 0.069 ms 0.99
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0) 0.043 ms 0.036 ms 1.22 FASTER
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10) 0.069 ms 0.061 ms 1.12 faster
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 2.369 ms 2.292 ms 1.03
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 3.329 ms 3.266 ms 1.02
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 0.399 ms 0.200 ms 1.99 FASTER
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 0.976 ms 0.761 ms 1.28 FASTER
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 2.882 ms 2.778 ms 1.04
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 3.843 ms 3.741 ms 1.03
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 0.499 ms 0.246 ms 2.02 FASTER
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 1.063 ms 0.799 ms 1.33 FASTER
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 4.419 ms 4.326 ms 1.02
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 4.916 ms 4.733 ms 1.04
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 0.755 ms 0.368 ms 2.05 FASTER
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 1.299 ms 0.912 ms 1.42 FASTER
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 69.528 ms 67.853 ms 1.02
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 119.832 ms 116.765 ms 1.03
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 7.546 ms 3.920 ms 1.92 FASTER
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 21.852 ms 19.590 ms 1.12 faster
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 85.240 ms 85.687 ms 0.99
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 138.463 ms 136.450 ms 1.01
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 9.314 ms 4.879 ms 1.91 FASTER
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 23.353 ms 20.709 ms 1.13 faster
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 128.292 ms 131.003 ms 0.98
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 142.334 ms 148.350 ms 0.96
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 13.972 ms 6.886 ms 2.03 FASTER
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 28.126 ms 22.698 ms 1.24 faster
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0) 0.012 ms 0.011 ms 1.08 faster
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10) 0.034 ms 0.036 ms 0.96
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0) 0.011 ms 0.010 ms 1.14 FASTER
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10) 0.031 ms 0.031 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0) 0.015 ms 0.014 ms 1.04
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10) 0.039 ms 0.039 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0) 0.014 ms 0.012 ms 1.18 FASTER
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10) 0.034 ms 0.033 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0) 0.023 ms 0.021 ms 1.09 faster
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10) 0.035 ms 0.035 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0) 0.019 ms 0.015 ms 1.20 FASTER
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10) 0.038 ms 0.037 ms 1.05 faster
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 1.097 ms 1.081 ms 1.02
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 1.586 ms 1.566 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 0.168 ms 0.098 ms 1.71 FASTER
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 0.523 ms 0.450 ms 1.16 faster
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 1.468 ms 1.453 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 1.966 ms 1.918 ms 1.02
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 0.211 ms 0.113 ms 1.87 FASTER
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 0.561 ms 0.468 ms 1.20 FASTER
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 2.286 ms 2.314 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 2.551 ms 2.508 ms 1.02
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 0.301 ms 0.156 ms 1.92 FASTER
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 0.654 ms 0.510 ms 1.28 FASTER
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 29.840 ms 30.123 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 43.798 ms 44.995 ms 0.97
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 3.211 ms 1.844 ms 1.74 FASTER
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 9.677 ms 8.013 ms 1.21 FASTER
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 40.414 ms 40.970 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 54.711 ms 56.088 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 3.990 ms 2.239 ms 1.78 FASTER
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 10.130 ms 8.520 ms 1.19 faster
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 64.602 ms 65.833 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 80.721 ms 78.250 ms 1.03
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 5.635 ms 2.982 ms 1.89 FASTER
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 11.751 ms 9.224 ms 1.27 FASTER
Thanks for the 2-channel deinterleave. It was on my TODO list. Looks nice.
I will test your version with 2-channel deinterleave on GCC 6.
I have taken a look at what GCC 6 produces, and it is not too different from the GCC 4.9 version. It uses the same vectorization approach. However, the instruction ordering is different: GCC 6 prefers to place movaps before the movss memory loads in each iteration. The math is the same, but GCC 6 was able to save one movaps, which was blocking the second subss in some cases, by handling temporary results better. A beautiful job from a compiler!
1d7: 0f 86 03 07 00 00 jbe 8e0 <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x8e0>
1dd: 0f 28 cf movaps %xmm7,%xmm1
1e0: 41 8d 74 24 fc lea -0x4(%r12),%esi
1e5: 4c 8b 95 38 fe ff ff mov -0x1c8(%rbp),%r10
1ec: 44 0f 28 f4 movaps %xmm4,%xmm14
1f0: 31 c0 xor %eax,%eax
1f2: 0f c6 c9 00 shufps $0x0,%xmm1,%xmm1
1f6: c1 ee 02 shr $0x2,%esi
1f9: 44 0f 28 eb movaps %xmm3,%xmm13
1fd: 83 c6 01 add $0x1,%esi
200: 44 0f 28 e5 movaps %xmm5,%xmm12
204: 8d 14 b5 00 00 00 00 lea 0x0(,%rsi,4),%edx
20b: 0f 29 8d 00 fe ff ff movaps %xmm1,-0x200(%rbp)
212: 41 0f 28 c8 movaps %xmm8,%xmm1
216: 31 ff xor %edi,%edi
218: 44 0f 28 de movaps %xmm6,%xmm11
21c: 0f c6 c9 00 shufps $0x0,%xmm1,%xmm1
220: 45 0f c6 f6 00 shufps $0x0,%xmm14,%xmm14
225: 45 0f c6 ed 00 shufps $0x0,%xmm13,%xmm13
22a: 45 0f c6 e4 00 shufps $0x0,%xmm12,%xmm12
22f: 45 0f c6 db 00 shufps $0x0,%xmm11,%xmm11
234: 0f 29 8d 20 fe ff ff movaps %xmm1,-0x1e0(%rbp)
23b: 45 0f 28 fe movaps %xmm14,%xmm15
23f: 83 c7 01 add $0x1,%edi
242: 0f 10 04 43 movups (%rbx,%rax,2),%xmm0
246: 0f 10 4c 43 10 movups 0x10(%rbx,%rax,2),%xmm1
24b: 44 0f 28 c8 movaps %xmm0,%xmm9
24f: 0f c6 c1 dd shufps $0xdd,%xmm1,%xmm0
253: 41 0f 10 14 42 movups (%r10,%rax,2),%xmm2
258: 44 0f c6 c9 88 shufps $0x88,%xmm1,%xmm9
25d: 41 0f 28 cd movaps %xmm13,%xmm1
261: 45 0f 10 54 42 10 movups 0x10(%r10,%rax,2),%xmm10
267: 0f 59 c8 mulps %xmm0,%xmm1
26a: 0f 59 85 00 fe ff ff mulps -0x200(%rbp),%xmm0
271: 45 0f 59 f9 mulps %xmm9,%xmm15
275: 45 0f 59 cb mulps %xmm11,%xmm9
279: 41 0f 58 cf addps %xmm15,%xmm1
27d: 44 0f 28 fa movaps %xmm2,%xmm15
281: 41 0f 58 c1 addps %xmm9,%xmm0
285: 41 0f c6 d2 dd shufps $0xdd,%xmm10,%xmm2
28a: 45 0f c6 fa 88 shufps $0x88,%xmm10,%xmm15
28f: 41 0f 58 cc addps %xmm12,%xmm1
293: 0f 58 85 20 fe ff ff addps -0x1e0(%rbp),%xmm0
29a: 41 0f 5c cf subps %xmm15,%xmm1
29e: 0f 5c c2 subps %xmm2,%xmm0
2a1: 0f 59 c9 mulps %xmm1,%xmm1
2a4: 0f 59 c0 mulps %xmm0,%xmm0
2a7: 0f 58 c8 addps %xmm0,%xmm1
2aa: 0f 11 0c 01 movups %xmm1,(%rcx,%rax,1)
2ae: 48 83 c0 10 add $0x10,%rax
2b2: 39 f7 cmp %esi,%edi
2b4: 72 85 jb 23b <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x23b>
2b6: 44 39 e2 cmp %r12d,%edx
2b9: 0f 84 2e 01 00 00 je 3ed <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x3ed>
2bf: 48 63 f2 movslq %edx,%rsi
2c2: 0f 28 d3 movaps %xmm3,%xmm2
2c5: 48 8d 04 f5 00 00 00 lea 0x0(,%rsi,8),%rax
2cc: 00
2cd: 44 0f 28 cc movaps %xmm4,%xmm9
2d1: 48 8d 3c 03 lea (%rbx,%rax,1),%rdi
2d5: 4c 01 d0 add %r10,%rax
2d8: f3 0f 10 0f movss (%rdi),%xmm1
2dc: f3 0f 10 47 04 movss 0x4(%rdi),%xmm0
2e1: f3 44 0f 59 c9 mulss %xmm1,%xmm9
2e6: f3 0f 59 d0 mulss %xmm0,%xmm2
2ea: f3 0f 59 ce mulss %xmm6,%xmm1
2ee: f3 0f 59 c7 mulss %xmm7,%xmm0
2f2: f3 41 0f 58 d1 addss %xmm9,%xmm2
2f7: f3 0f 58 c1 addss %xmm1,%xmm0
2fb: f3 0f 58 d5 addss %xmm5,%xmm2
2ff: f3 41 0f 58 c0 addss %xmm8,%xmm0
304: f3 0f 5c 10 subss (%rax),%xmm2
308: f3 0f 5c 40 04 subss 0x4(%rax),%xmm0
30d: 8d 42 01 lea 0x1(%rdx),%eax
310: 41 39 c4 cmp %eax,%r12d
313: f3 0f 59 d2 mulss %xmm2,%xmm2
317: f3 0f 59 c0 mulss %xmm0,%xmm0
31b: f3 0f 58 c2 addss %xmm2,%xmm0
31f: f3 0f 11 04 b1 movss %xmm0,(%rcx,%rsi,4)
324: 0f 8e c3 00 00 00 jle 3ed <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x3ed>
32a: 48 98 cltq
32c: 0f 28 d3 movaps %xmm3,%xmm2
32f: 48 8d 34 c5 00 00 00 lea 0x0(,%rax,8),%rsi
336: 00
337: 44 0f 28 cc movaps %xmm4,%xmm9
33b: 83 c2 02 add $0x2,%edx
33e: 48 8d 3c 33 lea (%rbx,%rsi,1),%rdi
342: 4c 01 d6 add %r10,%rsi
345: 44 39 e2 cmp %r12d,%edx
348: f3 0f 10 07 movss (%rdi),%xmm0
34c: f3 0f 10 4f 04 movss 0x4(%rdi),%xmm1
351: f3 44 0f 59 c8 mulss %xmm0,%xmm9
356: f3 0f 59 d1 mulss %xmm1,%xmm2
35a: f3 0f 59 c6 mulss %xmm6,%xmm0
35e: f3 0f 59 cf mulss %xmm7,%xmm1
362: f3 41 0f 58 d1 addss %xmm9,%xmm2
367: f3 0f 58 c1 addss %xmm1,%xmm0
36b: f3 0f 58 d5 addss %xmm5,%xmm2
36f: f3 41 0f 58 c0 addss %xmm8,%xmm0
374: f3 0f 5c 16 subss (%rsi),%xmm2
378: f3 0f 5c 46 04 subss 0x4(%rsi),%xmm0
37d: f3 0f 59 d2 mulss %xmm2,%xmm2
381: f3 0f 59 c0 mulss %xmm0,%xmm0
385: f3 0f 58 c2 addss %xmm2,%xmm0
389: f3 0f 11 04 81 movss %xmm0,(%rcx,%rax,4)
38e: 7d 5d jge 3ed <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x3ed>
390: 48 63 d2 movslq %edx,%rdx
393: 48 8d 04 d5 00 00 00 lea 0x0(,%rdx,8),%rax
39a: 00
39b: 48 01 c3 add %rax,%rbx
39e: 48 03 85 38 fe ff ff add -0x1c8(%rbp),%rax
3a5: f3 0f 10 0b movss (%rbx),%xmm1
3a9: f3 0f 10 43 04 movss 0x4(%rbx),%xmm0
3ae: f3 0f 59 e1 mulss %xmm1,%xmm4
3b2: f3 0f 59 d8 mulss %xmm0,%xmm3
3b6: f3 0f 59 f1 mulss %xmm1,%xmm6
3ba: f3 0f 59 c7 mulss %xmm7,%xmm0
3be: f3 0f 58 dc addss %xmm4,%xmm3
3c2: f3 0f 58 c6 addss %xmm6,%xmm0
3c6: f3 0f 58 eb addss %xmm3,%xmm5
3ca: f3 44 0f 58 c0 addss %xmm0,%xmm8
3cf: f3 0f 5c 28 subss (%rax),%xmm5
3d3: f3 44 0f 5c 40 04 subss 0x4(%rax),%xmm8
3d9: f3 0f 59 ed mulss %xmm5,%xmm5
3dd: f3 45 0f 59 c0 mulss %xmm8,%xmm8
3e2: f3 44 0f 58 c5 addss %xmm5,%xmm8
3e7: f3 44 0f 11 04 91 movss %xmm8,(%rcx,%rdx,4)
3ed: 48 8b 45 a8 mov -0x58(%rbp),%rax
3f1: 48 85 c0 test %rax,%rax
3f4: 74 13 je 409 <_ZNK2cv25Affine2DEstimatorCallback12computeErrorERKNS_11_InputArrayES3_S3_RKNS_12_OutputArrayE+0x409>
I have tested your 2-channel version. GCC just went crazy with:
../modules/core/include/opencv2/core/hal/intrin_sse.hpp: In function ‘void cv::v_store_interleave(float*, const cv::v_float32x4&, const cv::v_float32x4&)’:
../modules/core/include/opencv2/core/hal/intrin_sse.hpp:1547:15: warning: unused variable ‘mask_lo’ [-Wunused-variable]
const int mask_lo = _MM_SHUFFLE(2, 0, 2, 0), mask_hi = _MM_SHUFFLE(3, 1, 3, 1);
^~~~~~~
../modules/core/include/opencv2/core/hal/intrin_sse.hpp:1547:50: warning: unused variable ‘mask_hi’ [-Wunused-variable]
const int mask_lo = _MM_SHUFFLE(2, 0, 2, 0), mask_hi = _MM_SHUFFLE(3, 1, 3, 1);
^~~~~~~
I don't think it's safe to use variables as the control for _mm_shuffle_ps, since it actually generates shufps. The control operand of shufps (imm8) must be an immediate. And that's why gcc reports unused variables.
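As a small illustration of what that immediate encodes, the sketch below reproduces the _MM_SHUFFLE bit layout and models the lane selection of shufps in portable C++. `shuffleImm`, `Vec4` and `shufpsModel` are hypothetical names. The control is a template parameter only to mirror the compile-time-constant requirement; this is a scalar model, not the intrinsic itself.

```cpp
#include <array>
#include <cassert>

// Same bit layout as the _MM_SHUFFLE(z, y, x, w) macro: two bits per lane selector.
constexpr int shuffleImm(int z, int y, int x, int w) {
    return (z << 6) | (y << 4) | (x << 2) | w;
}

// The two masks from the warning above match the immediates in the disassembly.
static_assert(shuffleImm(2, 0, 2, 0) == 0x88, "selects the even (x) lanes");
static_assert(shuffleImm(3, 1, 3, 1) == 0xdd, "selects the odd (y) lanes");

using Vec4 = std::array<float, 4>;

// Scalar model of shufps: the two low result lanes are picked from a,
// the two high result lanes from b, each by a 2-bit field of the immediate.
template <int Imm>
Vec4 shufpsModel(const Vec4& a, const Vec4& b) {
    static_assert(Imm >= 0 && Imm <= 0xff, "imm8 is an 8-bit immediate");
    Vec4 r = {{ a[Imm & 3], a[(Imm >> 2) & 3], b[(Imm >> 4) & 3], b[(Imm >> 6) & 3] }};
    return r;
}
```

With a = (x0, y0, x1, y1) and b = (x2, y2, x3, y3), the 0x88 control yields (x0, x1, x2, x3) and the 0xdd control yields (y0, y1, y2, y3), which is the 2-channel deinterleave.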
I have run the tests on this version, and it is the fastest manually vectorized version: about 5% faster than the previous version. I think it is a good job, and the code is perfectly readable.
But gcc 6.1 is still about 10% better.
Geometric mean
Name of Test | calib3d posix x64 ef663f4 auto vec only | calib3d posix x64 ef663f4 vectorized | calib3d posix x64 ef663f4 vectorized vs calib3d posix x64 ef663f4 auto vec only (x-factor)
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0) 0.029 ms 0.031 ms 0.94
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10) 0.072 ms 0.074 ms 0.97
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0) 0.021 ms 0.021 ms 0.98
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10) 0.058 ms 0.059 ms 0.99
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0) 0.036 ms 0.036 ms 0.98
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10) 0.058 ms 0.061 ms 0.96
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0) 0.027 ms 0.026 ms 1.04
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10) 0.060 ms 0.064 ms 0.93
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0) 0.060 ms 0.063 ms 0.95
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10) 0.091 ms 0.088 ms 1.04
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0) 0.039 ms 0.037 ms 1.04
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10) 0.075 ms 0.076 ms 0.99
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 2.781 ms 2.709 ms 1.03
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 4.120 ms 3.870 ms 1.06
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 0.257 ms 0.276 ms 0.93
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 0.952 ms 0.924 ms 1.03
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 3.568 ms 3.383 ms 1.05
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 4.792 ms 4.729 ms 1.01
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 0.298 ms 0.342 ms 0.87
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 1.034 ms 0.982 ms 1.05
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 5.309 ms 5.224 ms 1.02
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 6.017 ms 6.117 ms 0.98
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 0.445 ms 0.478 ms 0.93
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 1.112 ms 1.145 ms 0.97
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 81.476 ms 80.553 ms 1.01
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 138.876 ms 139.970 ms 0.99
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 4.907 ms 5.358 ms 0.92
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 33.169 ms 33.396 ms 0.99
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 102.326 ms 97.172 ms 1.05
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 160.907 ms 152.469 ms 1.06
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 6.150 ms 6.692 ms 0.92
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 35.220 ms 34.744 ms 1.01
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 151.706 ms 146.966 ms 1.03
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 176.685 ms 168.195 ms 1.05
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 9.078 ms 10.040 ms 0.90
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 37.006 ms 38.464 ms 0.96
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0) 0.013 ms 0.013 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10) 0.047 ms 0.048 ms 0.97
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0) 0.010 ms 0.010 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10) 0.040 ms 0.041 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0) 0.017 ms 0.017 ms 0.97
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10) 0.052 ms 0.053 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0) 0.012 ms 0.012 ms 1.02
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10) 0.042 ms 0.043 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0) 0.028 ms 0.027 ms 1.04
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10) 0.045 ms 0.046 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0) 0.016 ms 0.015 ms 1.03
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10) 0.046 ms 0.047 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 1.319 ms 1.253 ms 1.05
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 1.982 ms 1.973 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 0.120 ms 0.130 ms 0.92
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 0.521 ms 0.533 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 1.762 ms 1.745 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 2.465 ms 2.439 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 0.138 ms 0.156 ms 0.89
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 0.555 ms 0.529 ms 1.05
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 2.747 ms 2.781 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 3.158 ms 3.201 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 0.201 ms 0.215 ms 0.94
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 0.574 ms 0.585 ms 0.98
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 35.946 ms 34.784 ms 1.03
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 57.799 ms 57.446 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 2.298 ms 2.523 ms 0.91
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 12.577 ms 13.177 ms 0.95
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 48.473 ms 46.815 ms 1.04
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 72.757 ms 69.004 ms 1.05
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 2.756 ms 2.987 ms 0.92
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 13.268 ms 13.391 ms 0.99
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 77.844 ms 74.696 ms 1.04
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 101.374 ms 97.356 ms 1.04
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 3.846 ms 4.175 ms 0.92
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 13.957 ms 14.556 ms 0.96
I have also tested the latest clang, 3.8.1, and I can confirm your results. For clang, the manually vectorized version is faster.
Geometric mean
Name of Test | calib3d posix x64 d56de49 20160817-103112 | calib3d posix x64 d56de49 20160817-103452 | calib3d posix x64 d56de49 20160817-103452 vs calib3d posix x64 d56de49 20160817-103112 (x-factor)
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 0) 0.068 ms 0.066 ms 1.03
EstimateAffine2D::EstimateAffine::(100, 0.9, LMEDS, 10) 0.137 ms 0.133 ms 1.03
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 0) 0.042 ms 0.036 ms 1.14
EstimateAffine2D::EstimateAffine::(100, 0.9, RANSAC, 10) 0.102 ms 0.097 ms 1.06
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 0) 0.081 ms 0.078 ms 1.04
EstimateAffine2D::EstimateAffine::(100, 0.95, LMEDS, 10) 0.116 ms 0.112 ms 1.03
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 0) 0.052 ms 0.044 ms 1.17
EstimateAffine2D::EstimateAffine::(100, 0.95, RANSAC, 10) 0.111 ms 0.109 ms 1.02
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 0) 0.129 ms 0.125 ms 1.03
EstimateAffine2D::EstimateAffine::(100, 0.99, LMEDS, 10) 0.165 ms 0.161 ms 1.03
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 0) 0.075 ms 0.065 ms 1.16
EstimateAffine2D::EstimateAffine::(100, 0.99, RANSAC, 10) 0.136 ms 0.127 ms 1.07
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 4.917 ms 4.792 ms 1.03
EstimateAffine2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 7.409 ms 7.218 ms 1.03
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 0.861 ms 0.516 ms 1.67
EstimateAffine2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 2.310 ms 2.001 ms 1.15
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 5.992 ms 5.865 ms 1.02
EstimateAffine2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 8.510 ms 8.316 ms 1.02
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 1.081 ms 0.639 ms 1.69
EstimateAffine2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 2.527 ms 2.096 ms 1.21
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 9.402 ms 9.220 ms 1.02
EstimateAffine2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 10.667 ms 10.431 ms 1.02
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 1.648 ms 0.961 ms 1.72
EstimateAffine2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 3.068 ms 2.368 ms 1.30
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 141.065 ms 139.235 ms 1.01
EstimateAffine2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 213.813 ms 211.086 ms 1.01
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 17.388 ms 10.649 ms 1.63
EstimateAffine2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 51.688 ms 45.062 ms 1.15
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 173.158 ms 170.267 ms 1.02
EstimateAffine2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 245.782 ms 242.255 ms 1.01
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 21.924 ms 13.371 ms 1.64
EstimateAffine2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 56.390 ms 47.730 ms 1.18
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 261.662 ms 257.326 ms 1.02
EstimateAffine2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 287.768 ms 283.550 ms 1.01
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 33.363 ms 19.956 ms 1.67
EstimateAffine2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 67.788 ms 54.432 ms 1.25
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 0) 0.034 ms 0.033 ms 1.02
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, LMEDS, 10) 0.091 ms 0.091 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 0) 0.021 ms 0.018 ms 1.13
EstimateAffinePartial2D::EstimateAffine::(100, 0.9, RANSAC, 10) 0.069 ms 0.066 ms 1.04
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 0) 0.044 ms 0.043 ms 1.04
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, LMEDS, 10) 0.102 ms 0.100 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 0) 0.025 ms 0.022 ms 1.14
EstimateAffinePartial2D::EstimateAffine::(100, 0.95, RANSAC, 10) 0.073 ms 0.069 ms 1.05
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 0) 0.067 ms 0.065 ms 1.03
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, LMEDS, 10) 0.096 ms 0.093 ms 1.03
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 0) 0.034 ms 0.028 ms 1.22
EstimateAffinePartial2D::EstimateAffine::(100, 0.99, RANSAC, 10) 0.082 ms 0.077 ms 1.08
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 0) 2.208 ms 2.111 ms 1.05
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, LMEDS, 10) 3.359 ms 3.336 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 0) 0.365 ms 0.236 ms 1.55
EstimateAffinePartial2D::EstimateAffine::(5000, 0.9, RANSAC, 10) 1.179 ms 1.025 ms 1.15
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 0) 3.027 ms 2.927 ms 1.03
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, LMEDS, 10) 4.153 ms 4.137 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 0) 0.455 ms 0.286 ms 1.59
EstimateAffinePartial2D::EstimateAffine::(5000, 0.95, RANSAC, 10) 1.238 ms 1.058 ms 1.17
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 0) 4.884 ms 4.714 ms 1.04
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, LMEDS, 10) 5.431 ms 5.414 ms 1.00
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 0) 0.665 ms 0.402 ms 1.65
EstimateAffinePartial2D::EstimateAffine::(5000, 0.99, RANSAC, 10) 1.447 ms 1.189 ms 1.22
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 0) 62.665 ms 61.630 ms 1.02
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, LMEDS, 10) 92.328 ms 91.188 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 0) 7.406 ms 4.887 ms 1.52
EstimateAffinePartial2D::EstimateAffine::(100000, 0.9, RANSAC, 10) 23.342 ms 20.811 ms 1.12
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 0) 83.958 ms 82.595 ms 1.02
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, LMEDS, 10) 113.672 ms 112.147 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 0) 9.215 ms 5.966 ms 1.54
EstimateAffinePartial2D::EstimateAffine::(100000, 0.95, RANSAC, 10) 25.183 ms 21.715 ms 1.16
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 0) 134.009 ms 131.432 ms 1.02
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, LMEDS, 10) 163.457 ms 161.649 ms 1.01
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 0) 13.343 ms 8.393 ms 1.59
EstimateAffinePartial2D::EstimateAffine::(100000, 0.99, RANSAC, 10) 29.310 ms 24.297 ms 1.21
Good catch on not using vars for mask_lo and mask_hi. I copied that idea from some other code in OpenCV, so that might be broken with GCC, too. I guess that clang handles it ok?
One issue on the 2-channel support is that for SSE I only added it for floats. For NEON it's trivial to do it for all types, but for SSE it seems to be quite complicated for some of the integer types (8x32, in particular). So I just did it for float only rather than put in code I wasn't sure about. I doubt there is much need for 2-channel integer support, but I don't like that it is supported for NEON but not for SSE.
Do you know if there is any support for AVX in OpenCV? That could be 2x faster since it has 8-wide float registers (and also 3-operand instructions that eliminate the need for copy instructions, like the movaps you mentioned above). There is also AVX512, but I think that is not available on the most common Intel processors.
If GCC supports AVX by default (it does) and GCC is auto-vectorizing the computeError() function for you, then why isn't it generating an AVX version instead of an SSE one? Are there some flags that need to be set to enable AVX in OpenCV? Figuring this out could be a big win...
If I use clang -mavx, I do get AVX auto-vectorization. It does a very poor job on the deinterleaving, but GCC will probably do better. But then the binaries can only run on CPUs with AVX. What we want is runtime variants, but I wasn't clear how to do that in a cross-compiler way. GCC has __attribute__((target("avx"))), and clang says it supports that, too, but I got an error when I tried to use it. I looked through existing OpenCV code and didn't see any similar uses.
It would be pretty cool to come up with a simple way in OpenCV to build functions that use AVX when present but also run on non-AVX CPUs. That could speed up a lot of things besides stitching. Or is it better to just build two entire sets of libraries (one with AVX and one without) and have apps dynamically link the right one?
I did a little thinking, and my conclusion is that manual vectorization is probably not a good use of effort in OpenCV these days (possibly with some exceptions). Even though clang has limitations, it's sure to get better soon and GCC is already performing better than the manually vectorized code.
More interesting would be to figure out how to selectively enable AVX. It's at the turning point in adoption where most recent computers support it, but not enough to make it a hard requirement. SSE is already required by default with most compilers, but not yet AVX (maybe in a couple of years?).
It seems like you need to do something like this:
if (check_avx_at_runtime()) {
#pragma enable(avx) // needs to work across compilers
loop goes here // will autovectorize with AVX
#pragma disable(avx)
} else {
same loop goes here // will autovectorize with SSE or NEON, as appropriate for CPU
}
The bummer is that you have two copies of the code to maintain. Could we create a macro that hides this, but isn't too awkward to use for long stretches of code?
This could then be easily added to important vectorizable code segments anywhere in OpenCV.
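One possible way to avoid maintaining two copies of the loop is sketched below. It assumes a compiler with GCC-style function attributes and the `__builtin_cpu_supports` builtin; on anything else it simply falls back to the baseline build. The macro names `SKETCH_X86_GNU` and `SKETCH_FORCE_INLINE` and all function names are made up for this illustration, none of them exist in OpenCV, and whether the loop is actually auto-vectorized still depends on the compiler and optimization flags.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper macros for this sketch (not OpenCV macros).
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#  define SKETCH_X86_GNU 1
#  define SKETCH_FORCE_INLINE inline __attribute__((always_inline))
#else
#  define SKETCH_X86_GNU 0
#  define SKETCH_FORCE_INLINE inline
#endif

// The loop is written once. Forced inlining lets each wrapper below
// compile its own copy for its own instruction set.
static SKETCH_FORCE_INLINE void squaredNormLoop(const float* dx, const float* dy,
                                                float* err, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        err[i] = dx[i] * dx[i] + dy[i] * dy[i];
}

#if SKETCH_X86_GNU
// AVX code generation is enabled for this one function only.
static __attribute__((target("avx"))) void squaredNormAvx(const float* dx, const float* dy,
                                                          float* err, std::size_t n)
{
    squaredNormLoop(dx, dy, err, n);
}
#endif

// Compiled with the default flags of the build (e.g. SSE2 on x86-64).
static void squaredNormBaseline(const float* dx, const float* dy,
                                float* err, std::size_t n)
{
    squaredNormLoop(dx, dy, err, n);
}

// Entry point: picks a variant at run time.
void squaredNorm(const float* dx, const float* dy, float* err, std::size_t n)
{
    assert(n == 0 || (dx && dy && err));
#if SKETCH_X86_GNU
    if (__builtin_cpu_supports("avx"))
    {
        squaredNormAvx(dx, dy, err, n);
        return;
    }
#endif
    squaredNormBaseline(dx, dy, err, n);
}
```

The loop body lives in one place, and the per-ISA wrappers are short enough that a macro could generate them. This does not address compilers without the target attribute.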
1: I think the codegen is ok with GCC. I was just explaining why it generates the warning. For other parts of OpenCV I don't know, but I got these warnings only for your code. But I could have missed something, as there were lots of compile units with these warnings.
Yes, the 2-channel version is probably most useful for deinterleaving Point2f. I think when you open a PR with these changes, OpenCV gurus will help you with that, as this is a cool core feature. :) It's also nice that you fixed the documentation.
There is support for AVX, see the cmake flags ENABLE_AVX and ENABLE_AVX2. There is also AVX code in OpenCV, which is guarded by CV_AVX, for example in modules/imgproc/src/accum.cpp. OpenCV, however, generally goes with SSE2; I think that's the primarily supported platform. AVX512 is currently MIC only. In my experience AVX will not be 2 times faster.
2: I have been running my tests with disabled AVX to get comparable results. Flags: see above.
3: That's why there are runtime checks for SSE support everywhere, even though the code is guarded by macros. Distributions need to build for everyone, so they turn all these features off. OpenCV has SSE enabled by default, but then does all these runtime checks so that people can build OpenCV with SSE and safely ship that version as a general x86 version for everyone, and people can benefit from SSE (now almost everybody has these extensions). I'm not sure if this is working, as distributions still turn these off (for example archlinux).
4: Yes, that is similar to the pattern for SSE. I don't think, however, that it will be beneficial for most users. Distributions will turn it off anyway, and I'm guessing people concerned about performance are building their own versions of OpenCV tailored for their architecture (or they should).
What seems interesting to me would be the option to disable manually vectorized code and use auto-vectorization, while of course building with all vector extensions enabled. This would probably need to be done on a per-function basis, but I'm guessing that for some functions auto-vectorization would be faster than the current manually vectorized code, especially when it could use newer instructions, as most of OpenCV is vectorized with SSE2.
To sum up: I'm going to change computeError() to work with floats, which makes it faster in all cases. But when this gets merged (I hope soon, as GSoC is finishing right now) and the 2-channel deinterleave gets merged, feel free to add the vectorized version. I think the latest iteration is very nice and it will probably bring a speedup for most users these days.
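For readers following along, a float-only error loop for a 2x3 affine model could look roughly like the sketch below. It is a standalone illustration, not the code from this PR: `Pt` is a hypothetical stand-in for cv::Point2f, and `computeErrorFloat` and the flat 6-element model layout are assumptions made here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Pt { float x, y; };  // hypothetical stand-in for cv::Point2f

// Squared reprojection error of each correspondence under the 2x3 affine
// model [m[0] m[1] m[2]; m[3] m[4] m[5]], computed entirely in float.
std::vector<float> computeErrorFloat(const std::vector<Pt>& from,
                                     const std::vector<Pt>& to,
                                     const float m[6])
{
    assert(from.size() == to.size());
    std::vector<float> err(from.size());
    for (std::size_t i = 0; i < from.size(); ++i)
    {
        // Residual between the transformed source point and its match.
        const float a = m[0] * from[i].x + m[1] * from[i].y + m[2] - to[i].x;
        const float b = m[3] * from[i].x + m[4] * from[i].y + m[5] - to[i].y;
        err[i] = a * a + b * b;
    }
    return err;
}
```

Keeping every intermediate value in float avoids float-to-double conversions inside the loop, which is the kind of simple loop body compilers tend to auto-vectorize.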
BTW: We get pretty hardcore in optimizing one function, but are you sure the rest of the ransac code etc. is optimal? Do you have some logs from profiler? I somehow can't believe the real computation is the bottleneck (which would mean the function is optimal). (Also #7101 might be nice for tuning OpenCV.)
Thanks for the summary. I will create a PR for the 2-channel support. Should I also create one for the LMedS nth_element() change, or did you include that in your changes?
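As a rough illustration of the idea behind the nth_element() change (not the actual patch), the median of the error values can be found by partial selection instead of a full sort. The function name `medianError` is made up for this sketch, and averaging the two middle values for an even count is an assumption about the desired median definition.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of the errors via partial selection (average O(n)) instead of a
// full O(n log n) sort. Takes the vector by value because nth_element
// reorders its input.
float medianError(std::vector<float> err)
{
    assert(!err.empty());
    const std::size_t mid = err.size() / 2;
    std::nth_element(err.begin(), err.begin() + mid, err.end());
    float med = err[mid];
    if (err.size() % 2 == 0)
    {
        // Even count: the lower middle value is the largest element that
        // nth_element left in front of position mid.
        const float lower = *std::max_element(err.begin(), err.begin() + mid);
        med = 0.5f * (lower + med);
    }
    return med;
}
```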
In terms of tuning, for RANSAC with a large number of points I think that > 50% of the time is spent in computeError(). It is the only part of the algorithm that scales with n. When I added the SSE version, the overall speed went up 2x for large n compared to no vectorization. But I haven't profiled it, which I should. There is also a variant of RANSAC that only evaluates each model with a subset of the points. That could be faster when n is very large.
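To make the subset idea concrete, here is a deliberately simple sketch that scores a model on every stride-th correspondence only. It is not a specific published RANSAC variant and not OpenCV code: `Match`, `countInliersSubsampled`, and the strided sampling are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Match { float x, y, u, v; };  // hypothetical: (x, y) should map to (u, v)

// Counts inliers of the 2x3 affine model m among every stride-th match.
// stride == 1 evaluates all matches; larger strides cut the cost of the
// error evaluation roughly by that factor.
std::size_t countInliersSubsampled(const std::vector<Match>& matches,
                                   const float m[6], float threshold,
                                   std::size_t stride)
{
    assert(stride > 0);
    const float thr2 = threshold * threshold;  // compare squared distances
    std::size_t inliers = 0;
    for (std::size_t i = 0; i < matches.size(); i += stride)
    {
        const Match& p = matches[i];
        const float a = m[0] * p.x + m[1] * p.y + m[2] - p.u;
        const float b = m[3] * p.x + m[4] * p.y + m[5] - p.v;
        if (a * a + b * b <= thr2)
            ++inliers;
    }
    return inliers;
}
```

A model that scores well on the subset would still need a full evaluation before being accepted, so the saving only applies to models that are rejected early.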
In your application, how many points do you typically have? In my application (video stabilization) the number of points is only ~200, and 25-50% of those get rejected as outliers due to camera motion between frames (you want the stabilization to lock onto the distant points rather than nearby features that have apparent motion).
Also, what level of accuracy do you need? Do you use the LM refinement or just go with the best model computed with 2 points? The LM part could probably be optimized, as well.
Thanks for the valuable discussion. The LMedS change is not included.
I have between 200 and 500 points from feature matching. I use LM with the estimate* functions and then I run my own LM to optimize the whole system of transformations again. For me the estimate* functions are fast enough; there are slower things in my process.
OK, I'll make a PR for the LMedS change, too. For me, finding and tracking the features is quite a lot slower than the motion estimation. This just seemed like a fun optimization problem to work on where I could learn more about OpenCV while also learning from someone else working in the same area. Good luck with finishing up your GSOC project!
The 2-channel support for universal intrinsics is PR #7182. And the LMedS optimization is PR #7183.
There is a merge conflict in modules/calib3d/src/precomp.hpp. Could you resolve it? (or enable the "Allow edits from maintainers" option for this PR)
rebased on the current master. I have also squashed some of the fixup commits and minor changes and reworded some commits with typos.
Let me know if there are some other issues. Thanks for reviewing.
:+1:
@hrnr Could you please resolve conflict again? #7443 was merged first =( There are two commits affected (you may use "git merge" or squash your commits into one before rebase).
rebased again.
Merge with extra: https://github.com/opencv/opencv_extra/pull/303
This PR contains all work for the "New camera model for stitching pipeline" GSoC 2016 project.
GSoC Proposal
The stitching pipeline is well-established code in OpenCV. It provides good results for creating panoramas from camera-captured images. The main limitation of the stitching pipeline is its expected camera model (perspective transformation). Although this model is fine for many applications working with camera-captured images, there are applications which aren't covered by the current stitching pipeline.
New camera model
Due to physical constraints, some applications can expect a much simpler transform with fewer degrees of freedom. Those are situations when the input data are not subject to a perspective transform. The transformation can be much simpler, such as an affine transformation. Datasets considered here include images captured by special hardware (such as book scanners[0] that try hard to eliminate perspective), maps from laser scanning (produced from different starting points), and preprocessed images (where perspective was compensated by other robust means, taking advantage of the physical situation, e.g. for book scanners we would use data from calibration to compensate for the remaining perspective). In all those situations we would like to obtain an image mosaic under an affine transformation.
I'd like to introduce a new camera model based on affine transformation to the stitching pipeline. This would include:
I used an approach based on affine transformation to merge maps produced by multiple robots [1] for my robotics project. It shows good results. However, as mentioned earlier, applications for this model are much broader than that.
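To illustrate what fewer degrees of freedom means in practice, the sketch below recovers a 4-DOF transform (rotation, uniform scale, translation) exactly from two point correspondences by treating 2D points as complex numbers. It is a self-contained illustration and not the estimator from this PR; `PartialAffine` and `partialAffineFromTwoPoints` are hypothetical names, and the caller is assumed to have rejected coincident source points beforehand.

```cpp
#include <cassert>
#include <complex>

// 4-DOF transform [a -b tx; b a ty]: rotation + uniform scale + translation.
struct PartialAffine { float a, b, tx, ty; };

// With points as complex numbers the model is z' = s * z + t, where
// s = a + i*b and t = tx + i*ty. Two correspondences determine s and t.
PartialAffine partialAffineFromTwoPoints(std::complex<float> from0,
                                         std::complex<float> from1,
                                         std::complex<float> to0,
                                         std::complex<float> to1)
{
    const std::complex<float> d = from1 - from0;
    // Precondition: coincident source points make the system degenerate.
    assert(std::norm(d) > 0.0f);
    const std::complex<float> s = (to1 - to0) / d;  // rotation and scale
    const std::complex<float> t = to0 - s * from0;  // translation
    return PartialAffine{ s.real(), s.imag(), t.real(), t.imag() };
}
```

A full affine transform has 6 degrees of freedom and needs three non-collinear correspondences, while a homography has 8 and needs four, which is why the simpler models need smaller samples per RANSAC iteration.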
Parallelism for FeaturesFinder
To make the stitching pipeline more comfortable to use and more performant for a large number of images, I’d also like to improve FeaturesFinder to allow finding features in parallel. All camera models and other users of FeaturesFinder may benefit from that. The API could be similar to FeaturesMatcher::operator ()(features, pairwise_matches, mask). This could be done with TBB in a similar manner to the mentioned method in FeaturesMatcher, which is already being used in the stitching pipeline, so there would be almost no additional overhead from starting new threads in typical scenarios, because these threads are already there for FeaturesMatcher. This change would be fully integrated into the high-level stitching interface.
There might be some changes necessary in finders to ensure thread-safety. Where thread-safety can’t be ensured or does not make sense (GPU finders), parallelization would be disabled and all images would be processed serially, so this method would always be safe to use regardless of the underlying finder. This approach is also similar to FeaturesMatcher.
Benefits to OpenCV
implemented goals (all + extras)
new camera model
parallel feature finding
implemented extras
video
other work
During this GSoC I have also coded some related work that is not going to be included (mostly because we have chosen a different approach or the work has been merged under this PR). It is listed here for completeness.
PRs:
6560
6609
6615
6642
commits: