moble / quaternion

Add built-in support for quaternions to numpy
MIT License

from_rotation_matrix improved algorithms #148

Closed: willwray closed this issue 4 years ago

willwray commented 4 years ago

According to the results of this survey paper

A Survey on the Computation of Quaternions from Rotation Matrices, Soheil Sarabandi and Federico Thomas, Journal of Mechanisms and Robotics 11(2), October 2018

the preferred algorithm for an orthonormal input matrix should be Cayley’s method (section 3.5 page 14).

• Cayley’s method, besides being the simplest one, is superior in terms of accuracy and speed. Among the numerical methods, the second version of Bar-Itzhack’s method is the only one that deserves some attention as it can be used to obtain the quaternion corresponding to a non-perfectly orthogonal rotation matrix [48]. In this case, the quaternion corresponds to the nearest orthogonal matrix to the input non-orthogonal matrix, where closeness is expressed in the Frobenius norm [52].

I'd be interested to review and test the implementation of the Bar-Itzhack algorithm.

If you'd like then I can attempt to contribute PRs:

  1. Change from Markley to Cayley for the orthonormal case
  2. Tests of the Bar-Itzhack implementation for speed and accuracy
     2.1. Improve Bar-Itzhack speed and/or accuracy
willwray commented 4 years ago

I did a quick first implementation of Cayley's method (I still need to do copysign for the components).

It occurs to me that this solution q should be a good first approximation to the eigenvector in the Bar-Itzhack algorithm, since the matrices used in the two methods are closely related; in that case, iterative improvement of the solution should converge quickly.

Ideally, the nonorthogonal=True default argument can be removed with a robust algorithm:

  1. Compute the 4 quaternion components q using (homogeneous) Cayley's method
  2. Iteratively improve q by repeated multiplication by P (eqs. 41 & 42 of the linked paper)

There could be an intermediate step to detect whether the input matrix is far from being a rotation, or this could be handled by stopping after a small fixed number of iterative improvements and returning an error result if q has not converged; a sketch of that loop follows.
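Here's a minimal sketch of that refinement loop (my own, so the function name, the spectral shift, and the tolerance are assumptions, not anything from the paper or this package): power iteration on the symmetric 4x4 Bar-Itzhack matrix, seeded with the Cayley estimate.

import numpy as np

def refine_quaternion(P, q0, max_iters=5, tol=1e-15):
    # Power iteration toward the dominant eigenvector of the symmetric
    # 4x4 matrix P (eqs. 41 & 42), starting from the Cayley estimate q0.
    # The shift by 2*I is an assumption on my part, to make the target
    # eigenvalue strictly dominant in magnitude.
    M = P + 2.0 * np.eye(4)
    q = q0 / np.linalg.norm(q0)
    for _ in range(max_iters):
        q_next = M @ q
        q_next /= np.linalg.norm(q_next)
        if np.linalg.norm(q_next - q) < tol:
            return q_next, True      # converged
        q = q_next
    return q, False  # not converged: input may be far from a rotation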

I'll code this up and test.

moble commented 4 years ago

Hi, and thanks for your interest! I had never seen that Cayley algorithm before, but it certainly looks worth a shot.

I'm a little skeptical that it would be worthwhile trying to beat LAPACK — which is what numpy/scipy use under the hood — for speed, even with a good guess (though there's lots of wrapper code that may be unnecessary). But I'm very worried that there are lots of weird edge cases that LAPACK has already dealt with that would be very hard to account for in custom code.

willwray commented 4 years ago

I hacked on the paper's MATLAB code more last night (using Octave) and reproduced its results showing good accuracy and performance for Cayley compared to the branching methods.

I completed the Cayley algorithm update to homogeneous form, with copysign to choose the signs, then added a few 'iterative improvements' as above, amounting to 4D matrix-vector multiplies, possibly followed by a final normalization. It shows promise; I've emailed the author for feedback.

The use of iterative improvement of a good first approximation entirely avoids generic SVD. So, if it works, this should be more efficient.

IMO more needs to be done to evaluate the accuracy in different regions of quaternion space. The results then also need to be compared against the current implementations for accuracy and performance; see the round-trip sketch below for one way to set that up.
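One concrete way to run such a comparison (a sketch of my own; quat_to_mat and the random sampling scheme are illustrative, not from this package) is a round-trip harness:

import numpy as np

def quat_to_mat(q):
    # Standard unit-quaternion -> rotation-matrix formula
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def max_roundtrip_error(mat_to_quat, n=10_000, seed=0):
    # Worst componentwise round-trip error over n random rotations
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(n):
        q = rng.normal(size=4)
        q /= np.linalg.norm(q)          # uniform random unit quaternion
        q_back = mat_to_quat(quat_to_mat(q))
        # q and -q represent the same rotation, so compare against both
        err = min(np.abs(q_back - q).max(), np.abs(q_back + q).max())
        worst = max(worst, err)
    return worst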

willwray commented 4 years ago

Here's a paper that derives floating point error bounds of quaternion operations:

Algorithms for Manipulating Quaternions in Floating-Point Arithmetic, Mioara Joldeş and Jean-Michel Muller, ARITH 2020

As well as covering the regular arithmetic operations, they look at quaternion <-> matrix conversions and reference the matrix-to-quaternion survey paper linked above. They derive error bounds for the branching method but do not analyse the Cayley method.

The paper points out that if FMA (fused multiply-add) instructions are available, then they allow alternative algorithms with a different balance of performance and accuracy:

By doing this [using FMA algorithms] the calculations will be faster, and frequently more accurate. However, the error bound will remain unchanged

This will likely be acceptable and beneficial to most users.

Perhaps it's best to close this issue (and the other quaternion-to-matrix issue), pending a full comparison of the alternatives and trade-offs involved, including FMA algorithms.

moble commented 4 years ago

Unfortunately, because of how numpy works, FMA isn't really on the table. Basically, numpy expressions are processed by python first, which then passes certain things to numpy. So for example, when you write a * x + b, python does a * x first (by looking at the a and x objects, and figuring out that multiplying them means calling numpy with a and x as arguments to some function), then takes the result of that operation and does the + b part. Numpy never gets the chance to figure out that there's any FMA that could happen.
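Concretely, for numpy arrays a, x, and b, the expression a * x + b executes roughly as two separate ufunc calls:

import numpy as np

a, x, b = np.ones(3), np.ones(3), np.ones(3)   # example arrays
tmp = np.multiply(a, x)    # python dispatches `a * x` here first
result = np.add(tmp, b)    # then `tmp + b`; no chance to fuse into an FMA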

I suspect that all this discussion is really about micro-optimizations that are insignificant when calling through python — where overheads due to python figuring out which functions to call, or unnecessary allocations, or multiple passes over arrays, etc., will totally overwhelm any micro-optimizations within the algorithm. (Though these might be more relevant to, e.g., the quaternionic package, which can use numba to compile things, and thus could use FMAs and avoid those overheads.) Add to this the fact that quaternion is a pretty stable package, so that messing with results even at the level of roundoff may cause headaches with tests and reproducibility, and I am inclined against changes of these types unless there is very significant improvement to be had.

willwray commented 4 years ago

Thanks for closing... This turned out to be a bigger undertaking than expected.

I suspect that all this discussion is really about micro-optimizations

In fact, the intention was to improve accuracy. For example, the quaternion floating point arithmetic paper (linked above) gives error analysis for two quaternion multiplication algorithms:

  1. "Naive" algorithm "does not allow one to bound the component-wise relative error of the quaternion product..."
  2. "Ogita, Rump and Oishi’s algorithm" "a much better bound than" the naive multiplication, by using FMA to carry exact error terms

Now, the error analysis gets very involved for the matrix-to-quaternion conversion considered in this issue. Statistical testing of random rotations, as done in the survey paper referenced at the start, is easier. However, it turns out that the 'position' of a rotation in both matrix and quaternion component space has a large effect on the error bounds for the conversions, and there are edge cases (actually, more literally 'corner cases', as they correspond to corners of a hypercube) which may be missed by random tests.

So it turns out that this is still very much an active research area. And, anyway, the current algorithms are generally pretty accurate:

CONCLUSION: We have given relative error bounds for the major operations required for manipulating quaternions in floating-point arithmetic. These bounds are small, which confirms the fact that quaternions are easy to manipulate.

moble commented 4 years ago

anyway, the current algorithms are generally pretty accurate

Yeah. The tests cover various corner cases and some random values as well. Conversion from quaternion to matrix is accurate to within 2 times machine precision, and the round-trip from quaternion to matrix and back is never as bad as 5 times machine precision for the Markley algorithm and 10 for Bar-Itzhack. I seriously doubt that any application can realistically get anywhere close to requiring such precision with 64-bit floats, because far larger errors will surely result from basically anything used to construct a rotation matrix. The Joldeş-Muller paper seems much more relevant for applications where they might be using half-precision or even lower-precision floats — which is certainly useful given the increasing relevance of AI, computer graphics, and embedded systems — but probably not so relevant for double-precision floats.

I should also point out that Cayley's method as described by Sarabandi and Thomas is incorrect. Specifically, the method of setting the relative signs of the components can give incorrect results whenever the resulting quaternion has scalar component very close to 0. There needs to be a special case for that condition, falling back to — say — fixing the sign of the i component to be positive and setting the signs of j and k given that assumption. But then the i component may be close to 0, and so on. That is, I think it has to be something more like this:

import numpy as np

eps = 1e-12  # sign-decision threshold; this particular value is an assumption

def from_rotation_matrix_orthonormal(rot, q):
    # Fill q (length-4 output array) with the quaternion for the
    # orthonormal 3x3 matrix rot, using Cayley's method with a cascade
    # of sign fixes for components that are too close to 0.
    signs_unset = True
    sign1 = 1
    sign2 = 1
    sign3 = 1
    q[0] = np.sqrt(
        (rot[0, 0] + rot[1, 1] + rot[2, 2] + 1) ** 2
        + (rot[2, 1] - rot[1, 2]) ** 2
        + (rot[2, 0] - rot[0, 2]) ** 2
        + (rot[1, 0] - rot[0, 1]) ** 2
    ) / 4
    if q[0] > eps:
        # Scalar part safely nonzero: fix all vector signs relative to w > 0
        signs_unset = False
        sign1 = 1 if rot[2, 1] >= rot[1, 2] else -1
        sign2 = 1 if rot[0, 2] >= rot[2, 0] else -1
        sign3 = 1 if rot[1, 0] >= rot[0, 1] else -1
    q[1] = sign1 * np.sqrt(
        (rot[0, 0] - rot[1, 1] - rot[2, 2] + 1) ** 2
        + (rot[2, 1] - rot[1, 2]) ** 2
        + (rot[2, 0] + rot[0, 2]) ** 2
        + (rot[1, 0] + rot[0, 1]) ** 2
    ) / 4
    if signs_unset and q[1] > eps:
        # w ~ 0 but i is nonzero: take i > 0 and fix j, k relative to it
        signs_unset = False
        sign2 = 1 if rot[0, 1] >= 0.0 else -1
        sign3 = 1 if rot[0, 2] >= 0.0 else -1
    q[2] = sign2 * np.sqrt(
        (-rot[0, 0] + rot[1, 1] - rot[2, 2] + 1) ** 2
        + (rot[2, 1] + rot[1, 2]) ** 2
        + (rot[2, 0] - rot[0, 2]) ** 2
        + (rot[1, 0] + rot[0, 1]) ** 2
    ) / 4
    if signs_unset and q[2] > eps:
        # w ~ i ~ 0 but j is nonzero: take j > 0 and fix k relative to it
        sign3 = 1 if rot[1, 2] >= 0.0 else -1
    q[3] = sign3 * np.sqrt(
        (-rot[0, 0] - rot[1, 1] + rot[2, 2] + 1) ** 2
        + (rot[2, 1] + rot[1, 2]) ** 2
        + (rot[2, 0] + rot[0, 2]) ** 2
        + (rot[1, 0] - rot[0, 1]) ** 2
    ) / 4
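For example, a quick check of the happy path (my own snippet, using the eps defined above):

R = np.eye(3)                          # identity rotation
q = np.empty(4)
from_rotation_matrix_orthonormal(R, q)
# q is now [1.0, 0.0, 0.0, 0.0]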

Curiously, despite the much more complicated arithmetic in this function, it is somewhat faster than Markley's method, which has fewer memory accesses, squares, and sqrts. This is true even when I rewrite the current implementation as a gufunc. The only reason I can think of is that Markley's method could be harder hit by branch prediction failures, but my experiments suggest that's only responsible for a small portion of the difference. (EDIT: Most of the difference is because of np.argmax in the Markley algorithm; doing that step more cleverly brings its time down significantly, though still not quite to that of Cayley's method. The remainder of the difference is consistent with branch prediction failures, which should be insignificant for reasonably continuous rotations.)

In any case, both take on the order of tens of nanoseconds per rotation, which means I don't suppose it will be much of a bottleneck for any realistic code. So I'm going to leave this code alone.

willwray commented 4 years ago

Yes, the sign assignment has to be done more carefully, including accounting for 0.0-valued leading components. Here's a modified version of the Cayley algorithm that I'm testing:

import numpy as np

def R_to_q(R):
    '''Convert 3x3 rotation matrix to unit quaternion'''
    # Homogenizing scalar: equals 1 when R is orthonormal (||R||_F^2 == 3)
    q2 = np.sqrt(np.sum(np.square(R)) * 3) / 3

    # Symmetric 4x4 matrix; for an exact rotation it equals 4*q*q^T
    Q = np.array([[ R[0,0]+R[1,1]+R[2,2]+q2,  R[2,1]-R[1,2],  R[0,2]-R[2,0],  R[1,0]-R[0,1]],
                  [ R[2,1]-R[1,2],  R[0,0]-R[1,1]-R[2,2]+q2,  R[0,1]+R[1,0],  R[2,0]+R[0,2]],
                  [ R[0,2]-R[2,0],  R[0,1]+R[1,0], -R[0,0]+R[1,1]-R[2,2]+q2,  R[1,2]+R[2,1]],
                  [ R[1,0]-R[0,1],  R[2,0]+R[0,2],  R[1,2]+R[2,1], -R[0,0]-R[1,1]+R[2,2]+q2]])

    # Component magnitudes: column sums of |Q| are proportional to |q_i|
    q = np.sum(np.abs(Q), axis=0)
    q /= np.linalg.norm(q)

    # The row with the largest diagonal entry carries the most reliable
    # signs; flip it, if needed, so its first nonzero entry is positive
    m = np.argmax([Q[0,0], Q[1,1], Q[2,2], Q[3,3]])
    for i in range(m):
        if Q[m,i] != 0.0:
            if Q[m,i] < 0.0:
                Q[m,:] = np.negative(Q[m,:])
            break
    # Transfer the signs of row m onto the magnitudes, in place
    np.copysign(q, Q[m,:], q)

    return q
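A quick sanity check (my example): a 90-degree rotation about the z axis should give w = z = 1/sqrt(2).

R = np.array([[0., -1., 0.],
              [1.,  0., 0.],
              [0.,  0., 1.]])
print(R_to_q(R))   # ~ [0.70710678, 0., 0., 0.70710678]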