quartiq / stabilizer

Firmware and software for the Sinara Stabilizer module with high speed, low latency ADC/DAC data processing and powerful DSP algorithms in between
http://quartiq.de/stabilizer/
Apache License 2.0
110 stars 25 forks source link

sin/cos/atan2/sqrt implementation analysis #191

Closed matthuszagh closed 3 years ago

matthuszagh commented 3 years ago

cos/sin

Conclusions presented at the end of this post.

The main options for a cos/sin implementation are:

  1. Lookup table (LUT) using integer phase values and fixed-point output values.
  2. f32 polynomial approximation
    • May either be a custom implementation or an implementation contributed to libm (e.g., adapted from newlib) (see https://github.com/rust-lang/libm/issues/248). One reason to possibly favor a custom implementation is that our implementation does not require support for phases > 2pi, whereas the libm implementation does.

fixed-point LUT

A fixed-point LUT implementation stores a 16-bit fixed-point sine value and 16-bit fixed-point cosine value (as a combined 32-bit integer) for some chosen number of phase values, where the phase values are equally spaced between 0 and pi/4. The number of phase values stored is a power of 2.

algorithm description

There is a Jupyter notebook for a migen implementation that is mostly the same as the algorithm described below.

There are a few different approaches to a fixed-point LUT. We could:

  1. Store sin and cos of phase values between 0 and 2pi.
  2. Store sin (or cos) of phase values between 0 and pi/2 and map values between pi/2 and 2pi to an equivalent expression with a phase value between 0 and pi/2. Cos values can be mapped to equivalent sin values (or vice versa).
  3. Store sin and cos of phase values between 0 and pi/4. All other phases can be converted to an equivalent expression with a phase in this range.

Option (1) is possibly the fastest because it does not require mapping inputs, but requires 8x the memory of (2) and (3). This greatly increased memory usage may actually make it slower than the other methods. Options (2) and (3) use the same size LUT, but (3) is preferred in cases where sin and cos are computed simultaneously. The reason for this is that the same phase value can be used for the sin and cos. This allows the phase mapping logic to be performed only once.

The mapping of all phase values to the 0 to pi/4 range is shown in the table below.

MSB bits phase (x) range sin(x) cos(x)
000 0 <= x < pi/4 sin(x) cos(x)
001 pi/4 <= x < pi/2 cos(pi/2-x) sin(pi/2-x)
010 pi/2 <= x < 3pi/4 cos(x-pi/2) -sin(x-pi/2)
011 3pi/4 <= x < pi sin(pi-x) -cos(pi-x)
100 pi <= x < 5pi/4 -sin(x-pi) -cos(x-pi)
101 5pi/4 <= x < 3pi/2 -cos(3pi/2-x) -sin(3pi/2-x)
110 3pi/2 <= x < 7pi/4 -cos(x-3pi/2) sin(x-3pi/2)
111 7pi/4 <= x < 2pi -sin(2pi-x) cos(2pi-x)

The table also displays the 3 MSB bit values corresponding to each range when the LUT size is a power of 2. We could use these to select the sin/cos expressions to compute. Alternatively, the Migen implementation uses some clever bit manipulations (described below) to arrive at the phase. Our implementation should probably use the same procedure, but it's worth actually benchmarking this to see which is faster. The process used by Migen is:

  1. Store the MSB bits and then mask off the MSB bits from the phase. This subtracts the lower bound of the range from the phase. Denote the MSBs as z0, z1 and z2 proceeding from most to least significant.
  2. If z2 is 1 we change our phase to be pi/2-x. If z2 is 0, leave the phase unchanged.
  3. If z1^z2 is 1 swap the sin and cos.
  4. Finally, if z0^z1 is 1 negate the cos value and if z0 is 1 negate the sin value.

The final step is to perform interpolation. Two options for this are linear interpolation between the bounding values in the LUT or a linear interpolation based on the derivative value at each point. The Migen implementation devotes some of the LUT bits to the derivative. However, since the linear interpolation method using bounding values does not require any additional storage and is cheap to compute, I believe it makes more sense in this application.

If the integer phase range is 8x that of the LUT (remember, the LUT only stores sin/cos for phases between 0 and pi/4), we can use it directly for indexing the LUT (mask off the 3 MSBs). If the integer phase range is less than 8x the LUT index range but a power of 2 we can left shift the input (after masking off the 3 MSBs) and use the resulting phase to directly index the LUT.

If the range is greater than 8x the LUT index range but a power of 2 we can store the LSBs corresponding to the difference in size. We then right-shift the input phase by this amount and use that value to index the LUT. In the case of linear interpolation, we also retrieve the next higher index value from the LUT. These correspond to the lower and upper bounds of a linear interpolation. We take the difference of the bounds, multiply by the value in the LSBs and right shift by the number of LSBs. This is added to the lower bound. By using fixed-point values, we can do the interpolation using 2 memory accesses, an integer subtraction, an integer multiplication, an integer bit shift and an integer addition. Everything except for the memory accesses must be done twice: once each for the sin and cos.

When the integer phase is not a power of 2 we must map it to a power of two and then perform the steps above. The basic way to do this is to convert the phase to an f32, multiply it by a scaling factor, round and then convert back to an integer. This could be problematic for very slow reference frequencies. In that case, we could right shift down to a value that doesn't incur significant floating point errors and then perform the same process (i.e., scale, round, convert to int). This conversion to float and back again is not ideal. However, I don't see a way around it if it's possible to have arbitrary reference periods.

accuracy

We can the assess the accuracy of an implementation by testing whether it introduces errors in comparison to an "ideal" implementation in the following signal processing flow:

A value is sampled by the 16-bit ADC. That value is then multiplied by the sine function implementation using an arbitrary phase value between 0 and pi/4. The product is then quantized by the 16-bit DAC.

We approximate the ideal implementation using a double-precision floating-point implementation. I've currently used simple truncation for simplicity. But, if we'd like avoid biases introduced by intermediate trunctation we could use a non-biasing rounding method such as convergent rounding. Additionally, I've ignored the accuracy of the cos implementation, assuming that the sin analysis also reflects the cos accuracy.

import numpy as np

ADC_BITS = 16
FIXED_POINT_BITS = 16
MAX_FIXED_POINT = 1 << (FIXED_POINT_BITS - 1)
# Choosing an odd value here prevents the tested phases from directly
# coinciding with the quantized phases, which would artificially
# depress the standard deviation results.
NUM_PHASES = 507

def linear_interpolate(x, xlow, xhigh, ylow, yhigh):
    if np.isclose(x, xlow):
        return ylow
    if np.isclose(x, xhigh):
        return yhigh
    slope = (yhigh - ylow) / (xhigh - xlow)
    return ylow + slope * (x - xlow)

adc_samples = np.transpose(
    np.array([np.arange(-(1 << (ADC_BITS - 1)), (1 << (ADC_BITS - 1)))])
)
phases = np.linspace(0, np.pi / 4, NUM_PHASES)
sin_ideal = np.array([np.sin(phases)])

def error(lut_depth):
    lut_size = 1 << lut_depth

    # need the floor and ceiling for interpolation
    quantized_phases = [
        (
            np.floor(phase * lut_size) / lut_size,
            phase,
            np.ceil(phase * lut_size) / lut_size,
        )
        for phase in phases
    ]
    # shift the phase values up by half the minimum phase increment
    phase_adjust = 0.5 * np.pi / 4 / lut_size
    sin_approx = np.array(
        [
            [
                # Assumes a simple truncation. This introduces a bias
                # since it always rounds down. If we need something
                # better we could use convergent rounding.
                int(
                    linear_interpolate(
                        mid,
                        floor,
                        ceil,
                        np.round(
                            np.sin(floor + phase_adjust) * MAX_FIXED_POINT
                        ),
                        np.round(
                            np.sin(ceil + phase_adjust) * MAX_FIXED_POINT
                        ),
                    )
                )
                for (mid, floor, ceil) in quantized_phases
            ]
        ]
    )

    dac_ideal = np.round(np.matmul(adc_samples, sin_ideal))
    # Again, simple truncation. We can do better if necessary
    dac_approx = np.int32(np.matmul(adc_samples, sin_approx) / MAX_FIXED_POINT)
    return dac_ideal - dac_approx

print("{:5}  {:<11}  {:<7}  {:<3}".format("depth", "mean", "stddev", "max"))
for depth in range(5, 20):
    err = error(depth)
    print(
        "{:5}  {:<11.4e} {:8.3f} {:4}".format(
            depth, np.mean(err), np.std(err), np.int32(np.max(np.abs(err)))
        )
    )

Output:

depth  mean         stddev   max
    5  -1.4306e-03  163.557  651
    6  -7.4386e-04   83.038  425
    7  -3.8644e-04   41.526  249
    8  -1.8602e-04   20.781  110
    9  -1.0528e-04   12.146  182
   10  -5.2729e-05    6.222   91
   11  -2.8953e-05    4.039   79
   12  -1.8539e-05    3.021   60
   13  -1.2219e-05    2.171   40
   14  -1.0955e-05    2.065   55
   15  -3.3106e-06    1.127   16
   16  -1.8660e-06    0.973   10
   17  2.7087e-07     0.764    6
   18  1.8058e-07     0.768   10
   19  3.9125e-07     0.708    4

A reasonable choice of LUT depth is probably in the range of 10-13 bits (4-32KiB). Although a standard deviation less than 1 (which would mean most results do not suffer any accuracy loss) would be great, this would require a 256KiB LUT, which is probably excessive.

alternative implementations

CMSIS uses a floating-point output LUT. It stores values for the full range 0 to 2pi using 512 values (depth of 9) and then performs linear interpolation on phases in-between. Nothing about this implementation is particularly compelling. Based on the analysis performed earlier for the Migen implementation, this implementation should have poor accuracy and high memory usage.

We could perform a variant of this implementation with an integer input and fixed-point output. Additionally, we could omit the logic used to confine the phase to be within the 2pi range (actually the integer equivalent). This resulting implementation would be similar to the Migen implementation except that it would omit the phase mapping stage and require 8x the storage for the same accuracy. Since the phase mapping only requires a few operations, this is probably a bad tradeoff.

floating-point polynomial approximation

libm/musl

The libm implementation first stores the sign bit, masks it off the phase and then performs a branch of the now positive phase value. x represents the phase. The ranges are:

  1. 0 <= |x| <= 2**-12
  2. 2**-12 <= |x| <= pi/4
  3. pi/4 <= |x| <= 3pi/4
  4. 3pi/4 <= |x| <= 5pi/4
  5. 5pi/4 <= |x| <= 7pi/4
  6. 7pi/4 <= |x| <= 9pi/4

It then performs an argument reduction if needed. Since this part is not relevant to us it won't be discussed in detail. However, it's worth noting that the argument reduction isn't attempted until after we check to see if we're in a phase range less than 2pi. Therefore, it shouldn't incur a performance penalty when not needed. It is also worth noting that this argument reduction uses 64-bit floating-point intermediate values. An upstream change to libm sincosf would also need to remedy this. This would be development effort not directly applicable to stabilizer.

If the phase is in range (1) we return sin and cos values directly. When in range (2) the polynomial approximation is performed on the phase directly. In all ranges, the each polynomial approximation (sin and cos) has to work for phase values between -pi/4 and pi/4. The rest of the phase mapping is thematically similar to the Migen implementation. The first obvious difference is that branches are used to select the appropriate angle and trig function to compute rather than the clever bit manipulations Migen can use because it uses power of 2 fixed-point. Additionally, libm must deal with negative arguments. The fact that libm must handle negative phase values is another downside to this implementation from our perspective. Moreover, the implications of this is worse than just unneeded phase value checks since the polynomial approximation is designed to be accurate between -pi/4 and pi/4, whereas ours only needs to be accurate between 0 and pi/4. We should be able to get improved accuracy/performance by centering our expansion at pi/8 instead of 0.

The polynomial approximation is just a Taylor series using the first 5 terms. In other words, sin is computed as

x - x**3/3! + x**5/5! - x**7/7! + x**9/9!

and cos is computed as

1 - x**2/2! + x**4/4! - x**6/6! + x**8/8!

The powers are computed and the coefficients are stored as constants. An upstream change to libm should only need to change these coefficients to be single-precision floating point and change the function signature to take a single-precision phase value. To understand why libm uses 5 terms, it's helpful to plot the error and the 32-bit floating point precision between 0 and pi/4. This is shown in the plot below.

import numpy as np
import matplotlib.pyplot as plt

def sin_taylor(x, terms=5):
    x = np.single(x)
    return np.sum(
        [
            np.single(
                (-1) ** n * x ** (2 * n + 1) / np.math.factorial(2 * n + 1)
            )
            for n in range(terms)
        ]
    )

def cos_taylor(x, terms=5):
    x = np.single(x)
    return np.sum(
        [
            np.single((-1) ** n * x ** (2 * n) / np.math.factorial(2 * n))
            for n in range(terms)
        ]
    )

phases = np.linspace(0, np.pi / 4, 1000)
vsin = np.vectorize(sin_taylor)
sin_error_4 = np.sin(phases) - np.double(vsin(phases, 4))
sin_error_5 = np.sin(phases) - np.double(vsin(phases, 5))
tolerance = np.sin(phases) * np.finfo(np.single).eps
plt.plot(phases, tolerance)
plt.plot(phases, np.absolute(sin_error_4))
plt.plot(phases, np.absolute(sin_error_5))
plt.legend(("FP32 precision", "4 terms", "5 terms"))
plt.show()

sin_taylor_tol

For 5 terms, the error is almost always below the floating-point precision of the corresponding value. Moreover, adding extra terms does not improve the accuracy. Therefore, 5 terms provides the cheapest computation without incurring loss of accuracy.

newlib

Newlib has a sincos implementation but it is just separate calls to sin and cos. This means that all argument reduction code is duplicated. The sin and cos implementations are reasonably similar to those of libm. The general idea is to restrict the argument to the range -pi/4 to pi/4 and then perform a Taylor expansion. Newlib's argument reduction differs from that of libm. Whereas libm performs a branch directly to an appropriate sin or cos Taylor expansion for any absolute phase less than 9pi/4 before performing a more general argument reduction, newlib performs this reduction for anything outside of -pi/4 to pi/4. For our purposes, libm's strategy is probably superior to that of newlib. To assess the performance implications of this decision for more general purposes would require benchmarking.

Newlib uses the first 7 terms of the Taylor series as an approximation. This choice doesn't make sense to me, since (based on the analysis above) any terms above 5 do not provide additional accuracy.

My initial impression is that the libm implementation is better than that of newlib, though benchmarks would be needed to verify this if we do want to use a floating-point implementation.

custom solution

I expect that one of the major downsides to implementing a 32-bit sincos function in libm is the need to implement 32-bit rem_pio2f and rem_pio2_large implementations. These account for a large amount of the code used in sincos and are not needed for our case. The logic in sincos apart from rem_pio2f is not particularly complicated, and therefore a custom solution is a reasonable possibility. Still, for the sake of maintenance and because the full implementation should not incur a performance penalty, it would still probably be preferable to implement a full solution in libm and use that if we go the floating-point route.

additional thoughts

There are a number of important considerations for deciding between a fixed-point implementation and a floating-point implementation:

  1. Do we want to transition to integer math for the rest of the DSP (e.g., the IIR implementation)? If so, a fixed-point solution makes the most sense.
  2. What is the difference in computational complexity between the LUT implementation and polynomial expansion implementation? Although the LUT implementation should be faster, a proper comparison would test both on an STM32H743.
  3. The short-term development efforts for the LUT implementation and the custom polynomial expansion implementations should be similar and relatively minimal after this analysis. The short-term development effort required for implementing a polynomial approximation in libm would be greater than each of the first 2.
  4. The long-term development effort (maintenance cost) of the libm implementation would be lower than each of the other two implementations. I feel this is more important than short-term development effort, so if a floating-point implementation is desired then the libm solution is most likely preferable.

A fixed-point LUT implementation (which is similar to the one used in misoc) seems to be the best solution at this time.

Analyses of atan2 and sqrt implementations will follow.

TODO atan2

TODO sqrt

jordens commented 3 years ago

Thanks!

Your LUT depth test should give at least the results of the migen cossin (which is bit-accurately modeled and analyzed in the notebook) because that uses very coarse pre-scaled derivatives to save two multipliers. To get 1.4 LSB max and 0.6 LSB rms error we only need a 10 deep LUT (for one octant), 4 kB.

I think the relevant newlib implementation of sin/cos/sincos is not what you link to. You should look at this one (.h).

Comparing the current (double) libm impl with a (float) newlib is probably not meaningful. That's also the reason one has 7 terms and the other 5.

As mentioned, the phase will be fixed point anyway at its source. Converting it to f32 and back (for LUT lookup) is likely a bad idea. We don't need the wrapping of a float implementation. I don't know off-hand how well an expansion with fixed point works. The reduction from 0-2pi to one quadrant/octant is pretty much the same for both implementations. The only remaining comparison (CPU only, since 4kB memory is acceptable) is series expansion vs linear interpolation on LUT data.

Let's do the following (comments welcome).

Other:

matthuszagh commented 3 years ago

I think the relevant newlib implementation of sin/cos/sincos is not what you link to. You should look at this one (.h). [...]

  • Port the good newlib impl to rust. Benchmark it against (FFI, same test harness) the C impl. Verify that it makes sense.

This implementation uses double-precision intermediate values. If I'm going to adapt this to an all single-precision version, does it also make sense to do the same for the current rust libm version and compare them?

As mentioned, the phase will be fixed point anyway at its source. Converting it to f32 and back (for LUT lookup) is likely a bad idea.

I'm still not entirely clear on this. For instance, if we get a reference period of 100 (in units of the internal counter period) and the LUT table has size 1024, then to index the LUT we first need to scale input timestamps so that the reference period is some power of 2. For instance, the scaling factor on ADC phases in this example could be 128/100. This requires conversion to f32, rounding and conversion back to integer. Or, am I missing something here?

Note that to get the midpoint data and its derivative, you still only ever have to look up one i16x2 midpoint entry, not two adjacent ones.

I believe this also requires intermediate floating-point math (for the pi/4 scaling factor). Although I guess I could approximate this with integer math (i.e., 3/4).

* `atan2`: I have no idea about the implementations and whether there are significant difference between f32/i32. I think it's likely that we want `f32` inputs here.

Worth noting that the current libm implementation already uses f32 exclusively. It'll be worth evaluating the performance, of course, but there may not be much work to do here. Still, I'll see what I can find about i32 implementations.

* Re `sqrt`. I think we can live without it for two reasons: (a) there is `vsqrt` on the architecture and (b) having to deal with the squared magnitude instead of the linear one will be fine AFAICT.

Ok.

jordens commented 3 years ago

This implementation uses double-precision intermediate values. If I'm going to adapt this to an all single-precision version, does it also make sense to do the same for the current rust libm version and compare them?

That item is after the LUT checkpoint (not now). My assumption is that those ARM people tuned and optimized the algo to a representative and relevant mix of embedded ARM cores, including ours. We do have a double precision FPU (but AFAICT aren't really well equipped to use it due to LLVM). Iff we go past the checkpoint, I'd just stick to that impl more or less verbatim first (assuming the concept is optimal). And then if it's noticably bad, maybe try again with single precision.

I'm still not entirely clear on this. For instance, if we get a reference period of 100 (in units of the internal counter period) and the LUT table has size 1024, then to index the LUT we first need to scale input timestamps so that the reference period is some power of 2.

Right. I had updated #170 describing the math a couple days ago. We need to agree on the math in that issue first. Yes, there is a division in there. But I'm not sure whether it's better to do that in (a) i32 with some numerator/denominator scaling or clever division algorithm, (b) i64, or (c) f32. But in any case ultimately we have integer input data (from the PLL) and for a LUT we always need an integer modulo power of two. What we do in between is certainly open for discussion. I don't know how much penalty i32 -> f32 division -> i32 over i32 -> something with i32 or i64 -> i32 is.

I believe this also requires intermediate floating-point math (for the pi/4 scaling factor). Although I guess I could approximate this with integer math (i.e., 3/4).

In the worst case the linear interpolation by delta_phi from midpoint is (delta_phi: i32) * ((lut_data: i16) as i32) * (scaled_pi4: const i32). It's pretty much for free. No?

Worth noting that the current libm implementation already uses f32 exclusively. It'll be worth evaluating the performance, of course, but there may not be much work to do here. Still, I'll see what I can find about i32 implementations.

I wonder whether it is actually important for us to be accurate here. It may be fine to accept quite some deviation from atan2 as long as the points on the i/q axes are correctly mapped to the right angles to e.g. within 1e-5. Maybe there are relevant approximations out there.

matthuszagh commented 3 years ago

That item is after the LUT checkpoint (not now). My assumption is that those ARM people tuned and optimized the algo to a representative and relevant mix of embedded ARM cores, including ours. We do have a double precision FPU (but AFAICT aren't really well equipped to use it due to LLVM). Iff we go past the checkpoint, I'd just stick to that impl more or less verbatim first (assuming the concept is optimal). And then if it's noticably bad, maybe try again with single precision.

Ok.

Right. I had updated #170 describing the math a couple days ago. We need to agree on the math in that issue first. Yes, there is a division in there. But I'm not sure whether it's better to do that in (a) i32 with some numerator/denominator scaling or clever division algorithm, (b) i64, or (c) f32. But in any case ultimately we have integer input data (from the PLL) and for a LUT we always need an integer modulo power of two. What we do in between is certainly open for discussion. I don't know how much penalty i32 -> f32 division -> i32 over i32 -> something with i32 or i64 -> i32 is.

Ok.

In the worst case the linear interpolation by delta_phi from midpoint is (delta_phi: i32) * ((lut_data: i16) as i32) * (scaled_pi4: const i32). It's pretty much for free. No?

Yes, true. So, getting the interpolation for a sin value could be ((cos LSBs 804) >> (depth + #(LSBs) + 10) (and adjusted for half-up rounding). 804, 10 is just one set of values we could use to approx pi/4. LSBs is the LSB values giving the phase interval between successive LUT values and #(LSBs) is the number of LSB bits.

I wonder whether it is actually important for us to be accurate here. It may be fine to accept quite some deviation from atan2 as long as the points on the i/q axes are correctly mapped to the right angles to e.g. within 1e-5. Maybe there are relevant approximations out there.

Ok. We can investigate this after the first checkpoint.

jordens commented 3 years ago

Yes, true. So, getting the interpolation for a sin value could be ((cos LSBs 804) >> (depth + #(LSBs) + 10) (and adjusted for half-up rounding). 804, 10 is just one set of values we could use to approx pi/4. LSBs is the LSB values giving the phase interval between successive LUT values and #(LSBs) is the number of LSB bits.

Use the full i32 range for the intermediate calculations. The i32 phase will be split into: 3 octant bits, N LUT bits, M interpolation bits. Having Z = 20 total phase bits going into the computation is sufficient for the accuracy/quantization we need. So it'll be something like N = 10, 7 <= M that need to go into the interpolation. Then use a K bit fixed point representation for the pi/4 constant to max out the i32 multiplication range before bias addition and right shifting. It might be most accurate to distribute the 32 bit evenly among the three factors 17 (or 18) bit LUT, M, and K, like 11, 11, 10) or doing two separate i32 multiplications and two biasing and two shifts. Watch out where you actually need the sign and where it is know throughout the octant. I think only the M delta-phi bits are signed. The rest is fixed sign. The misoc core should have all that figured out.

Note that the LUT data is 17/18 bit effectively with 2x16 bit storage because of the known sign. Using the extra bit (since 1/2 < cos(x) <= 1 in the first octant) is also cheap and interesting.

It looks like with that we should be able to exceed Complex<i16> output accuracy by a few bits. Let's go for Complex<i32> output to not be limited by that.

jordens commented 3 years ago

With all that we may or may not be limited by the quadratic term. It's not worth optimizing beyond that.

p.s. Max error is 1-cos(1/2*pi/4*1/(2 << depth)) < 1e-7 for 10 bit deep LUT so still a couple of bits out. I wonder whether we could even go to 8 bit deep LUT and still be better than 18 bit output with full accuracy linear interpolation. @matthuszagh let's see whether your impl confirms that.

matthuszagh commented 3 years ago

Closing since #197 has been merged. atan2 analysis will be opened in a new issue.