Issue
After updating the Rust float-parsing algorithms to improve performance, I've noticed in my own code that the slow-path algorithms can still be optimized using a limited subset of big-integer arithmetic: an implementation that is easy to understand, safe, requires little code, and is at least 6x faster than the existing decimal implementation. The new code is extremely efficient, and outperforms all other float-parsing implementations for up to 6400 digits (and very likely higher), often by 10x or more. It even beats glibc's implementation in all cases; glibc, which last I checked uses GMP under the hood and is therefore a very good performance baseline, is slow for common cases but very efficient in its slow-path algorithm.
Detailed Description
In order to accurately parse floats in near-halfway cases (described in depth in the PR referenced above), we need to use a large-integer representation of our significant digits, since a truncated representation has rounding error that is ambiguous with a halfway point, obscuring which way to round.
The existing implementation uses a fixed-width decimal type, which has a few notable drawbacks:
To scale the significant digits, it requires shifts on every significant digit in the buffer.
It makes heavy use of magic numbers which, although well-tested, may not be immediately understandable to a maintainer.
Performance dwindles heavily in certain near-halfway cases: it is generally 6x slower than the proposed big-integer approach, and can be up to 45x slower.
An improvement is to use the fact that the Lemire algorithm guarantees we are within 1 ULP of the correct result, and therefore use big-integer arithmetic to determine how to round. This is specialized for two cases: when the exponent is positive relative to the significant digits, and when it is negative.
The fix is relatively simple, and is described in depth in a PR to the C++ implementation on which Rust's dec2flt is based:
Have a stack-allocated, fixed-width, vector-like type.
This is quite easy, since we only need a stack-allocated vector with enough bits to hold 10^(769 + 342), or ~4000 bits, where 769 is the maximum number of significant digits required to accurately round a float (767 digits, plus an extra 2), and 342 is the largest magnitude of an exponent (for the smallest denormal float, 5e-342). Our vector-like class only needs a few methods:
Push/pop elements to and from the vector.
Extract the high 64-bits from the class.
Resize/extend the vector from a slice.
Implement total ordering and equality assuming little-endian limb order.
Get the length and capacity, and query whether the vector is empty.
Since the vector will only contain integer types, we shouldn't need to worry about dropping them manually if using MaybeUninit. A reference implementation can be seen here; a minimal sketch of such a type follows.
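As a minimal sketch, assuming 64-bit limbs (the names StackVec and try_push are illustrative, not the reference implementation's API):

use core::mem::MaybeUninit;

/// Illustrative stack-allocated vector: 62 limbs * 64 bits ~ 4000 bits,
/// enough to hold 10^(769 + 342).
pub struct StackVec {
    data: [MaybeUninit<u64>; 62],
    length: usize,
}

impl StackVec {
    pub const fn new() -> Self {
        Self {
            data: [MaybeUninit::uninit(); 62],
            length: 0,
        }
    }

    pub fn len(&self) -> usize {
        self.length
    }

    pub fn is_empty(&self) -> bool {
        self.length == 0
    }

    /// Push a limb, failing (rather than panicking) when capacity is full.
    pub fn try_push(&mut self, value: u64) -> Option<()> {
        if self.length == self.data.len() {
            return None;
        }
        self.data[self.length] = MaybeUninit::new(value);
        self.length += 1;
        Some(())
    }

    /// Pop the most-significant limb (the end, in little-endian limb order).
    pub fn pop(&mut self) -> Option<u64> {
        if self.length == 0 {
            return None;
        }
        self.length -= 1;
        // SAFETY: every limb below `length` was initialized by `try_push`.
        Some(unsafe { self.data[self.length].assume_init() })
    }
}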
Implement a very limited subset of big-integer arithmetic.
The only real operations we need are the following: scalar addition, scalar multiplication, addition and multiplication of a scalar value to a big integer, and addition and multiplication between two big integers. Everything else is derived trivially. We can use simple algorithms as well: for the number of limbs we use, asymptotically faster algorithms like Karatsuba multiplication are actually slower. This means we can use algorithms like grade-school multiplication, which is easily shown to be correct.
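For example, scalar multiplication is a single carry-propagating pass over the limbs. A sketch assuming 64-bit limbs and a vector type with iter_mut and try_push, as above (not the reference implementation itself):

/// Multiply the big integer in-place by a scalar, widening each product
/// to 128 bits so the product plus carry can never overflow.
pub fn small_mul(x: &mut VecType, y: u64) -> Option<()> {
    let mut carry: u64 = 0;
    for limb in x.iter_mut() {
        let z = (*limb as u128) * (y as u128) + (carry as u128);
        *limb = z as u64;
        carry = (z >> 64) as u64;
    }
    if carry != 0 {
        // The result grew by one limb.
        x.try_push(carry)?;
    }
    Some(())
}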
We then need to make sure the limbs are normalized, which means that any most-significant limbs with zero values are removed. In little-endian limb order, this is very easy, since we just pop from the end of the vector, as in the sketch below. We then need algorithms to exponentiate by a power of 2, 5, and 10.
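As a sketch, normalization is just popping zero limbs from the end (assuming the vector dereferences to a slice):

/// Remove most-significant zero limbs, which sit at the end of the
/// vector in little-endian limb order.
pub fn normalize(x: &mut VecType) {
    while x.last() == Some(&0) {
        x.pop();
    }
}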
Exponentiating by a power-of-two is quite easy, since we can merely shift the bits: break the shift into bits within a limb, and the number of limbs to shift:
/// Shift-left buffer by n bits.
#[inline]
pub fn shl(x: &mut VecType, n: usize) -> Option<()> {
    let rem = n % LIMB_BITS;
    let div = n / LIMB_BITS;
    if rem != 0 {
        shl_bits(x, rem)?;
    }
    if div != 0 {
        shl_limbs(x, div)?;
    }
    Some(())
}
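A hedged sketch of the two halves (the bodies here are illustrative, assuming the vector type supports indexing, iter_mut, try_push, and try_resize; the reference implementation linked below is authoritative):

/// Shift left by `n` bits, `0 < n < LIMB_BITS`: shift each limb and
/// carry the spilled high bits into the next (more-significant) limb.
pub fn shl_bits(x: &mut VecType, n: usize) -> Option<()> {
    debug_assert!(n != 0 && n < LIMB_BITS);
    let shl = n;
    let shr = LIMB_BITS - n;
    let mut prev: Limb = 0;
    for limb in x.iter_mut() {
        let t = *limb;
        *limb = (t << shl) | (prev >> shr);
        prev = t;
    }
    // If bits spilled past the old most-significant limb, add a limb.
    let carry = prev >> shr;
    if carry != 0 {
        x.try_push(carry)?;
    }
    Some(())
}

/// Shift left by `n` whole limbs: move every limb up by `n` positions
/// and zero-fill the `n` new least-significant limbs.
pub fn shl_limbs(x: &mut VecType, n: usize) -> Option<()> {
    debug_assert!(n != 0);
    let len = x.len();
    // Fails if the vector's fixed capacity would be exceeded.
    x.try_resize(len + n, 0)?;
    for i in (0..len).rev() {
        x[i + n] = x[i];
    }
    for i in 0..n {
        x[i] = 0;
    }
    Some(())
}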
Full implementations of shl_bits and shl_limbs are shown in the reference implementation below. To exponentiate by a power-of-5, we can do it iteratively by multiplying by powers-of-5. To do so efficiently, we use one pre-computed large power-of-5, 5^135, which is chosen because its exponent is 5 times that of the largest native power, 5^27, and it showed good performance in benchmarks.
/// Pre-computed large power-of-5 for 32-bit limbs.
#[cfg(not(all(target_pointer_width = "64", not(target_arch = "sparc"))))]
pub const LARGE_POW5: [u32; 10] = [
    4279965485, 329373468, 4020270615, 2137533757, 4287402176, 1057042919, 1071430142, 2440757623,
    381945767, 46164893,
];

/// Pre-computed large power-of-5 for 64-bit limbs.
#[cfg(all(target_pointer_width = "64", not(target_arch = "sparc")))]
pub const LARGE_POW5: [u64; 5] = [
    1414648277510068013,
    9180637584431281687,
    4539964771860779200,
    10482974169319127550,
    198276706040285095,
];
/// Exponent step for the pre-computed large power-of-5 (135 = 5 * 27).
pub const LARGE_POW5_STEP: u32 = 135;
pub fn pow(x: &mut VecType, mut exp: u32) -> Option<()> {
    // Minimize the number of iterations for large exponents: just
    // do a few steps with the large power.
    while exp >= LARGE_POW5_STEP {
        large_mul(x, &LARGE_POW5)?;
        exp -= LARGE_POW5_STEP;
    }

    // Now use our pre-computed small powers iteratively.
    // This is calculated as `⌊log5(2^LIMB_BITS - 1)⌋`.
    let small_step = if LIMB_BITS == 32 { 13 } else { 27 };
    let max_native = (5 as Limb).pow(small_step);
    while exp >= small_step {
        small_mul(x, max_native)?;
        exp -= small_step;
    }
    if exp != 0 {
        // SAFETY: safe, since `exp < small_step`.
        let small_power = unsafe { f64::int_pow_fast_path(exp as usize, 5) };
        small_mul(x, small_power as Limb)?;
    }
    Some(())
}
This requires minimal static storage (smaller than the decimal class's requirements), is intuitive (and therefore easy to validate), and is performant.
Finally, we can use the fact that 10^N = 5^N * 2^N, so we can implement efficient multiplication by a power-of-10 as a multiplication by a power-of-5 followed by a bitshift.
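For example, given the pow (power-of-5) and shl helpers above, a power-of-10 multiply might be sketched as:

/// Multiply by `10^exp` as `5^exp` followed by a left-shift of `exp` bits.
pub fn pow10(x: &mut VecType, exp: u32) -> Option<()> {
    pow(x, exp)?;
    shl(x, exp as usize)
}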
Note: Our algorithm to parse the significant digits as a big-integer is uninteresting, but can be seen here. It will not be described in depth.
Ensure efficient scalar operations for limbs in the big-integer type.
On 64-bit architectures, almost all tested systems seem to support efficient 128-bit multiplication, except for SPARC (v8 and v9). The code used to ensure this can be found here. Tested systems that are known to explicitly support 128-bit multiplication include x86_64, MIPS64, and s390x. Platforms where 64-bit high and low multiplication is supported include ARM64, PPC64, and RISC-V64. On all 32-bit systems and on SPARC, we use 32-bit limbs; otherwise, we use 64-bit limbs.
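As a sketch, the limb selection can reuse the same cfg predicate shown for the pre-computed powers above (these are the Limb and LIMB_BITS names used in the snippets throughout):

use core::mem;

// 64-bit limbs wherever 128-bit (or 64-bit high/low) multiplication is
// efficient; 32-bit limbs on all 32-bit targets and on SPARC.
#[cfg(all(target_pointer_width = "64", not(target_arch = "sparc")))]
pub type Limb = u64;
#[cfg(not(all(target_pointer_width = "64", not(target_arch = "sparc"))))]
pub type Limb = u32;

pub const LIMB_BITS: usize = mem::size_of::<Limb>() * 8;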
An algorithm to determine the correct way to round for exponents positive relative to the significant digits.
This is quite easy: if we have a float like 12345e300, we can break this down as 12345 * 10^300, which means we can create a big-integer representation and merely pull the most-significant 64 bits. In the case of an exact tie with a halfway point, we can check if any of the truncated bits are non-zero, and use that to direct round-nearest, tie-even.
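A sketch of the extraction step (hi64 is an illustrative name; assumes normalized, little-endian, 64-bit limbs): it returns the most-significant 64 bits, shifted so the high bit is set, plus a flag recording whether any truncated lower bits were non-zero.

/// Extract the high 64 bits of a (normalized) big integer, and whether
/// any of the remaining lower bits are non-zero.
pub fn hi64(x: &[u64]) -> (u64, bool) {
    match x.len() {
        0 => (0, false),
        1 => {
            // Normalize so the high bit is set (guard zero to avoid a
            // 64-bit shift).
            if x[0] == 0 {
                (0, false)
            } else {
                (x[0] << x[0].leading_zeros(), false)
            }
        }
        n => {
            let hi = x[n - 1];
            let lo = x[n - 2];
            let lz = hi.leading_zeros();
            if lz == 0 {
                // The top limb already has its high bit set: everything
                // in `lo` and below is truncated.
                (hi, lo != 0 || x[..n - 2].iter().any(|&v| v != 0))
            } else {
                // Merge the top bits of `lo` into the result; the rest
                // of `lo` and all lower limbs are truncated.
                let mant = (hi << lz) | (lo >> (64 - lz));
                let truncated = (lo << lz) != 0 || x[..n - 2].iter().any(|&v| v != 0);
                (mant, truncated)
            }
        }
    }
}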
An algorithm to determine the correct way to round for exponents negative relative to the significant digits.
This is slightly trickier, but also quite simple: say we have a float like 12345e-300. We can break this down as 12345 / 10^300. However, dividing two big integers is not ideal, since big-integer division is slow. A much faster approach is to create a theoretical representation of the digits of the halfway point, and then compare it to our actual digits.
For this, I'll use the following terminology: b is the float rounded down, or below the halfway point; b+u is the next float larger than b; and b+h is exactly halfway between them (and cannot be represented natively).
To create a native representation, we can first round down our extended-precision float (calculated from the Eisel-Lemire algorithm) to b, and convert it to an extended representation. This is shown here.
Next, we want to scale both the theoretical digits of b+h and the real significant digits to the same exponent, using only multiplication. We therefore have theor_digits * 2^X and real_digits * 10^Y, where Y < 0. We want to evaluate theor_digits * 2^X cmp real_digits * 10^Y, which we can re-arrange as theor_digits * 2^X * 10^-Y cmp real_digits, and break down as theor_digits * 2^(X-Y) * 5^-Y cmp real_digits. In practice, this is quite simple to implement, and quite efficient:
let binary_exp = theor_exp - real_exp;
let halfradix_exp = -real_exp;
if halfradix_exp != 0 {
    theor_digits.pow(5, halfradix_exp as u32).unwrap();
}
if binary_exp > 0 {
    theor_digits.pow(2, binary_exp as u32).unwrap();
} else if binary_exp < 0 {
    real_digits.pow(2, (-binary_exp) as u32).unwrap();
}
Now, we just need to compare the real and theoretical digits, and direct our rounding for fp (our extended-precision float) accordingly:
// Compare our theoretical and real digits and round nearest, tie even.
let ord = real_digits.data.cmp(&theor_digits.data);
round::<F, _>(&mut fp, |f, s| {
    round_nearest_tie_even(f, s, |is_odd, _, _| {
        // Can ignore `is_halfway` and `is_above`, since those were
        // calculated using less significant digits.
        match ord {
            cmp::Ordering::Greater => true,
            cmp::Ordering::Less => false,
            cmp::Ordering::Equal if is_odd => true,
            cmp::Ordering::Equal => false,
        }
    });
});
fp
Modify the Eisel-Lemire algorithm to return b, rather than a generic error.
This is quite simple, since we already have an extended-precision representation that is within the interval [b, b+u]. We must add a marker so we know when an error occurs (we use a biased exponent less than 0, additionally biased by i16::MIN, for this purpose), and ensure we have an accurate representation of our significant digits.
To do this, we add two functions, which are just small extensions of the existing algorithm:
/// Fallback algorithm to calculate the non-rounded representation.
/// This calculates the extended representation, and then normalizes
/// the resulting representation, so the high bit is set.
#[inline]
pub fn compute_error<F: Float>(q: i32, mut w: u64) -> ExtendedFloat {
    let lz = w.leading_zeros() as i32;
    w <<= lz;
    let hi = compute_product_approx(q, w, F::MANTISSA_SIZE as usize + 3).1;
    compute_error_scaled::<F>(q, hi, lz)
}

/// Compute the error from a mantissa scaled to the exponent.
#[inline]
pub fn compute_error_scaled<F: Float>(q: i32, mut w: u64, lz: i32) -> ExtendedFloat {
    // Want to normalize the float, but this is faster than ctlz on most architectures.
    let hilz = (w >> 63) as i32 ^ 1;
    w <<= hilz;
    let power2 = power(q as i32) + F::EXPONENT_BIAS - hilz - lz - 62;
    ExtendedFloat {
        mant: w,
        exp: power2 + F::INVALID_FP,
    }
}
Then, if we are not within a safe exponent range (within [-27, 55]), we return a scaled error from our already-scaled significant digits:
let inside_safe_exponent = (q >= -27) && (q <= 55);
if !inside_safe_exponent {
    return compute_error_scaled::<F>(q, hi, lz);
}
Likewise, if our significant digits were truncated and running the algorithm with mantissa + 1 gives a different result than the original pass, we compute the non-scaled error:
let mut fp = compute_float::<F>(num.exponent, num.mantissa);
if num.many_digits && fp.exp >= 0 && fp != compute_float::<F>(num.exponent, num.mantissa + 1) {
    // Need to re-calculate, since the previous values are rounded
    // when the slow path algorithm expects a normalized extended float.
    fp = compute_error::<F>(num.exponent, num.mantissa);
}
fp
Although we could technically make this more efficient, it is computationally extremely cheap relative to big-integer arithmetic, easy to justify logically, and avoids any performance regression for common cases.
Pull Request
If there is interest, I will gladly submit a PR. I own all of the aforementioned code, including the algorithms involved, so there are no licensing issues whatsoever.
@Alexhuszagh I'm definitely interested :) happy to review any PR you open! Like before, I can't comment on whether the math is right, but I can certainly look at the rest.