rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
98.02k stars 12.68k forks source link

RangeInclusive<Integer>::sum Optimization Suggestion #94922

Closed ghost closed 2 years ago

ghost commented 2 years ago

The idiomatic way to sum a range of integers from a up to b is simply (a..=b).sum(). Yet, this always emits suboptimal codegen, as LLVM cannot perfectly optimise this, yet slightly better mathematics can be employed to do the same, more efficiently.

I suggest that specialisation be applied to range inclusive for integer elements where T::BITS < 128, to exchange a default internal implementation of sum for a better optimised implementation. (Note that when an intrinsic is created for u128::carrying_{op} to yield the carry value, as opposed to just a boolean indicator, then this could be implemented for all types using carrying arithmetic; tag me then?)

given a, b, of integer type T, and the immediately wider type Wider, one can compute the sum via

// a: T, b: T
let a = Wider::from(a);
let b = Wider::from(b);
let sum_result = ((a + b) * (1 + b).saturating_sub(a) / 2);
// the intermittent values may exceed the bounds of {T}, but the result could still fit
if overflow_checks_enabled && !(Wider::from(T::MIN)..=Wider::from(T::MAX)).contains(&sum_result) {
    panic!("integer underflow/overflow")
} else {
    // if overflow checks are disabled, casting will work as expected, otherwise it fits as-is
    return sum_result as T
}

If this wasn't the right place to have made the issue, please redirect me as appropriate.

ghost commented 2 years ago

Update, barely coming back around to this, implemented a special case for u128. I'm using

sum: Z² -> Z
sum(x, y) = [x ≤ y] ⋅ ⌊(x + y) ⋅ (y - x + 1) ÷ 2⌋ (mod 2ⁿ)

This is incorrect because the intermittent operations are modulo 2ⁿ, and the final result after the division has (n - 1) bits. We can divide the even factor between x + y and y - x + 1 (note that -x ≡ x (mod 2), so these two factors have strictly unequal parities), then multiply, and only the factor being divided will need to use carrying instructions.

sum(x, y) = [x ≤ y] ⋅ {
 ⌊(x + y) ÷ 2⌋ ⋅ (y - x + 1), if 2 | (x + y),
 (x + y) ⋅ ⌊(y - x + 1) ÷ 2⌋, else
}

I rearrange the expression so that only one side needs any carrying instructions:

sum(x, y) = [x ≤ y] ⋅ ⌊(a ⋅ x + y + b) ÷ 2⌋ ⋅ (c ⋅ x + y + d)
a = { +1, if (x + y) is even, -1, else }
b = [(x + y) is odd]
c = { +1, if (x + y) is odd, -1, else }
d = [(x + y) is even]

Concrete implementation looks like so:

type T = usize;

// f(x, y) = (x + y ⋅ 2ⁿ) ÷ 2 = x ÷ 2 + y ⋅ 2^(n - 1)
fn divide_by_2_with_carry(
 x: T,
 carry: bool
) -> T {
 (x >> 1) | (T::from(carry) << (T::BITS - 1))
}

// (cneg(x, y), (0 != x) & y)
fn overflowing_cneg(
 x: T,
 y: bool
) -> (T, bool) {
 if y {
  x.overflowing_neg()
 } else {
  (x, false)
 }
}

fn cneg(x: T, y: bool) -> T {
 if y {
  x.wrapping_neg()
 } else {
  x
 }
}

// sum(x, y) = [x ≤ y] ⋅ (x + y) ⋅ (y - x + 1) ÷ 2
pub fn sum(x: T, y: T) -> T {
 T::from(x <= y) * {
  let is_odd = 1 == 1 & (x ^ y);
  let a = {
   let (result, underflowed) = overflowing_cneg(x, is_odd);
   let (result, overflowed_1) = result.overflowing_add(y);
   let (result, overflowed_2) = result.overflowing_add(T::from(is_odd));
   let carry_bit = underflowed ^ overflowed_1 ^ overflowed_2;
   let result = divide_by_2_with_carry(
    result,
    carry_bit
   );
   result
  };
  let b = cneg(x, !is_odd) + y + T::from(!is_odd);
  a * b
 }
}

(bit ugly without comments, but if it gets in, I'll clean it up) I'll try discussing this on the Zulip soon, and seeing whether libs is interested in the change.

ghost commented 2 years ago

Never got around to discussing on Zulip; closing in favour of letting this be fixed by LLVM auto-vectorisation of inclusive ranges instead, whenever that happens.

See https://github.com/rust-lang/rust/issues/45222#issuecomment-363008094