tc39 / proposal-math-sum

TC39 proposal to add a summation method to JavaScript
https://tc39.github.io/proposal-math-sum

Deterministic sum #1

Open waldemarhorwat opened 7 months ago

waldemarhorwat commented 7 months ago

I propose making the sum be the deterministic mathematical sum of the inputs, rounded to the nearest double, breaking ties to even in the typical IEEE double fashion. The special cases are the same as for regular addition.

This can be implemented as follows, using the exact intermediate sum technique, which takes an arbitrary number of finite IEEE double addends a0, a1, …, a(n-1) and, barring intermediate or final overflows, represents their exact mathematical sum a0 + a1 + … + a(n-1) as s = p0 + p1 + … + p(k-1), where the p's are IEEE doubles that all have the same sign, are in descending magnitude order, are non-overlapping (each successive p's exponent is at least 53 less than the previous one), and are nonzero (except possibly p0).

  1. If there are no addends, the result is -0.
  2. If there is one addend a0, the result is a0.
  3. If there are two addends a0 and a1, the result is a0 + a1 using ordinary IEEE double arithmetic.
  4. If any of the addends are non-finite (±∞ or NaN), ignore all of the finite addends, and return the sum of the non-finite addends using ordinary IEEE double arithmetic done in any order.
  5. If all of the addends are -0, return -0 (depending on the implementation of exact intermediate sums, this case might fall out of the implementation of step 6 without needing to explicitly check for it.)
  6. Initialize an exact intermediate sum t = «p0, p1, …, p(k-1)» to the empty sequence t = «». Successively add addends into t using the exact intermediate sum technique above. To avoid false intermediate overflows, when picking the next addend to add into t, always prefer an addend with the opposite sign from t if any exist. This way any intermediate overflow that happens is a true overflow, and the mathematical sum will not be cancelled by later addends back into the finite IEEE double range.
  7. Round p0 + p1 + … + p(k-1) to an IEEE double by adding the p's using double arithmetic and return that result, taking special care to adjust the lsb in the case of a mathematical result exactly halfway between two representable doubles, where round-to-even applies (see Python's math_fsum code for an explanation of this).

This is the simplest approach. The algorithm is asymptotically linear in the number of addends n and efficient, typically having k no more than 2.
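For intuition, the partial-sums machinery of steps 6–7 can be sketched in JavaScript roughly as below. This is a simplified, hedged illustration (modeled on the well-known msum/fsum recipe, not the proposal's actual algorithm): it omits the non-finite and -0 special cases, the opposite-sign addend selection that guards against intermediate overflow, and the exact round-to-even adjustment of step 7.

```javascript
// Sketch: exact summation via nonoverlapping partials (two-sum based).
// Simplified: no overflow handling, no -0/NaN/Infinity cases, naive final round.
function msum(values) {
  let partials = []; // nonoverlapping partials, ascending in magnitude
  for (let x of values) {
    let i = 0;
    for (let y of partials) {
      // two-sum: hi + lo === x + y exactly, provided |x| >= |y|
      if (Math.abs(x) < Math.abs(y)) [x, y] = [y, x];
      const hi = x + y;
      const lo = y - (hi - x);
      if (lo !== 0) partials[i++] = lo; // keep only nonzero partials
      x = hi;
    }
    partials.length = i;
    partials.push(x);
  }
  // Naive final rounding; step 7's halfway-case adjustment is omitted here.
  return partials.reduce((a, b) => a + b, 0);
}
```

Typically only two or three partials survive (the k ≤ 2 mentioned above), so the inner loop is cheap in practice.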

The steps can be folded into a mostly or fully single-pass algorithm. For example:

If one wishes to do a fully single-pass algorithm, one can avoid the scan for addends of the opposite sign from the running total by scaling down the most significant partial p0 (and values added into it) by 52 powers of 2 if it would otherwise overflow, while leaving the other partials p1, …, p(k-1) unscaled. This is a bit tedious, since it requires rescaling when working across the first and second partials, but it can be done efficiently.

waldemarhorwat commented 7 months ago

If the partial values «p0, p1, …, p(k-1)» can differ in sign, then it's possible to get obscure cases where the mathematical value about to be stored into p0 is on a knife edge that rounds to ±∞ while p1 would have the opposite sign that would cause the rounding to prefer the largest-magnitude finite value. This would cause a problem for the pick-next-addend-with-opposite-sign approach, but would be handled by the single-pass scaling-down-p0 approach.

bakkot commented 7 months ago

Here is an implementation using the single-pass scaling approach. This was indeed a bit tedious, but I think it's correct - I also wrote code to convert doubles to/from BigInts (scaled up by 2**1023), in which domain we can do addition/subtraction precisely, and used this to fuzz my implementation. This caught a few nasty edge cases. It's possible I might still have missed a case somewhere.

Note that, following Python, the partials are in ascending order of magnitude, so the "overflow" partial is conceptually last in this implementation, not first.

bakkot commented 7 months ago

In the original issue for adding fsum to Python, a version based on a different approach is attached:

In brief, the accumulated sum-so-far is represented in the form

```
huge_integer * 2**(smallest_possible_exponent)
```

and the huge_integer is stored in base 2**30, with a signed-digit representation (digits in the range [-2**29, 2**29)).

Because this approach doesn't use 2Sum, it doesn't have issues with overflow.
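The signed-digit representation is the interesting part. As a hedged sketch (my own illustration, not code from the patch), here is what decomposing an integer into base-2**30 digits in [-2**29, 2**29) looks like, using BigInt so the arithmetic is exact:

```javascript
// Sketch: signed-digit base-2**30 representation, as described above.
// Every integer has a unique expansion a0 + a1*B + a2*B**2 + ...
// with each digit in [-B/2, B/2).
const B = 1n << 30n;

function toSignedDigits(n) {
  const digits = [];
  while (n !== 0n) {
    let d = n % B; // BigInt % keeps the dividend's sign, so d is in (-B, B)
    if (d >= B / 2n) d -= B; // renormalize into [-B/2, B/2),
    else if (d < -B / 2n) d += B; // carrying/borrowing into the next digit
    digits.push(d);
    n = (n - d) / B; // exact: n - d is a multiple of B
  }
  return digits; // least significant digit first
}

function fromSignedDigits(digits) {
  return digits.reduceRight((acc, d) => acc * B + d, 0n);
}
```

Because digits can be negative, positive and negative summands are absorbed symmetrically, which is what makes the accumulator normalization in the C code below so simple.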

I've extracted it from the diff in the thread. I'm probably not going to bother implementing this in JS as well, but it does prove the possibility.

lsum.c

```c
/* Full precision summation of a sequence of floats.

   Based on the function lsum by Raymond Hettinger at , and following
   implementation suggestions from Tim Peters.

   Method: maintain the sum so far in the form

     huge_integer * 2**FSUM_MIN_EXP

   where FSUM_MIN_EXP is such that every finite double can be expressed
   as an integral multiple of 2**FSUM_MIN_EXP.  huge_integer is stored
   in base FSUM_BASE.  Following a suggestion from Tim Peters, we use a
   signed-digit representation: the digits for the base FSUM_BASE
   representation all lie in the range [-FSUM_BASE/2, FSUM_BASE/2).
   It's not hard to show that every integer (positive or negative) can
   be expressed uniquely in the form

     a0 + a1*FSUM_BASE + a2*FSUM_BASE**2 + ...

   This choice of representation makes it easy to deal with both
   positive and negative summands.

   Note: The signature of math.fsum() differs from __builtin__.sum()
   because the start argument doesn't make sense in the context of
   accurate summation.  sum(seq2, start=sum(seq1)) may not equal the
   accurate result returned by sum(itertools.chain(seq1, seq2)).
*/

#define FSUM_MIN_EXP (DBL_MIN_EXP - DBL_MANT_DIG)
#define FSUM_SIZE 30
#define FSUM_BASE ((long) 1 << FSUM_SIZE)

/* allow at least 60 extra bits at the top end, to cope with
   intermediate overflow.  On an IEEE 754 machine, FSUM_ACC_SIZE is 72. */
#define FSUM_ACC_SIZE ((DBL_MAX_EXP - FSUM_MIN_EXP + 60) / FSUM_SIZE + 1)

/* _fsum_twopower[i] is 2.0**(i+1) */
static const double _fsum_twopower[FSUM_SIZE] = {
    2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0, 1024.0,
    2048.0, 4096.0, 8192.0, 16384.0, 32768.0, 65536.0, 131072.0,
    262144.0, 524288.0, 1048576.0, 2097152.0, 4194304.0, 8388608.0,
    16777216.0, 33554432.0, 67108864.0, 134217728.0, 268435456.0,
    536870912.0, 1073741824.0,
};

static PyObject *
math_fsum(PyObject *self, PyObject *seq)
{
    PyObject *item, *iter;
    double x, m, inf_sum = 0.0, special_sum = 0.0;
    int e, q, q_top, i, round_up, ndigits, top_exp;
    long digit, half_eps, acc[FSUM_ACC_SIZE], carry, sign;

    iter = PyObject_GetIter(seq);
    if (iter == NULL)
        return NULL;

    /* initialize accumulator */
    for (i = 0; i < FSUM_ACC_SIZE; i++)
        acc[i] = 0;

    /********************************************
     * Stage 1: accumulate values from iterable *
     ********************************************/

    for (;;) {
        item = PyIter_Next(iter);
        if (item == NULL) {
            Py_DECREF(iter);
            if (PyErr_Occurred())
                return NULL;
            break;
        }
        x = PyFloat_AsDouble(item);
        Py_DECREF(item);
        if (PyErr_Occurred()) {
            Py_DECREF(iter);
            return NULL;
        }

        /* accumulate specials in special_sum, infs in inf_sum */
        if (!Py_IS_FINITE(x)) {
            if (Py_IS_INFINITY(x))
                inf_sum += x;
            special_sum += x;
            continue;
        }

        m = frexp(x, &e);
        e -= (FSUM_MIN_EXP + 1);

        /* add digits of m into accumulator */
        m *= _fsum_twopower[e % FSUM_SIZE];
        q_top = q = e / FSUM_SIZE;
        for (;;) {
            digit = (long) m;
            m -= digit;
            acc[q] += digit;
            if (m == 0.0)
                break;
            m *= (double) FSUM_BASE;
            q--;
        }

        /* normalize accumulator */
        for (;;) {
            if (acc[q] < -FSUM_BASE / 2) {
                acc[q] += FSUM_BASE;
                acc[++q]--;
            }
            else if (acc[q] >= FSUM_BASE / 2) {
                acc[q] -= FSUM_BASE;
                acc[++q]++;
            }
            else if (++q > q_top)
                break;
        }
    }

    /***************************************************
     * Stage 2: compute correctly rounded return value *
     ***************************************************/

    /* deal with any special values that occurred.  See note above. */
    if (special_sum != 0.0) {
        if (Py_IS_NAN(inf_sum)) {
            PyErr_SetString(PyExc_ValueError, "-inf + inf in fsum");
            return NULL;
        }
        return PyFloat_FromDouble(special_sum);
    }

    /* strip zero limbs from top */
    ndigits = FSUM_ACC_SIZE;
    while (ndigits > 0 && acc[ndigits - 1] == 0)
        ndigits--;
    if (ndigits == 0)
        return PyFloat_FromDouble(0.0);

    /* sign of result == sign of topmost nonzero limb */
    sign = acc[ndigits - 1] > 0 ? 1 : -1;

    /* take absolute value of accumulator, renormalizing digits to lie
       in the range [0, FSUM_BASE) */
    carry = 0;
    for (i = 0; i < ndigits; i++) {
        /* FSUM_BASE/2-1 <= acc[i]*sign + carry <= FSUM_BASE/2 */
        digit = acc[i] * sign + carry;
        if (digit < 0) {
            acc[i] = digit + FSUM_BASE;
            carry = -1;
        }
        else {
            acc[i] = digit;
            carry = 0;
        }
    }
    assert(carry == 0);
    /* it's possible that the normalization leads to a zero top digit */
    if (acc[ndigits - 1] == 0)
        ndigits--;
    assert(ndigits > 0 && acc[ndigits - 1] != 0);

    /* Round acc * 2**FSUM_MIN_EXP to the nearest double. */

    /* 2**(top_exp-1) <= |sum| < 2**top_exp */
    top_exp = FSUM_MIN_EXP + FSUM_SIZE * (ndigits - 1);
    for (digit = acc[ndigits - 1]; digit != 0; digit /= 2)
        top_exp++;
    /* catch almost all overflows here */
    if (top_exp > DBL_MAX_EXP)
        goto overflow_error;

    m = 0.0;
    if (top_exp <= DBL_MIN_EXP) {
        /* no need for rounding */
        for (i = ndigits - 1; i >= 0; i--)
            m = FSUM_BASE * m + acc[i];
        return PyFloat_FromDouble((double) sign * ldexp(m, FSUM_MIN_EXP));
    }

    /* round: e is the position of the first bit to be rounded away. */
    e = top_exp - DBL_MIN_EXP - 1;
    half_eps = (long) 1 << (e % FSUM_SIZE);
    q = e / FSUM_SIZE;
    for (i = ndigits - 1; i > q; i--)
        m = FSUM_BASE * m + acc[i];
    digit = acc[q];
    m = FSUM_BASE * m + (digit & (FSUM_BASE - 2 * half_eps));
    if ((digit & half_eps) != 0) {
        round_up = 0;
        if ((digit & (3 * half_eps - 1)) != 0 ||
            (half_eps == FSUM_BASE / 2 && (acc[q + 1] & 1) != 0))
            round_up = 1;
        else
            for (i = q - 1; i >= 0; i--)
                if (acc[i] != 0) {
                    round_up = 1;
                    break;
                }
        if (round_up == 1) {
            m += 2 * half_eps;
            if (top_exp == DBL_MAX_EXP &&
                m == ldexp((double) (2 * half_eps), DBL_MANT_DIG))
                /* overflow corner case: result before rounding is
                   within range, but rounded result overflows. */
                goto overflow_error;
        }
    }
    return PyFloat_FromDouble((double) sign *
                              ldexp(m, FSUM_MIN_EXP + FSUM_SIZE * q));

  overflow_error:
    PyErr_SetString(PyExc_OverflowError,
                    "fsum result too large to represent as a float");
    return NULL;
}
```
waldemarhorwat commented 7 months ago

> I also wrote code to convert doubles to/from BigInts (scaled up by 2**1023)

This actually scales up the double values by 2**1075, which is one factor of two too many: every resulting BigInt is even. The smallest representable nonzero double is 2**-1074 so it would be less confusing to scale up the values by 2**1074 unless for some reason you need every BigInt to be even.
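For reference, a conversion with the 2**1074 scaling suggested here can be sketched as follows. This is my own hypothetical helper (the name and the `RangeError` choice are not from the actual code), using the fact that every finite double is an integral multiple of 2**-1074:

```javascript
// Sketch: represent a finite double exactly as a BigInt, scaled up by 2**1074,
// so that addition/subtraction of doubles becomes exact BigInt arithmetic.
function doubleToScaledBigInt(x) {
  const buf = new DataView(new ArrayBuffer(8));
  buf.setFloat64(0, x);
  const bits = buf.getBigUint64(0);
  const sign = (bits >> 63n) ? -1n : 1n;
  const exp = (bits >> 52n) & 0x7ffn; // biased exponent field
  const frac = bits & 0xfffffffffffffn; // 52-bit mantissa field
  if (exp === 0x7ffn) throw new RangeError("not finite");
  // Subnormal: value is frac * 2**-1074, already at the target scale.
  if (exp === 0n) return sign * frac;
  // Normal: value is (2**52 + frac) * 2**(exp - 1075); scaled by 2**1074
  // that is (2**52 + frac) << (exp - 1).
  return sign * ((frac | (1n << 52n)) << (exp - 1n));
}
```

In this domain the familiar inexactness of double arithmetic becomes visible directly: the scaled BigInts for 0.1 and 0.2 do not add up to the scaled BigInt for 0.3.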

I also don't understand this snippet:

```js
let binary = magnitude.toString(2);
if (binary === 0) {…}
```

The if condition compares a string to a number using ===, which can never be true.

bakkot commented 7 months ago

> This actually scales up the double values by 2**1075, which is one factor of two too many: every resulting BigInt is even. The smallest representable nonzero double is 2**-1074 so it would be less confusing to scale up the values by 2**1074 unless for some reason you need every BigInt to be even.

Sorry, yes, 2**1075. I'd previously left the last bit as 0 because it let me write `2n ** (BigInt(exponent)) * significand`, which was easier for me to think about than having to correct the off-by-one, but I've tweaked it now in https://github.com/bakkot/proposal-math-sum/commit/5b69773911f94aba4321fd8a3f111ef76b44e023.

> I also don't understand this snippet:

Whoops, that's a leftover from an earlier version. You're right that it doesn't do anything now (and would be redundant with the binary.length <= 53 case immediately below it in any case). Removed.

bakkot commented 6 months ago

I've simplified the main implementation following this code from the original Python issue.

The implementation I had earlier was conceptually quite close to Shewchuk's algorithm: it added x into each partial in order, culminating in the "overflow" partial, handling overflows by allowing x to be biased and doing arithmetic on potentially-biased values. This is very easy to reason about, but kind of annoying to implement.

But it doesn't have to work like that. Instead, when overflow occurs during one of the "x + a partial" steps, you can reduce that sum by 2**1024 and add 2**1024 into the "overflow" partial immediately. That preserves all of the relevant properties (partials are non-overlapping and ascending in magnitude, etc.), and means you don't need to worry about arithmetic on biased values.
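One way to picture the adjustment is the sketch below. This is my own hedged illustration, not the actual implementation: it generalizes slightly (the exact sum of two doubles can exceed 2**1024, so up to two units of 2**1024 may need to move to the overflow partial), and the subtractions are arranged so every intermediate stays finite and exact.

```javascript
// Sketch: fast-two-sum with the overflow reduction described above.
// Precondition: |x| >= |y|.
const TWO_1023 = 2 ** 1023;

function twoSumWithOverflow(x, y) {
  let hi = x + y;
  if (Number.isFinite(hi)) {
    return { hi, lo: y - (hi - x), overflowUnits: 0 };
  }
  // Overflow implies both operands are huge, so halving each is exact,
  // and the fast-two-sum error at half scale doubles back exactly.
  const half = x / 2 + y / 2;
  const lo = (y / 2 - (half - x / 2)) * 2;
  // Whole multiples of 2**1024 destined for the "overflow" partial (+-1 or +-2).
  const overflowUnits = Math.round(half / TWO_1023);
  // Subtract the units one 2**1023 at a time at half scale; each step is exact
  // by a Sterbenz-style argument, and the final doubling cannot overflow.
  let adjusted = half - Math.sign(half) * TWO_1023;
  if (Math.abs(overflowUnits) === 2) adjusted -= Math.sign(half) * TWO_1023;
  hi = adjusted * 2;
  // Invariant: hi + lo + overflowUnits * 2**1024 equals x + y exactly.
  return { hi, lo, overflowUnits };
}
```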

waldemarhorwat commented 6 months ago

I'm trying to put together disparate sources to convince myself of the correctness of the theorems, specifically to make sure that the invariants assumed by a subsequent step actually match what the previous step claims to produce. Can you document the precise invariant that the partials are required to satisfy in the implementation? The details matter here.

The Python issue assumes that the invariant is that they're nonoverlapping and in increasing magnitude.

In addition to edge cases around overflows, some of what I'm looking at: "Nonoverlapping" does not mean that they're necessarily 53 bits apart; the binary values 11010000, 1000, 100, and 1.101 are nonoverlapping doubles, but some of the theorems in the papers require stronger input criteria which a sequence like that would violate. It's probably ok in this usage, but I need to check.


bakkot commented 6 months ago

I've been relying on Theorem 10 of the Robust Arithmetic paper, which is fairly straightforward and needs only the assumption that the partials are finite, nonoverlapping, and increasing in magnitude (and implicitly that two-sum doesn't overflow). Including a single final "overflow" partial whose low bit is 2**1024 obviously preserves the finite, nonoverlapping, increasing properties, and we can handle overflow explicitly. The algorithm fails once the "overflow" partial loses precision, at which point we throw.

The later theorems are much more complicated and require other properties, but aren't actually necessary for this implementation.

From what I can tell the paper doesn't cover producing a single sum from the expansion once you've added all the values in, but I convinced myself of the correctness of the procedure used in msum (namely, add the partials in descending order of magnitude, stopping once that loses precision) as follows. Because the partials are descending in magnitude, once you've started losing precision you know that the lowest bit of the value you've just added is at least a factor of 2**53 smaller in magnitude than the high bit of the current value of the sum. By the nonoverlapping property, the sum of all the remaining partials is strictly smaller in magnitude than the lowest bit of the value you've just added, so adding in all the remaining partials can't affect the sum except in the case that it would affect rounding, which is handled explicitly.
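The finalization procedure just described can be sketched as below. This is a hedged illustration modeled on CPython's fsum finalization loop, not the proposal's actual code; it assumes nonoverlapping partials in ascending magnitude order.

```javascript
// Sketch: round a list of nonoverlapping partials (ascending in magnitude)
// to a single double, handling the round-half-to-even tie via the tail.
function roundPartials(partials) {
  let i = partials.length;
  if (i === 0) return -0; // empty sum, per step 1 of the algorithm
  let hi = partials[--i];
  let lo = 0;
  while (i > 0) {
    const x = hi;
    const y = partials[--i];
    hi = x + y;
    lo = y - (hi - x); // exact error of this addition
    if (lo !== 0) break; // precision lost; the tail only matters for rounding
  }
  // If lo is exactly half an ulp, hi sits on a rounding tie; a tail with the
  // same sign as lo pushes the true value past the tie, so round that way.
  if (i > 0 && ((lo < 0 && partials[i - 1] < 0) ||
                (lo > 0 && partials[i - 1] > 0))) {
    const y2 = lo * 2;
    const x2 = hi + y2;
    if (y2 === x2 - hi) hi = x2; // 2*lo fits exactly: it really was a tie
  }
  return hi;
}
```

The tie check mirrors the "lsb adjustment" caveat in step 7 of the algorithm at the top of this thread: without it, a true value just above a halfway point could incorrectly round to even.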

waldemarhorwat commented 6 months ago

Just to double-check: the invariant is that the partials are nonzero, finite, nonoverlapping, and increasing in magnitude? (Theorem 10 allows zero-valued partials in input and output, while crsum assumes that the partials are nonzero and can produce incorrect results if zero-valued partials exist).

bakkot commented 6 months ago

Yes, though strictly speaking that's covered by "increasing in magnitude" (except for possibly the first partial, for which it doesn't matter if it's zero). Theorem 10 does not guarantee that you end up with a list of partials with the "nonzero" property, but that's easily achieved by discarding partials which are 0.

To be precise: I've been relying on a slightly modified version of Theorem 10, which looks like:

If you have an expansion e where all the partials are finite, nonoverlapping and increasing in magnitude, and the Two-Sum algorithm doesn't overflow, then applying the Grow-Expansion algorithm given e and b followed by a step which discards any partials which are 0 will produce a new expansion e' where all the partials are finite, nonoverlapping, and increasing in magnitude, and such that the mathematical sum of e' is equal to the mathematical sum of e and b.

bakkot commented 3 months ago

Incidentally there's been some research into faster approaches since Shewchuk '96; the fastest I've found is described in this post and given in code here.

Per the claims there, on medium-size and larger lists of doubles (~1k+ elements), you can get exact summation with a 4x slowdown over naive summation by using 67 64-bit values (536 bytes) of accumulator state. If you have a larger list (10k+ elements) and are willing to spend more memory, you can get exact summation with less than a 2x slowdown over naive summation at the cost of using 33 KB of memory.

radfordneal commented 2 months ago

Hi. I notice that my xsum algorithm and software are mentioned here. You may be interested to know that it's used by the xsum package for Julia (https://juliapackages.com/p/xsum).

You may also be interested in the following recent paper (which I haven't read in full detail, but which looks interesting):

Lange, M., Towards accurate and fast summation, https://web.archive.org/web/20220616031105id_/https://dl.acm.org/doi/pdf/10.1145/3544488

One advantage of superaccumulators, as used in my method, is that one can easily give the correct result when it's a finite double even if naive summation would produce overflow for an intermediate result. One just needs enough bits in the superaccumulator to handle summing the largest possible double the maximum conceivable number of times. In my code, I think I assume a maximum of 2^32 operands, but one can allow for more by just adding a few more bits to the superaccumulator.