python / cpython

The Python programming language
https://www.python.org

Improve accuracy of builtin sum() for float inputs #100425

Closed rhettinger closed 1 year ago

rhettinger commented 1 year ago

Currently sum() makes no efforts to improve accuracy over a simple running total. We do have math.fsum() that makes extreme efforts to be almost perfect; however, that function isn't well known, it runs 10x slower than regular sum(), and it always coerces to a float.

I suggest switching the builtin sum() handling of float inputs to Arnold Neumaier's improved variation of compensated summation. Per his paper, this algorithm has excellent error bounds (though not as perfect as fsum()'s):

|s - š| ≤ ε|s| + ε²(3/4n² + n)·Σ|aᵢ|               (IV,12)
|s - š| ≤ ε|s| + ε²(1/4n³ + 5/2n² + n)·Max|aᵢ|     (IV,13)
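
For reference, here is a pure-Python sketch of the Neumaier variant (illustrative only; the PR implements the equivalent logic in C inside the builtin):

def neumaier_sum(values, start=0.0):
    """Pure-Python sketch of Neumaier's compensated summation."""
    total = float(start)
    comp = 0.0                       # running compensation for lost low-order bits
    for x in values:
        t = total + x
        if abs(total) >= abs(x):
            comp += (total - t) + x  # low-order bits of x were lost when rounding t
        else:
            comp += (x - t) + total  # low-order bits of total were lost instead
        total = t
    return total + comp              # apply the compensation once, outside the loop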

The compensation tracking runs in parallel to the main accumulation. And except for pathological cases, the branch is predictable, making the test essentially free. Thanks to the magic of instruction level parallelism and branch prediction, this improvement has zero cost on my Apple M1 build. Timed with:

./python.exe -m timeit -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'sum(d)'

N.B. Numpy switched from a simple running total to partial pairwise summation. That isn't as accurate as what is being proposed here, but it made more sense for them because the extra work of Neumaier summation isn't masked by the overhead of fetching values from an iterator as we do here. Also with an iterator, we can't do pairwise summation without using auxiliary memory.

Linked PRs

mdickinson commented 1 year ago

Very nice! It's surprising how little code it takes to make this work.

A few questions:

On performance, I'm not seeing the "zero cost" on macOS / Intel. Here are some timings on my machine (comparing commit af7613ebcf1d118291320f07e9fab8100266c13c on this branch - "cpython" below - with commit e0b4d966a8d1867c4b535b043e08288ca49b3548 on main - "cpython-compare" below); both clones built with --enable-optimizations. tl;dr - this is around 25% slower on my machine.

mdickinson@lovelace python % ./cpython/python.exe -m timeit -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'sum(d)'
5000 loops, best of 5: 44.9 usec per loop
mdickinson@lovelace python % ./cpython/python.exe -m timeit -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'sum(d)'
5000 loops, best of 5: 47.6 usec per loop
mdickinson@lovelace python % ./cpython/python.exe -m timeit -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'sum(d)'
5000 loops, best of 5: 45 usec per loop
mdickinson@lovelace python % ./cpython-compare/python.exe -m timeit -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'sum(d)'
10000 loops, best of 5: 35.3 usec per loop
mdickinson@lovelace python % ./cpython-compare/python.exe -m timeit -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'sum(d)'
10000 loops, best of 5: 34.5 usec per loop
mdickinson@lovelace python % ./cpython-compare/python.exe -m timeit -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'sum(d)'
10000 loops, best of 5: 34.4 usec per loop

Here's another test case that involves a decent amount of cancellation - we're computing (the imaginary part of) the Gauss sum for the prime 609359; mathematically, it should give exactly √609359. Here's the code:

import fractions
import math
import sys
import timeit

# Test case: Gauss sum for the prime 609359; mathematically, result
# should be exactly sqrt(p).
p = 609359
terms = [math.sin(2*math.pi*r*r / p) for r in range(p)]

# Exact sum of the actual terms we're using, and a way to compute the
# ulps error.
true_sum = sum(map(fractions.Fraction, terms))

def ulps_error(x: float) -> float:
    return (fractions.Fraction(x) - true_sum) / math.ulp(true_sum)

# Error incurred by sum.
error = ulps_error(sum(terms))

# Timing
number = 1000
time = timeit.timeit(stmt="sum(terms)", globals=globals(), number=number)

print(sys.version)
print(f"error: {error} ulps")
print(f"time: {1000.0 * time/number:.3f} ms per run")

Results on my machine show around a 20% slowdown:

mdickinson@lovelace python % ./cpython-compare/python.exe ~/Desktop/compensated.py
3.12.0a3+ (tags/v3.12.0a3-114-ge0b4d966a8:e0b4d966a8, Dec 22 2022, 13:23:36) [Clang 14.0.0 (clang-1400.0.29.202)]
error: 58.52302999049425 ulps
time: 3.066 ms per run
mdickinson@lovelace python % ./cpython/python.exe ~/Desktop/compensated.py        
3.12.0a3+ (heads/compensated_sum:af7613ebcf, Dec 22 2022, 13:34:38) [Clang 14.0.0 (clang-1400.0.29.202)]
error: -0.47697000950574875 ulps
time: 3.799 ms per run
mdickinson@lovelace python % ./cpython-compare/python.exe ~/Desktop/compensated.py
3.12.0a3+ (tags/v3.12.0a3-114-ge0b4d966a8:e0b4d966a8, Dec 22 2022, 13:23:36) [Clang 14.0.0 (clang-1400.0.29.202)]
error: 58.52302999049425 ulps
time: 3.063 ms per run
mdickinson@lovelace python % ./cpython/python.exe ~/Desktop/compensated.py        
3.12.0a3+ (heads/compensated_sum:af7613ebcf, Dec 22 2022, 13:34:38) [Clang 14.0.0 (clang-1400.0.29.202)]
error: -0.47697000950574875 ulps
time: 3.789 ms per run

On pairwise summation: agreed that this (i.e., compensated summation) seems superior for our use-case, though it might be interesting to code up pairwise summation just to see what the performance tradeoffs look like. It doesn't need much in the way of auxiliary memory: at most a few hundred bytes for maintaining a stack of partial values of size O(log_2(length)). E.g., in Python:

def _ruler_seq():
    """Sequence 0, 1, 0, 2, 0, 1, 0, 3, ... (A007814)"""
    i = 0
    while True:
        yield (~(i + 1) & i).bit_length()
        i += 1

def sum_pairwise(iterable, /, start=0.0):
    """Sum of an iterable of floats, via pairwise summation."""
    partials = [start]
    for x, pop_count in zip(iterable, _ruler_seq()):
        for _ in range(pop_count):
            x += partials.pop()
        partials.append(x)
    return sum(reversed(partials))

(The _ruler_seq logic would of course be much more streamlined in C.)

rhettinger commented 1 year ago

If this went in, would it be considered a CPython implementation detail, or would other implementations be expected to follow suit?

What are your thoughts on this? I was thinking that it has to be an implementation detail because it is so easily destroyed by double rounding environments or faulty compilers that optimize away the compensation. That said, I would hope that other implementations would do this or something similar. It doesn't take much effort.

What level of breakage would we consider acceptable here?

Fortunately, we have a precedent here. When numpy switched to pairwise summation, the world was net better off and seemed happy to fix a few broken tests in exchange for better accuracy. Another precedent occurred when the floating point repr changed to display the shortest decimal representation from the equivalence class. Again, the world was net happier for the problems that it no longer had and was happy to fix a few broken tests.

FWIW the PR also causes "unbreakage". Code that is normalizing a list (i.e. converting values to percentages) will work better without the user having to make changes: t = sum(data); percentages = [x * 100.0 / t for x in data]. Commutativity problems will fade, etc. In general, it is nice to have numbers add up correctly most of the time. The notion of "worse is better" doesn't apply here. Part of Matlab's success is that so many subtle numeric issues were handled under-the-hood in sophisticated ways.

Results on my machine show around a 20% slowdown

Thanks for running a timing on Intel silicon. That is about what I had expected, and it seems like a reasonable price. The Apple M1 timings were super fast but perplexing. The sum of 10k random values took only 25.2 usec with the PR and 28.2 usec without the PR. Presumably, we're working right at the limit of optimization and parallelization, where any change can knock it in and out of local minima.

Did you try timing the variant that uses 2Sum in place of Fast2Sum for computing the individual errors?

I didn't but will look at it. I first tried straight Kahan summation and then the Neumaier variant. When the latter proved to be both faster (because it broke up the data dependency chain) and more accurate, I was pleasantly surprised, took the win, and stopped exploring.

Also, I looked at the generated assembly and there isn't much fat to be cut. The fabs is a single fast instruction. The "branch" is just a predicated instruction so we're effectively branchless. The additions and subtractions for the compensated error calculation aren't on the critical path because they don't have an immediate data dependency (the compensation gets applied outside the loop). As long as the processor has more than one floating point port, that compensation calculation happens in parallel and is basically free. Our only expensive steps are the PyIter_Next() and a _Py_DECREF_SPECIALIZED call. The actual addition is the cheapest part of the overall operation. So I didn't feel the need to do further algorithmic explorations. That said, the story may be a bit different on Intel silicon.

LBB61_38:                       ;   in Loop: Header=BB61_22 Depth=1             
    ldr d0, [x21, #16]  ; ./Include/cpython/floatobject.h:16:31   <-- Float unboxing runs in parallel with previous float computation
    fadd    d10, d8, d0     ; Python/bltinmodule.c:2553:30      <-- Only this is sequentially dependent between loops
    fabs    d1, d8          ; Python/bltinmodule.c:2554:21      <-- 2 cycle latency; 4 cycle throughput
    fabs    d2, d0                 
    fsub    d3, d8, d10     ; Python/bltinmodule.c:2554:21
    fadd    d3, d0, d3
    fsub    d0, d0, d10
    fadd    d0, d8, d0
    fcmp    d1, d2
    fcsel   d0, d0, d3, lt   ; <-- predicated instruction instead of a branch
    fadd    d9, d9, d0
    ;DEBUG_VALUE: _Py_DECREF_SPECIALIZED:op <- $x21            
    ldr x8, [x21]        ; The decref step runs in parallel with float computations and is not dependent on their outcome
    subs    x8, x8, #1
    str x8, [x21]
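
For reference, a sketch (not taken from the PR) of the two error-free transformations mentioned above: Fast2Sum needs the |a| >= |b| precondition, which is what the comparison and fcsel implement, while 2Sum trades the comparison for a few extra additions and subtractions:

def fast2sum(a, b):
    """Dekker's Fast2Sum: return (a + b, rounding error), assuming |a| >= |b|."""
    s = a + b
    err = b - (s - a)
    return s, err

def twosum(a, b):
    """Knuth's 2Sum: return (a + b, rounding error) with no magnitude precondition."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err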

Would we need to keep a way to get the naive, uncompensated sum? It's a need that I've occasionally had, and I've written code that relies on sum doing the obvious thing. I'll admit that it's likely not a common use-case, and people who need it can still do functools.reduce(operator.add, values, start).

I can mention the reduce alternative in the docs, so a user won't have to think of this on their own. That said, the numpy precedent tells us that this isn't a big concern: we don't see people asking how to bypass np.sum(), nor has there been a need to document how to run consecutive additions in a loop.
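
A minimal illustration of that bypass (the data values below are purely illustrative):

import functools
import operator

values = [1e16, 1.0, 1.0]                            # illustrative data
naive = functools.reduce(operator.add, values, 0.0)  # plain left-to-right float addition
compensated = sum(values)                            # 3.12+: Neumaier-compensated; the two can differ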

mdickinson commented 1 year ago

Thanks for the responses and the code updates for infinity and overflow. The code LGTM, and I really like the feature.

Again, the world was net happier for the problems that it no longer had and was happy to fix a few broken tests.

Agreed; this seems manageable.

The only remaining question for me is whether the extra accuracy is worth the performance hit. It would be useful to get some numbers on Linux/x64 and Windows/x64 (and maybe ARM, too).

I'm +1 on making this trade-off. (I'm personally not too bothered by Python's sum performance for large lists of floats, since I'd almost always be using NumPy instead of Python in cases where performance is likely to matter.) But I don't know how others might feel. Worth getting opinions from the faster-cpython folks and the scientific community?

mdickinson commented 1 year ago

I was thinking that it has to be an implementation detail [...]

SGTM; I'd expect that PyPy at least would want to match the new behaviour, but that's up to them. I'd also want to leave the door open for implementations to use some other flavour of compensated summation if that works better for them, for whatever reason.

mdickinson commented 1 year ago

I'd also want to leave the door open for implementations to use some other flavour of compensated summation if that works better for them

As a data point, it looks as though Java's DoubleSummaryStatistics currently implements plain old Kahan compensated summation (though the docs don't commit to any particular method being used), so a JVM-based version of Python could potentially want to use that.
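
For comparison, a sketch of that plain Kahan scheme (illustrative, not Java's actual code); unlike the Neumaier variant above, the compensation is folded back into the next incoming term on every iteration:

def kahan_sum(values, start=0.0):
    """Sketch of classic Kahan compensated summation."""
    total = float(start)
    comp = 0.0                  # running compensation term
    for x in values:
        y = x - comp            # re-apply the previously lost low-order bits
        t = total + y
        comp = (t - total) - y  # rounding error introduced by this addition
        total = t
    return total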

pochmann commented 1 year ago

If I'm not mistaken, you're not using c if the floats/ints are followed by some other type, e.g., complex. Is that intentional?

I mean here:

https://github.com/python/cpython/blob/4da0b8ff2e83ecea46bab85d56850851ea3c7eec/Python/bltinmodule.c#L2575

rhettinger commented 1 year ago

@mdickinson Okay. Marked as an implementation detail. We and other implementations are free to switch methodologies at any time.

@pochmann I'm not sure that it matters, but for consistency, I added c on the other path as well. Thanks for noticing.

CAM-Gerlach commented 1 year ago

FWIW, testing on Windows on an Ivy Bridge i7-3630QM 4c/8t laptop downclocked to 2.3 GHz, I get the following results with MSC v.1916 x64 (the Python under the test worktree is this branch, while main is the tip of the current main branch, which I assume is essentially the merge base here). I built with the default PCbuild/build.bat in the x64 Release configuration; I didn't see any other optimization options besides PGO.

Testing the first set of @mdickinson 's timeit calls, I get a 33.6%, 33.7% and 29.2% slowdown with this PR vs main:

$ PCbuild/amd64/python.exe -c "import sys; print(sys.version)"
3.12.0a3+ (heads/main-dirty:1ecfd1ebf1, Dec 23 2022, 16:48:43) [MSC v.1916 64 bit (AMD64)]
$ PCbuild/amd64/python.exe -m timeit -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'sum(d)'
5000 loops, best of 5: 53.6 usec per loop
$ PCbuild/amd64/python.exe -m timeit -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'sum(d)'
5000 loops, best of 5: 53.1 usec per loop
$ PCbuild/amd64/python.exe -m timeit -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'sum(d)'
5000 loops, best of 5: 53.4 usec per loop
$ test/PCbuild/amd64/python.exe -c "import sys; print(sys.version)"
3.12.0a3+ (heads/compensated_sum-dirty:3cf9536ad7, Dec 23 2022, 16:51:47) [MSC v.1916 64 bit (AMD64)]
$ test/PCbuild/amd64/python.exe -m timeit -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'sum(d)'
5000 loops, best of 5: 71.6 usec per loop
$ test/PCbuild/amd64/python.exe -m timeit -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'sum(d)'
5000 loops, best of 5: 71 usec per loop
$ test/PCbuild/amd64/python.exe -m timeit -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'sum(d)'
5000 loops, best of 5: 69 usec per loop

And testing @mdickinson 's compensated.py script, I get a 35% slowdown:

$ PCbuild/amd64/python.exe compensated.py
3.12.0a3+ (heads/main-dirty:1ecfd1ebf1, Dec 23 2022, 16:48:43) [MSC v.1916 64 bit (AMD64)]
error: 64.80813372135162 ulps
time: 3.568 ms per run
$ test/PCbuild/amd64/python.exe compensated.py
3.12.0a3+ (heads/compensated_sum-dirty:3cf9536ad7, Dec 23 2022, 16:51:47) [MSC v.1916 64 bit (AMD64)]
error: -0.19186627864837646 ulps
time: 4.804 ms per run

For what it's worth, on Python 3.11.0 with the latest NumPy 1.24.0 and dependencies (via Conda-Forge), np.sum on a NumPy array is around 2.7x faster than the builtin sum on a list, both with 10 000 elements generated as above (sum gives very similar results to the above, and the two are also similar on Python 3.9 with older NumPy). The difference grows to around 5x at 1 000 000 elements.

$ python -m timeit -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'sum(d)'
5000 loops, best of 5: 52.8 usec per loop
$ python -m timeit -s 'from random import expovariate as r' -s 'import numpy as np' -s 'd=np.array([r(1.0) for i in range(10_000)])' 'np.sum(d)'
20000 loops, best of 5: 19.8 usec per loop
$ python -m timeit -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(1_000_000)]' 'sum(d)'
50 loops, best of 5: 5.71 msec per loop
$ python -m timeit -s 'from random import expovariate as r' -s 'import numpy as np' -s 'd=np.array([r(1.0) for i in range(1_000_000)])' 'np.sum(d)'
200 loops, best of 5: 1.16 msec per loop

In general, as a member of the scientific Python community (but merely one single datapoint), I would likewise typically be using NumPy arrays (or data structures built from them, e.g. DataFrames, xarrays, tensors, etc.) for at least the great majority of cases where performance is likely to matter when summing many elements. But I certainly can't speak for everyone in the general case, and it seems worth gathering more datapoints (both benchmark results and use cases) before introducing a potential ≈20-35% regression in the performance of a key builtin on a fundamental datatype.

hauntsaninja commented 1 year ago

Mark had asked for more performance numbers. On the microbenchmark, I also see about a 30% slowdown on Linux with an AMD Zen 2 chip, built with --enable-optimizations:

~/cpython 3e46f9fe05 λ ./python -m timeit -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'sum(d)'
5000 loops, best of 5: 55 usec per loop
# Ran 5 times, saw min 55, max 55.4, avg 55.2
~/cpython 1ecfd1ebf1 λ ./python -m timeit -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'sum(d)'
10000 loops, best of 5: 37.6 usec per loop
# Ran 5 times, saw min 37.3, max 37.8, avg 37.6

(I don't have an opinion on the tradeoff. The scientific computing I do at work is mostly not on CPUs. It looks like Raymond already considered the partial pairwise thing that numpy does)

gpshead commented 1 year ago

FWIW I think the trade-off here is fine, people who really want sum performance on a huge quantity of floats should be using numpy.

CAM-Gerlach commented 1 year ago

Just for reference, here are results for math.fsum() on the same machine, demonstrating that it is nearly an order of magnitude (close to 10x, prior to this change) slower than the sum() builtin on the same microbenchmark as above:

$ PCbuild/amd64/python.exe -c "import sys; print(sys.version)"
3.12.0a3+ (heads/main-dirty:1ecfd1ebf1, Dec 23 2022, 16:48:43) [MSC v.1916 64 bit (AMD64)]
$ PCbuild/amd64/python.exe -m timeit -s 'from math import fsum' -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'fsum(d)'
500 loops, best of 5: 559 usec per loop
$ PCbuild/amd64/python.exe -m timeit -s 'from math import fsum' -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'fsum(d)'
500 loops, best of 5: 470 usec per loop
$ PCbuild/amd64/python.exe -m timeit -s 'from math import fsum' -s 'from random import expovariate as r' -s 'd=[r(1.0) for i in range(10_000)]' 'fsum(d)'
500 loops, best of 5: 464 usec per loop

tim-one commented 10 months ago

There is no negative exponent here. It's 1e16, not 1e-16. The question is which one of these two adjacent representable floats the sum rounds to:

>>> float(10**16 + 2).hex() # single rounding
'0x1.1c37937e08001p+53'
>>> float(10**16 + 4).hex() # double rounding
'0x1.1c37937e08002p+53'
franklinvp commented 2 months ago

This didn't need to go into the implementation of a built-in function applied to a basic type.

To me this is giving people what they want (not being surprised that floats are not the real numbers) versus what they need (to learn that fact and, if they need operations with more precision, or as if there were more precision, or with a correction, to look for a function that explicitly does that).

Now sum() on floats is no longer adding them with the addition that the type comes with.

franklinvp commented 2 months ago

Some consequences of the implementation

class R(float): pass
sum([0.0, 0.0, float(2**53), 1.0, 1.0, 0.0, 0.0])  # -> 9007199254740994.0 All the sums used the correction
sum([R(0.0), 0.0, float(2**53), 1.0, 1.0, 0.0, 0.0])  # -> 9007199254740994.0 Essentially all the sums used the correction
sum([0.0, R(0.0), float(2**53), 1.0, 1.0, 0.0, 0.0])  # -> 9007199254740992.0 Ordinary sum of float for pretty much the entire sum
sum([0.0, 0.0, float(2**53), 1.0, 1.0, 0.0, R(0.0)])  # -> 9007199254740994.0 Nearly all sums used the correction

Could the built-in function sum be kept with the simple definition "adding from left to right with the addition the items define", leaving specialized algorithms to specialized functions?

The "however, that function isn't well known" objection regarding fsum could be addressed by a note in sum's documentation suggesting fsum, numpy.sum, or a function implementing this Kahan-style compensation.

This, as opposed to having sum's documentation carry only this note:

Changed in version 3.12: Summation of floats switched to an algorithm that gives higher accuracy on most builds.

That hides a lot of detail of what it really does, and making the description accurate would add quite a bit of complexity: "Adds floats from left to right, using a compensated sum as long as the items (except possibly the first) are floats. If a non-float appears beyond the first item, the sum from that point on uses the items' own addition. If the sum of the first item and start is not a float, the compensated sum also stops; but if that sum is a float, the compensated sum is used."

Also, from a didactic point of view, letting people face the peculiarities of adding floats forces them to learn them. Then, if they choose a better algorithm, having it in a different function makes the code explicit about that choice.

skirpichev commented 2 months ago

@franklinvp, I think this issue already contains enough arguments for this feature of sum(). See also #111933. If you really want to revive the discussion, you should probably start a thread on Discourse (https://discuss.python.org/).

BTW, it's not just "adding from left to right with the addition the items define". Here is the docstring:

    Return the sum of a 'start' value (default: 0) plus an iterable of numbers

    When the iterable is empty, return the start value.
    This function is intended specifically for use with numeric values and may
    reject non-numeric types.

Sphinx docs:

Sums start and the items of an iterable from left to right and returns the total.
The iterable’s items are normally numbers, and the start value is not
allowed to be a string.

Neither specifies how to compute the sum of numbers, i.e. this shouldn't just be an equivalent of functools.reduce(operator.add, xs): these details are (and should be!) under the hood. If addition is associative, the difference doesn't matter. But for floating-point numbers it isn't, so we should choose a useful definition for the sum of numbers, not just pick one arbitrary order of parentheses.

pochmann3 commented 2 months ago

If addition is associative, the difference doesn't matter

Associativity isn't enough; think for example of string addition, which is associative even though "1" + "2" differs from "2" + "1". Or a case with floats where (a+b)+c = a+(b+c) but (a+c)+b differs:

a = c = 1e308
b = -a
print((a + b) + c)
print(a + (b + c))
print((a + c) + b)

Output (Attempt This Online!):

1e+308
1e+308
inf
franklinvp commented 2 months ago

@pochmann3 Be careful

with floats where (a+b)+c = a+(b+c)

that is not always satisfied.

(float(2**53) + float(1)) + float(1) == float(2**53) + (float(1) + float(1))  # False

@skirpichev #111933 is an example of what the new sum caters to: people who see float arithmetic as defective. That ticket simply had the wrong expectation of what sum was (they expected associativity), which need not be expected of the new sum.

The documentation of sum does say "from left to right" explicitly.

And the explanation that it has for float is not adequate.

It is true that the documentation does not specify which addition is performed from left to right. It is bizarre that for floats the new sum sometimes uses one addition and sometimes another.

class R(float): pass
# Items after `R(0.0)` are added with plain float addition; items before it use the compensated sum.
sum([float(2**53), float(1.0), float(1.0), R(0.0), float(2**53), float(1.0), float(1.0)])
skirpichev commented 2 months ago

think for example of string addition, which is associative but "1" + "2" differs from "2" + "1".

Well, I'm assuming numeric values ("This function is intended specifically for use with numeric values and may reject non-numeric types."), i.e. commutative addition. Strings are banned by sum(), but there is a way:

>>> sum([[1], [2]], start=[3])
[3, 1, 2]
>>> sum([[2], [1]], start=[3])
[3, 2, 1]

a case with floats where (a+b)+c = a+(b+c) but (a+c)+b differs

That's just another example of missing associativity: a+(b+c) is equal (by commutativity) to a+(c+b), which in general differs from (a+c)+b.

(float(2**53) + float(1)) + float(1)

Let's take this example. sum() now returns 9007199254740994.0 regardless of the order of the elements, which is the exact value of sum([2**53, 1, 1]), while (float(2**53) + 1.) + 1. is 9007199254740992.0 (the ulps error is 1, not zero).
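
For concreteness, the two results quoted above can be reproduced interactively on CPython 3.12+:

>>> sum([float(2**53), 1.0, 1.0])   # compensated: exact result
9007199254740994.0
>>> (float(2**53) + 1.0) + 1.0      # plain left-to-right float addition
9007199254740992.0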

The documentation of sum does say "from left to right" explicitly.

Looking at the implementation, I don't see how that is violated. The code does sum "from left to right"; it's just not a simple result + item.

# Items after R(0.0) are added with plain float addition; items before it use the compensated sum.

Strictly speaking, R(0.0) is not a float, but an instance of a float subclass.

The specialized code uses Py*_CheckExact() tests (roughly, type(foo) is float); maybe that's a historical artifact and it's now safe to use just Py*_Check(). Such checks are more costly, however, especially for negatives.

franklinvp commented 2 months ago

Strictly speaking, R(0.0) is not a float, but an instance of a float subclass.

The items after it are floats and are added with the float type's own addition. The items before it are also floats, but are added with a different operation.

The reason for not implementing sum like this is simple. Some users are missing a feature, and there is a way for no one to miss anything. Among the many ways Python is used, two matter here: replicating results from other languages, and education, since Python looks so simple and is easy to read.

sum used to provide a terse, easy-to-read way to get the sum of floats associated from left to right, implemented in a lower-level language. The alternatives now, the for loop and functools.reduce, don't have those properties.

sum now partially hides some of the peculiarities of float arithmetic. Fewer people complaining about commutativity, I suppose, but a loss for education, since fewer people asking means fewer people learning.

Putting this compensated sum in some other function, nobody loses.

skirpichev commented 2 months ago

The items after it are floats and are added with the float type's own addition. The items before it are also floats, but are added with a different operation.

In short, the appearance of R(0.0) cancels the float-specific algorithm. This happened before too, as can be demonstrated by a benchmark:

sk@note:~ $ python3.11 -m timeit -s 'class F(float): pass' -s 'xs=[.1]*10' 'sum(xs)'
500000 loops, best of 5: 416 nsec per loop
sk@note:~ $ python3.11 -m timeit -s 'class F(float): pass' -s 'xs=[.1]*10;xs[3]=F(.1)' 'sum(xs)'
500000 loops, best of 5: 688 nsec per loop

But now also results differ:

sk@note:~ $ python3.12 -c 'xs=[.1]*10;print(sum(xs))'
1.0
sk@note:~ $ python3.12 -c 'xs=[0.1]*10;xs[3]=type("F", (float,), {})(0.1);print(sum(xs))'
0.9999999999999999

As I said above, perhaps the specialization could be extended to subclasses of builtin types. But it's another issue.

I don't see how the current behaviour violates the docs. It's your interpretation that sum() should behave like reduce(add, xs), but that's not what is actually documented.

The alternatives now, the for loop and functools.reduce, don't have those properties.

I don't see why someone would want a fast but inaccurate sum.

Putting this compensated sum in some other function, nobody loses.

I think we are repeating arguments that have already been given. In the issue referenced above, #111933, several core devs rejected the idea of restoring the pre-3.12 behaviour.

If you wish to change that, try opening a Discourse thread. BTW, at this point it would be a backward compatibility break, sorry.

franklinvp commented 2 months ago

This happened before too

Timing does not prove anything about which operation is used to add an F (a float subclass) to a float. A proof requires looking at the implementation.

But now also results differ:

Known. It is you who is repeating things.

perhaps the specialization could be extended to subclasses of builtin types

That makes the cons I described above worse. I hope they really don't do that. At least now there is a hacky way to make sum behave like a fast sum of floats using their own addition.

backward compatibility break

They clearly did not care about that, when they introduced this change. They just got overenthusiastic about how easy it was to add this compensated sum to the existing implementation.

Anyway, I don't wish for anything. People do whatever they want. They can point out a thousand reasons why this is a good change, and I agree with probably all of them. I am only pointing out that this change affects some uses of Python, when it is possible to add this algorithm without affecting anyone.

skirpichev commented 2 months ago

Timing does not prove anything [...] A proof requires looking at the implementation.

In this case the difference is too significant to be ignored. And of course you can verify the conclusion by looking at the sources.

At least now there is a hacky way to make sum behave like a fast sum of floats using their own addition.

I'm afraid it's not:

$ python3.11 -m timeit -s 'class F(float): pass' -s 'f=F(0.1); xs=[f]*10' 'sum(xs)'
500000 loops, best of 5: 788 nsec per loop

They clearly did not care about that, when they introduced this change.

I don't think so. Mark's first comment raised the question of breaking backward compatibility. A fast alternative for the uncompensated sum was also discussed.

Anyway, I don't wish for anything.

Then what are you trying to do here?

franklinvp commented 2 months ago

From 3.11

if (PyFloat_CheckExact(item)) {
    f_result += PyFloat_AS_DOUBLE(item);
    _Py_DECREF_SPECIALIZED(item, _PyFloat_ExactDealloc);
    continue;
}
gpshead commented 2 months ago

A long closed and implemented issue (this has already shipped) is not the place to be having a discussion. I'm going to lock the conversation here. If you want to discuss a feature to be implemented or believe there is a bug, bring it up on discuss.python.org or file a new issue.