python / cpython

The Python programming language
https://www.python.org

Remove limits on the builtin len() function where possible. #94937

Open rhettinger opened 2 years ago

rhettinger commented 2 years ago

Problem

Pure Python __len__ methods can return values larger than sys.maxsize:

from sys import maxsize

class A:
    def __len__(self):
        return maxsize * 2

>>> A().__len__()
18446744073709551614

However, the builtin len() function unnecessarily fails:

>>> len(A())
Traceback (most recent call last):
  ...
OverflowError: cannot fit 'int' into an index-sized integer

The builtin range() type added support for ranges larger than sys.maxsize. Larger indices work; negative indices work; forward iteration works; reverse iteration works; and attribute access works:

>>> s = range(maxsize*10)
>>> s[maxsize * 2]
18446744073709551614
>>> s[-1]
92233720368547758069
>>> next(iter(s))
0
>>> next(reversed(s))
92233720368547758069
>>> s.start, s.stop, s.step
(0, 92233720368547758070, 1)

However, len() unnecessarily fails:

>>> len(s)
Traceback (most recent call last):
   ...
OverflowError: Python int too large to convert to C ssize_t

The random.sample() and random.choice() functions both depend on the builtin len() function, so they unnecessarily fail when used with large range objects or with large user defined sequence objects. Users have reported this issue on multiple occasions. We closed those issues because there was no practical way to fix them short of repairing the builtin len() function:

>>> import random

>>> random.choice(range(maxsize * 5))
Traceback (most recent call last):
  ...
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/random.py", line 371, in choice
    return seq[self._randbelow(len(seq))]
OverflowError: Python int too large to convert to C ssize_t

>>> random.sample(range(maxsize * 5), k=10)
Traceback (most recent call last):
  ...
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/random.py", line 438, in sample
    n = len(population)
OverflowError: Python int too large to convert to C ssize_t

Proposal

Make the builtin len() function smarter. Let it continue to first try the C PyObject_Size() function, which is restricted to Py_ssize_t. Then add two new fallbacks: one for range objects, and one that calls the __len__ method through the C API so that arbitrarily large lengths can be returned.

Rough sketch:

def builtin_len(obj):
    # Fast path: the existing size protocol, limited to Py_ssize_t.
    try:
        ssize_t_int = PyObject_Size(obj)
        return PyLong_FromSsize_t(ssize_t_int)
    except OverflowError:
        pass
    # Fallback 1: compute a range object's length with exact int arithmetic.
    if isinstance(obj, range):
        start, stop, step = obj.start, obj.stop, obj.step
        assert step != 0
        if step > 0:
            return (stop - start + step - 1) // step
        return (start - stop - step - 1) // -step
    # Fallback 2: call __len__ directly so it can return an arbitrarily large int.
    return PyObject_CallMethod(obj, '__len__', NULL)
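
A quick standalone check of the length arithmetic used in the sketch (my own snippet, not part of the proposal). The max(0, ...) clamp only matters for empty ranges, which never reach the fallback above because PyObject_Size() already handles them:

from sys import maxsize

def range_len(r):
    # Same arithmetic as the range fallback in the sketch, in plain Python.
    start, stop, step = r.start, r.stop, r.step
    if step > 0:
        return max(0, (stop - start + step - 1) // step)
    return max(0, (start - stop - step - 1) // -step)

assert range_len(range(0, 10, 3)) == len(range(0, 10, 3)) == 4
assert range_len(range(10, 0, -3)) == len(range(10, 0, -3)) == 4
assert range_len(range(5, 1)) == 0
assert range_len(range(maxsize * 10)) == maxsize * 10   # len() would raise OverflowError here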

Bug or Feature Request

Traditionally, extending support for sizes beyond Py_ssize_t has been considered a new feature; range() and itertools.count() are examples.

In this case though, it is arguably a bug because the range() support was only 90% complete, leaving off the ability to call len(). It could also be considered a bug because users could always write a __len__ method returning values larger than Py_ssize_t and access that value with obj.__len__(), yet the len() function inexplicably failed due to an unnecessary and implementation-dependent range restriction.

One other thought: maxsize varies across builds, so it is easy for code to be tested and working on one Python build and then fail on another. All 32-bit builds are affected, as are all Windows builds.

It would be easy for us to remove the artificial limitation for range objects and for objects that define __len__ directly rather than through sq_length or mp_length. That includes all pure Python classes and any C classes that want to support large lengths.

brandtbucher commented 2 years ago

A related issue is that len calls are frustratingly inefficient for pure-Python __len__ functions.

In the vast majority of cases (__len__ returns an exact int), we still end up unboxing and reboxing the value just to cross API boundaries, which is the source of the range problem here.
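
A rough, runnable model of that round-trip (illustrative only; as_ssize_t is a hypothetical stand-in for the C-level conversion, not a real API):

import sys

def as_ssize_t(n):
    # Hypothetical stand-in for the C-level conversion to Py_ssize_t.
    if not (-sys.maxsize - 1 <= n <= sys.maxsize):
        raise OverflowError("cannot fit 'int' into an index-sized integer")
    return n

def modeled_len(obj):
    n = type(obj).__len__(obj)   # __len__ builds a Python int object
    ssize = as_ssize_t(n)        # the slot machinery unboxes it to Py_ssize_t
    return int(ssize)            # len() immediately reboxes it for the caller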

A len call where __len__ returns an exact int should, ideally, perform the super-cheap negative value check and nothing more. It would be great if whatever solution we come up with here is able to improve this situation as well.

rhettinger commented 2 years ago

Good point Brandt. It would be great if we could address both of these issues at the same time.

IIRC, Guido said that len() and isinstance() are the two most commonly called functions in Python.

brandtbucher commented 2 years ago

As a starting point, I like the interface of nb_index. It is expected to return an exact int.

Quarter-baked idea: add sq_py_length and mp_py_length slots. They are expected to return nonnegative int objects.

As an alternative, we could also just have len start calling __len__ directly. This is easier to explain, but likely comes with non-negligible overhead when called on types defined in C (probably the common case).
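
A Python-level model of that alternative (my own sketch; the real change would be in the C implementation of len()):

def direct_len(obj):
    # Look __len__ up on the type, as the slot machinery does, and keep the
    # result as a Python int -- no round-trip through Py_ssize_t.
    result = type(obj).__len__(obj)
    if type(result) is not int:
        raise TypeError("__len__ should return an exact int")
    if result < 0:
        raise ValueError("__len__() should return >= 0")
    return result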

rhettinger commented 2 years ago

I was originally thinking of leaving PyObject_Size() and the existing slots as they are now. Instead, add PyObject * PyObjectLen(PyObject *obj) to be called by the builtin len() function. It would first call PyObject_Size(), then fall back to a special case for range objects and to invoking __len__ when it is defined outside of the slot. The latter would work for any __len__ written in pure Python. Types implemented in C could add METH_COEXIST if they wanted a method that could be called directly and return a Python object.

Adding new slots would work as well and that might be cleaner. Historically, we've had a strong resistance to adding new slots, but the shared mentality and values are changing.

fpavetic commented 6 months ago

Hi, thanks for working on this! Has a decision been made on this proposal? Are there plans to include the fix in future releases?

ncoghlan commented 5 months ago

Note that this wasn't a compatibility problem when range() was converted to a virtual sequence in Py3, since the Py2 range() builtin would just die outright for sequences that long (lists larger than maxsize necessarily won't fit in RAM), and xrange() didn't support __len__ at all. The reported restriction is real, but the workaround at affected call sites is also straightforward: call r.__len__() when len(r) fails on the conversion to ssize_t.

This means the random module limitations could potentially be fixed without any length protocol or len() builtin enhancements by adding a fallback to call r.__len__() directly if len(r) fails with OverflowError.
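
A minimal sketch of that fallback (my own illustration; BigSeq is a hypothetical pure-Python sequence, the case where __len__ can return an arbitrarily large int):

from sys import maxsize

def seq_len(seq):
    # Fallback described above: use the builtin len() when the size fits in
    # Py_ssize_t, otherwise ask the object's __len__ directly.
    try:
        return len(seq)
    except OverflowError:
        return seq.__len__()

class BigSeq:
    def __len__(self):
        return maxsize * 5
    def __getitem__(self, i):
        return i

print(seq_len(BigSeq()))   # 46116860184273879035, where len(BigSeq()) alone raises OverflowError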

@rhettinger's proposal would essentially be standardising that workaround as part of len(), providing a convenient C API for it, and also an optimised path specifically for range() objects (all of which seem like reasonable suggestions to me).

It's also worth noting that using this fallback-on-overflow approach initially wouldn't preclude enhancing the slot-level protocol later. Both len() and the suggested C API would still work; they would just become more efficient internally (with benchmark performance deciding whether the extra slot protocol complexity was worth it).