pytoolz / toolz

A functional standard library for Python.
http://toolz.readthedocs.org/
Other
4.57k stars 258 forks source link

Faster check for isiterable #545

Closed groutr closed 1 year ago

groutr commented 1 year ago

Use hasattr to check if something is iterable or not by checking specifically for iter method. When the input is not an iterable, catching the exception is expensive.

ping: @eriknw

eriknw commented 1 year ago

Yeah, this is probably better. Thanks @groutr!

LincolnPuzey commented 1 year ago

@eriknw I think this should be reverted, the python documentation says:

The only reliable way to determine whether an object is iterable is to call iter(obj).

The previous code did that. The new code will fail for objects that iterate using the __getitem__() method

eriknw commented 1 year ago

I'm convinced. Thanks @LincolnPuzey! We should have been more careful.

groutr commented 1 year ago

@LincolnPuzey thank for pointing that out. This lead me down a long rabbit hole of understanding the sequence protocol more deeply. The original intention of this PR was to find a way to reduce the overhead of checking an if an object was iterable and try to iron out the relatively large performance hit when an object wasn't iterable. My mistake was thinking that the old sequence protocol fell out of fashion when Python 3 came around (the iterable sequence protocol has existed since before Python 2.1 when the iterator protocol was implemented). My encounters with sequence protocol, where __iter__ hasn't been defined, have been rare. However, as the documentation you linked to points out, not following the iterator protocol doesn't necessarily mean your object isn't iterable. The collections.abc.Iterable abstract class only checks for iterator protocol. There has been discussion about deprecating the sequence protocol behavior here, however, it seems that there is no universally good reason to remove that from Python.

Basically are three cases that can exist (AFAICT):

  1. Object implements a callable __iter__. These objects are iterable (and are an iterator if __next__ is defined).
  2. Object implements __getitem__. These objects can be iterated over using the sequence protocol (which will use getitem with increasing integers starting with 0).
  3. Object explicitly sets __iter__ = None. This is defined as not iterable. (I think this covers all the cases I could think of or find. If I'm missing any, please let me know).

Here is an alternative implementation:

def test_iterable(obj):
    xiter = getattr(obj, '__iter__', no_default)
    if xiter is None:
        return False
    elif xiter is no_default:
        return hasattr(obj, '__getitem__')
    elif callable(xiter):
        return True
    else:
        return False

And some test cases and benchmarks

class T1:
    # Case 2
    def __getitem__(self, i):
        return i
class T2:
    # Case 1
    def __iter__(self):
        return self
    def __next__(self):
        return 0
class T3:
    # Case 3
    def __getitem__(self, i):
        return i
    __iter__ = None
class T4:
    # Case 1, but with no __next__
    def __iter__(self):
        return iter([])
class T5:
    # Case 1, but __iter__ not callable.
    __iter__ = False
class T6:
    # Case 3
    __iter__ = None
tests = [T1(), T2(), T3(), T4(), T5(), T6()]   # [True, True, False, True, False, False]
for t in tests:
    print(t)
    %timeit isiterable(t)
all(isiterable(t) == test_iterable(t) for t in tests)   # True

# isiterable(t1)
<__main__.T1 object at 0x7f1513a4c520>
156 ns ± 3.26 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
<__main__.T2 object at 0x7f1502f760e0>
210 ns ± 0.476 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
<__main__.T3 object at 0x7f1502f766b0>
503 ns ± 24.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
<__main__.T4 object at 0x7f1502f77be0>
251 ns ± 0.248 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
<__main__.T5 object at 0x7f1502f74b80>
542 ns ± 36.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
<__main__.T6 object at 0x7f1502f768c0>
478 ns ± 2.74 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

# test_iterable(t1)
<__main__.T1 object at 0x7f1513a4c520>
237 ns ± 1.13 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
<__main__.T2 object at 0x7f1502f760e0>
237 ns ± 0.414 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
<__main__.T3 object at 0x7f1502f766b0>
163 ns ± 4.28 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
<__main__.T4 object at 0x7f1502f77be0>
232 ns ± 2.29 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
<__main__.T5 object at 0x7f1502f74b80>
210 ns ± 9.85 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
<__main__.T6 object at 0x7f1502f768c0>
161 ns ± 1.36 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

There might still be some value in going with an alternative implementation of isiterable. However, if the function is producing wrong results, then that is clearly unacceptable.

LincolnPuzey commented 1 year ago

thanks for the deep dive @groutr

I like your test cases. T5 is interesting. It is not iterable because calling __iter__ (in this case False) raises a TypeError. The same thing would happen for this class:

class T7:
    def __iter__(self):
        raise TypeError

But I think your alternate implementation would classify that as iterable.

For now, I have opened a PR reverting the change back and adding some more test cases in #551.

I will leave it up to @eriknw to consider the alternative implementation. Personally I would favor the original for its simplicity.

eriknw commented 1 year ago

Yeah, simple is good here.

@groutr's exploration of the rabbit hole is pretty interesting though, and the test_iterable function is instructive. Also, +1 for the additional tests :)