python / cpython

The Python programming language
https://www.python.org
Other
61.97k stars 29.8k forks source link

Optimization for set/frozenset.issubset() #62232

Closed 25d4b5bc-6ae1-4c52-a97c-78e3aa3e5594 closed 9 years ago

25d4b5bc-6ae1-4c52-a97c-78e3aa3e5594 commented 11 years ago
BPO 18032
Nosy @rhettinger, @vstinner, @ezio-melotti, @serhiy-storchaka, @phmc, @MojoVampire
Files
  • issubset_improvement.patch
  • 0001-Improve-set.issubset-performance-on-iterables.patch: Patch that improves set.issubset() perf. on iterables
  • set_issubset_bitarray.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/rhettinger' closed_at = created_at = labels = ['interpreter-core', 'easy', 'performance'] title = 'Optimization for set/frozenset.issubset()' updated_at = user = 'https://bugs.python.org/hhm' ``` bugs.python.org fields: ```python activity = actor = 'rhettinger' assignee = 'rhettinger' closed = True closed_date = closer = 'rhettinger' components = ['Interpreter Core'] creation = creator = 'hhm' dependencies = [] files = ['32706', '37295', '39516'] hgrepos = [] issue_num = 18032 keywords = ['patch', 'easy'] message_count = 11.0 messages = ['189806', '190133', '203416', '221338', '221382', '231760', '244155', '244158', '244163', '244246', '244250'] nosy_count = 9.0 nosy_names = ['rhettinger', 'vstinner', 'ezio.melotti', 'hhm', 'dhaffner', 'serhiy.storchaka', 'pconnell', 'josh.r', 'bru'] pr_nums = [] priority = 'normal' resolution = 'rejected' stage = 'patch review' status = 'closed' superseder = None type = 'performance' url = 'https://bugs.python.org/issue18032' versions = ['Python 3.5'] ```

    25d4b5bc-6ae1-4c52-a97c-78e3aa3e5594 commented 11 years ago

    It says here (http://docs.python.org/2/library/stdtypes.html#set-types-set-frozenset) that some of the set methods take iterables as a parameter.

    Usually, the expected behavior is for a iterator consumer to consume only as much data as it needs. For example, for the code any(itertools.repeat(True)), the code will complete speedily, in contrast to when all() is used instead of any(), in which case the code will go forever.

    A least some of the set methods have semantics such that they can consume only some of the input; for example, "issubset" only needs to make sure that all of the items in the set are in the iterable, and once this is the case, it can return "True".

    However in such a case, the methods will *still* go forever.

    The docs should specify that this is the case, to disambiguate the semantics.

    (Tested on Python 3.2.3 and 2.7.3).

    rhettinger commented 11 years ago

    We don't normally document implementation details or the presences or absence of internal optimizations. This gives us freedom to change the implementation and it allows freedom for other implementations (such as Jython, IronPython, and PyPy) to make their own choices. Often those implementations start out with the simplest correct approach and then become increasingly optimized over time.

    I'm leaving this one open as a possible CPythhon performance enhancement, adding early-out behavior to issubset() when the argument is a non-set, non-dict iterable:

        def issubset(self, iterable):
            n = len(self)
            seen = set() 
            for x in iterable:
                if x not in seen and x in self:
                    seen.add(x)
                    n -= 1
                    if not n:
                        return True             # early-out
            return False
    9e571daf-f377-4861-9bc4-4ab3d3145d0a commented 10 years ago

    I've made an attempt at patching set_issubset() to match the Python from Raymond's message. I've rearranged the set_issubset function so that it checks for a set/frozenset, dict, or iterable in that order. In the iterable case it will create a temporary set and add elements to it as it consumes them from the iterable, returning early when possible.

    This is my first patch, please let me know how I may improve it!

    rhettinger commented 10 years ago

    The patch looks good. I'll go over it in more detail shortly.

    99ffcaa5-b43b-4e8e-a35e-9c890007b9cd commented 10 years ago

    Patch needs some work. See comments on patch.

    107570b1-f529-4366-a9b3-347e6fceb4c9 commented 9 years ago

    Here is an updated patch based on Dustin's work with Josh's comments. I also added a test which takes forever on an unpatched python interpreter.

    Since it's a performance issue, I've benchmarked the results. They don't change for the most part (argument is a set or a dict) but they're way better for iterables. For every type of argument I test 1 case where "set.issubset" returns True and 1 case where it returns False.

    (a) simple argument (results unchanged)

    $ ./python -m timeit -s "s1 = set(range(1000)); s2 = set(range(1000))" "s1.issubset(s2)"
    Unpatched: 10000 loops, best of 3: 63.7 usec per loop
    Patched:   10000 loops, best of 3: 63.5 usec per loop
    
    $ ./python -m timeit -s "s1 = set(range(1000)); s2 = set(range(1, 1000))" "s1.issubset(s2)"
    Unpatched: 1000000 loops, best of 3: 0.248 usec per loop
    Patched:   1000000 loops, best of 3: 0.25 usec per loop
    
    $ ./python -m timeit -s "s1 = set(range(1000)); s2 = dict(enumerate(range(1000)))" "s1.issubset(s2)"
    Unpatched: 10000 loops, best of 3: 107 usec per loop
    Patched: 10000 loops, best of 3: 108 usec per loop
    
    $ ./python -m timeit -s "s1 = set(range(1000)); s2 = dict(enumerate(range(1, 1000)))" "s1.issubset(s2)"
    Unpatched: 10000 loops, best of 3: 43.5 usec per loop
    Patched:   10000 loops, best of 3: 42.6 usec per loop

    (b) iterable argument (speed improvement)

    1) no improvements/slight degradation when everything must be consumed

    $ ./python -m timeit -s "s1 = set(range(1000))" "s1.issubset(range(1000))"
    Unpatched: 1000 loops, best of 3: 263 usec per loop
    Patched:   1000 loops, best of 3: 263 usec per loop
    
    $ ./python -m timeit -s "s1 = set(range(1000))" "s1.issubset(range(1, 1000))"
    Unpatched: 10000 loops, best of 3: 201 usec per loop
    Patched:   1000 loops, best of 3: 259 usec per loop
    
    $ ./python -m timeit -s "s1 = set(range(100))" "s1.issubset(range(1, 1000))"
    Unpatched: 1000 loops, best of 3: 198 usec per loop
    Patched:   1000 loops, best of 3: 218 usec per loop

    2) tremendous improvements when it can return early

    $ ./python -m timeit -s "s1 = set(range(100))" "s1.issubset(range(1000))"
    Unpatched: 1000 loops, best of 3: 209 usec per loop
    Patched:   100000 loops, best of 3: 12.1 usec per loop
    
    $ ./python -m timeit -s "s1 = set('a'); s2 = ['a'] + ['b'] * 10000" "s1.issubset(s2)"
    Unpatched: 1000 loops, best of 3: 368 usec per loop
    Patched:   1000000 loops, best of 3: 0.934 usec per loop
    
    $ ./python -m timeit -s "s1 = set('a'); from itertools import repeat" "s1.issubset(repeat('a'))"
    Unpatched: NEVER FINISHES
    Patched:   1000000 loops, best of 3: 1.33 usec per loop
    serhiy-storchaka commented 9 years ago

    In C implementation no need to create set object seen. More efficient way is to use bit array.

    Here is a patch that uses this approach.

    ./python -m timeit -s "s1 = set(range(1000))" "s1.issubset(range(1000))" Unpatched : 10000 loops, best of 3: 115 usec per loop bru's patch: 10000 loops, best of 3: 114 usec per loop My patch : 10000 loops, best of 3: 92.6 usec per loop

    ./python -m timeit -s "s1 = set(range(1000))" "s1.issubset(range(1, 1000))" Unpatched : 10000 loops, best of 3: 73.4 usec per loop bru's patch: 10000 loops, best of 3: 114 usec per loop My patch : 10000 loops, best of 3: 93 usec per loop

    ./python -m timeit -s "s1 = set(range(100))" "s1.issubset(range(1, 1000))" Unpatched : 10000 loops, best of 3: 73.6 usec per loop bru's patch: 10000 loops, best of 3: 89 usec per loop My patch : 10000 loops, best of 3: 62.4 usec per loop

    ./python -m timeit -s "s1 = set(range(100))" "s1.issubset(range(1000))" Unpatched : 10000 loops, best of 3: 75.5 usec per loop bru's patch: 100000 loops, best of 3: 8.77 usec per loop My patch : 100000 loops, best of 3: 5.34 usec per loop

    ./python -m timeit -s "s1 = set('a'); s2 = ['a'] + ['b'] * 10000" "s1.issubset(s2)" Unpatched : 1000 loops, best of 3: 326 usec per loop bru's patch: 1000000 loops, best of 3: 0.394 usec per loop My patch : 1000000 loops, best of 3: 0.409 usec per loop

    ./python -m timeit -s "s1 = set('a'); from itertools import repeat" "s1.issubset(repeat('a'))" Unpatched : NEVER FINISHES bru's patch: 1000000 loops, best of 3: 0.794 usec per loop My patch : 1000000 loops, best of 3: 0.725 usec per loop

    107570b1-f529-4366-a9b3-347e6fceb4c9 commented 9 years ago

    Serhiy, that sounds good but I think that you have forgotten to attach the mentioned patch.

    serhiy-storchaka commented 9 years ago

    Oh, sorry. Here is it.

    serhiy-storchaka commented 9 years ago

    Personally I don't sure that this optimization is worth to apply. Its cost is high and optimized case is not common. This is rather an experiment, how large can be maximal effect of the optimization.

    rhettinger commented 9 years ago

    Personally I don't sure that this optimization is worth to apply.

    I concur. Closing as "not worth it" :-)