grantjenks / python-sortedcontainers

Python Sorted Container Types: Sorted List, Sorted Dict, and Sorted Set
http://www.grantjenks.com/docs/sortedcontainers/
Other
3.54k stars 206 forks source link

Improve Sorted Set Merge Performance for Disjoint Sets #53

Open ereOn opened 8 years ago

ereOn commented 8 years ago

I ran a simple benchmark to compare the merge speed between SortedSet and a regular (Python 2.7) set. I expected the merge of sorted sets to be faster but it turns out it's dramatically slower. This seems weird to me, especially in the "chain" case where I'd expect the merge to be a simple concatenation as the min and max of each do not overlap.

Here is my benchmark and its results:

import timeit

from sortedcontainers import SortedSet

def main():
    print timeit.timeit(
        setup='''
a = set(xrange(0, 5000))
b = set(xrange(5000, 10000))
        ''',
        stmt='''
c = a | b
        ''',
        number=10000,
    ), "default sets - chained"
    print timeit.timeit(
        setup='''
a = set(xrange(0, 10000, 2))
b = set(x + 1 for x in xrange(0, 10000, 2))
        ''',
        stmt='''
c = a | b
        ''',
        number=10000,
    ), "default sets - interleaved"
    print timeit.timeit(
        setup='''
from sortedcontainers import SortedSet
a = SortedSet(xrange(0, 5000))
b = SortedSet(xrange(5000, 10000))
        ''',
        stmt='''
c = a | b
        ''',
        number=10000,
    ), "sorted sets - chained"
    print timeit.timeit(
        setup='''
from sortedcontainers import SortedSet
a = SortedSet(xrange(0, 10000, 2))
b = SortedSet(x + 1 for x in xrange(0, 10000, 2))
        ''',
        stmt='''
c = a | b
        ''',
        number=10000,
    ), "sorted sets - interleaved"

if __name__ == '__main__':
    main()

Which gives:

 python sets_benchmark.py
0.855157199341 default sets - chained
0.947277746058 default sets - interleaved
8.58410127337 sorted sets - chained
8.57208783211 sorted sets - interleaved

Could you confirm/infirm my expectation regarding these results ?

Thanks.

grantjenks commented 8 years ago

You are correct. Locally I get these results:

$ python benchmark.py 
0.939733028412 default sets - chained
1.01636219025 default sets - interleaved
7.32243204117 sorted sets - chained
7.37890100479 sorted sets - interleaved

Merging sorted sets appears 7-8x slower. The performance between chained and interleaved values is nearly identical.

Let's look at why. The code for SortedSet.union is:

def union(self, *iterables):
    """
    Return a new SortedSet with elements from the set and all *iterables*.
    """
    return self.__class__(chain(iter(self), *iterables), key=self._key, load=self._load)

__or__ = union
__ror__ = __or__

So a new SortedSet is constructed by chaining self and iterables together. Iterating self is very fast but let's measure:

    print timeit.timeit(
        setup='''
a = set(xrange(0, 5000))
        ''',
        stmt='''
list(a)
        ''',
        number=10000,
    ), "default sets - iteration"
    print timeit.timeit(
        setup='''
from sortedcontainers import SortedSet
a = SortedSet(xrange(0, 5000))
        ''',
        stmt='''
list(a)
        ''',
        number=10000,
    ), "sortedcontainers.SortedSet - iteration"

And that takes:

0.296840906143 default sets - iteration
0.444090127945 sortedcontainers.SortedSet - iteration

So iterating self and iterables will take 50% longer than native sets and iterating both will take about 90% of constructing a new native set.

When a new SortedSet is initialized, a native set is created and then a SortedList is initialized from that. Since native sets use hashing, they could shuffle the sort order. Let's check:

In [9]: it1 = iter(set(xrange(0, 5000)) | set(xrange(5000, 10000)))
In [10]: it2 = iter(it1)
In [11]: all(first < second for first, second in zip(it1, it2))
Out[11]: True

That's kind of surprising to me. But given the design of sets I guess it's not too surprising. The iterated union of "chained" native sets is sorted. What if the native sets are interleaved:

In [12]: it1 = iter(set(xrange(0, 10000, 2)) | set(xrange(1, 10000, 2)))
In [13]: it2 = iter(it1)
In [14]: all(first < second for first, second in zip(it1, it2))
Out[14]: True

I'm impressed the iteration order is still sorted. The iterated union of "interleaved" native sets is also sorted.

That means the cost of creating a SortedSet is:

So we must be slower than native set union because SortedSet calls the native set constructor. That means the majority of slowdown comes from creating SortedList. The SortedList constructor is just a call to sorted followed by list slicing. How much slower is sorted than constructing a list?

In [15]: %timeit sorted(xrange(10000))
1000 loops, best of 3: 200 µs per loop

In [16]: %timeit list(xrange(10000))
10000 loops, best of 3: 74.8 µs per loop

Calling sorted is ~3x slower than calling list. Slicing the result into list segments will require another iteration. So the sources of slowdown:

  1. Iterating a SortedSet is slower than iterating a native set. The larger your sets get then the less this will be the case.
  2. SortedSet has to construct a native set which will take at least as long as the native set union operation.
  3. SortedSet creates a SortedList which calls sorted. The sorted built-in function has to compare values and so calls __lt__ for every value in the list. That operation is slower than __hash__ for historical reasons in Python. (Python's history of rich comparison operators has made the C code fairly complex.)
  4. After calling sorted the result is chopped up into segments which is another linear pass over the data.

Before we draw conclusions, let's increase the size of the sets to avoid slowdowns based on one-offs like calling chain or iter. For sorted sets 100x larger, here's the performance:

In [21]: %%timeit a = SortedSet(xrange(0, 500000)); b = SortedSet(xrange(500000, 1000000))
a | b
   ....: 
10 loops, best of 3: 107 ms per loop

In [22]: %%timeit a = set(xrange(0, 500000)); b = set(xrange(500000, 1000000))
a | b
   ....: 
10 loops, best of 3: 25.7 ms per loop

Now SortedSet is ~316% slower than native set, or 3x slower. That 3x slowdown corresponds to three linear passes over the values:

  1. Constructing a native set: linear pass that calls __hash__ once on every value.
  2. The sorted built-in: linear pass that calls __lt__ once for every value.
  3. Splitting the sorted result into segments: linear memcopy-like pass.

I would guess that sorted is the slowest part of the process. To improve performance we would need to call __lt__ less often. And we can do that because we know the other values are already sorted. Here's how we could refactor union (warning: not thoroughly tested):

    print timeit.timeit(
        setup='''
from sortedcontainers import SortedSet

class FasterSortedSet(SortedSet):
    def union(self, *iterables):
        if (self and len(iterables) == 1
                and isinstance(iterables[0], SortedSet)
                and iterables[0]
                and self._list._lists[-1][-1] < iterables[0]._list._lists[0][0]):
            that = iterables[0]
            result = FasterSortedSet()
            _set = result._set
            _set.update(self)
            _set.update(that)
            _list = result._list
            _list._lists.extend(list(sublist) for sublist in self._list._lists)
            _list._maxes.extend(self._list._maxes)
            _list._lists.extend(list(sublist) for sublist in that._list._lists)
            _list._maxes.extend(that._list._maxes)
            _list._len = len(_set)
            return result
        else:
            return self.__class__(
                chain(iter(self), *iterables),
                key=self._key,
                load=self._load
            )

    __or__ = union
    __ror__ = __or__

a = FasterSortedSet(xrange(0, 5000))
b = FasterSortedSet(xrange(5000, 10000))
        ''',
        stmt='''
c = a | b
        ''',
        number=10000,
    ), "FastSortedSet - chained"

When I run that the result is:

0.900471925735 default sets - chained
2.85443711281 FasterSortedSet - chained
6.6682741642 SortedSet - chained

I think that's as fast as we can get. FasterSortedSet is ~2.3x faster than SortedSet in this scenario but it did so by changing the internals of SortedList. Changing SortedList internals introduces tight coupling between the data types which I try to avoid. I'm not convinced this is necessary but I'm interested to hear more about your use case.

earonesty commented 7 years ago

It seems to me that this can be written as the general case of performing an operation with an "already sorted by my key" container as an argument. If all sorted containers advertised that they were sorted via a "sorted" attribute containing their key or None, then the internals of the SortedList can merely check that "sorted" attribute, and proceed without re-sorting, and not knowing too much about whether it's part of a set union.

In other words, a minor change to the "update" function in the SortedList, skipping the sort if the hasattr(iterable,"sorted") and iterable.sorted==self.key.

(I've always wanted a "sorted" flag to be a first class container citizen... with the sorted() function automatically setting this flag to the key used, and any unknown operations clearing it.)

grantjenks commented 7 years ago

The "sorted" attribute is kind of already present given the type. Good catch on checking the key function as well. I could expose the key function as a readonly attribute. Then you could simply check: "isinstance(other, SortedList) and other.key == self.key"

There has also been some discussion of adding Sorted as a kind of collections ABC. If you work on a proposal then I would be glad to contribute. Andrew Barnert also had some thoughts: http://stupidpythonideas.blogspot.com/2013/07/sorted-collections-in-stdlib.html

I was also thinking that a SortedSet.fromset classmethod could be a more useful API than the current _set parameter used to initialize sorted sets. Maybe SortedList.fromsorted is also warranted and would assume the input is sorted.

grantjenks commented 6 years ago

Couple updates:

1) sorted list, sorted dict, and sorted set all expose a "key" attribute now which could be used for the comparison checking.

2) The sorted set data type now has "_fromset" as described above. But sorted list does not have "_fromsorted". That's still an interesting idea.