itdaniher opened this issue 7 years ago
When you say "run-length encoding", on what kind of data? Processing arrays (like numpy arrays or strings) is really slow in Python because the values need to be boxed when iterated over. I think it would be good to see some sample code including "typical inputs".
Hooray for an optimization issue! I forgot that `partitionby` hasn't been fully cythonized. I suspect that `itertools.groupby` is already pretty efficient, so there probably isn't a whole lot of room for improvement. We can still try though.
To echo what both of you have said, it would be great to have sample benchmarks that reflect your use case.
Okay, 15 hours of travel later... here it goes!
(end of vacation, yak shaving is my version of solitaire)
```python
import cytoolz as cz
import itertools as it
import numpy as np

ilen = cz.count

x = np.random.randint(0, 2, size=100000, dtype='l')

crle = lambda xs: ((ilen(gp), gp[0]) for gp in cz.recipes.partitionby(lambda x: x, xs))
irle = lambda xs: ((ilen(gp), x) for x, gp in it.groupby(xs))
```
Make sure apples == apples:

```python
>>> [i for i in crle(x)] == [i for i in irle(x)]
True
```
Put it to the test:

```python
>>> import timeit
>>> timeit.timeit('irle(x)', number=100000, globals=globals())
0.03869020694401115
>>> timeit.timeit('crle(x)', number=100000, globals=globals())
0.06634469900745898
```
Repeated the tests for a few different dtypes; the results seem consistent.

edit: Thanks + mad props to all for your great work making fast-n-fun Python faster-n-more-fun.

edit: added `ilen` as an alias for `cz.count`, which is insanely faster than `len([_ for _ in x])`, the trivial way of getting the length of an iterator.
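For the curious, a counter in the same spirit can be sketched in pure Python using the standard `deque`/`zip` recipe (the name `ilen_py` is illustrative; this is not the actual `cytoolz` implementation):

```python
from collections import deque
from itertools import count

def ilen_py(iterable):
    # Pair each item with an infinite counter and discard the pairs as they
    # are produced (maxlen=0 keeps nothing). Once the iterable is exhausted,
    # the counter has advanced exactly once per item consumed.
    counter = count()
    deque(zip(iterable, counter), maxlen=0)
    return next(counter)
```

`ilen_py(iter(range(1000)))` returns `1000` without materializing an intermediate list, since `deque` drains the zip at C speed.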
hm, you're only timing how long it takes to create the generator, not how long it takes to exhaust it.
```python
assert list(crle(x)) == list(irle(x))

# Creating the generator
%timeit crle(x)        # 100000 loops, best of 3: 10.1 µs per loop
%timeit irle(x)        # 100000 loops, best of 3: 3.22 µs per loop

# Iterating over the generator
%timeit list(crle(x))  # 1 loop, best of 3: 228 ms per loop
%timeit list(irle(x))  # 10 loops, best of 3: 103 ms per loop
```
However, if you want `partitionby` to be faster, replace the `lambda x: x` with `None`:
```python
crle = lambda xs: ((ilen(gp), gp[0]) for gp in cz.recipes.partitionby(None, xs))
assert list(crle(x)) == list(irle(x))

%timeit list(crle(x))  # 10 loops, best of 3: 164 ms per loop
%timeit list(irle(x))  # 10 loops, best of 3: 104 ms per loop
```
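`None` works here because the key function is ultimately handed to `itertools.groupby`, which treats `key=None` as the identity, i.e. it groups runs of consecutive equal elements directly. A quick stdlib-only illustration:

```python
import itertools as it

xs = [1, 1, 0, 0, 0, 1]
# key=None groups consecutive equal elements without calling any Python-level key
runs = [(len(list(group)), value) for value, group in it.groupby(xs, key=None)]
print(runs)  # [(2, 1), (3, 0), (1, 1)]
```

Skipping the per-element Python function call is where the speedup comes from.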
At this point the bottleneck is that numpy arrays are slow to iterate over, even with `itertools` et al. There are some ways to make it faster, for example converting the array to a list:
```python
%timeit list(crle(x.tolist()))  # 10 loops, best of 3: 126 ms per loop
%timeit list(irle(x.tolist()))  # 10 loops, best of 3: 67.8 ms per loop
```
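The cost of iterating the array directly is the boxing mentioned at the top of the thread: each element comes out wrapped in a numpy scalar object, whereas `tolist()` yields plain Python ints. A quick check:

```python
import numpy as np

a = np.arange(3)
# Iterating a numpy array yields boxed numpy scalar objects...
print(type(next(iter(a))))           # a numpy integer scalar, e.g. numpy.int64
# ...while iterating the equivalent list yields plain Python ints
print(type(next(iter(a.tolist()))))  # <class 'int'>
```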
The time difference here is mostly due to the fact that `groupby` doesn't need to create intermediate sequences before counting them. So the next step would be to use the normal `len` in the `partitionby` version:
```python
crle = lambda xs: ((len(gp), gp[0]) for gp in cz.recipes.partitionby(None, xs))

%timeit list(crle(x.tolist()))  # 10 loops, best of 3: 89.4 ms per loop
%timeit list(irle(x.tolist()))  # 10 loops, best of 3: 69.1 ms per loop
```
Still a bit slower but the difference isn't all that much anymore.
You're working with homogeneous array data, so you should consider switching to a numpy-based solution instead of working in pure Python (or using PyPy?). Just to give an example using `numba` (just a draft, I don't think the function is really robust):
```python
import numba as nb
import numpy as np

@nb.njit
def counts(arr):
    # One row per run: (count, value); assumes arr is non-empty
    res = np.zeros((arr.size, 2), dtype=np.int32)
    i = 0
    cnts = 0
    last = arr[0]  # seeding with arr[0] instead of None keeps nopython mode happy
    for item in arr:
        if item == last:
            cnts += 1
        else:
            res[i, 0] = cnts
            res[i, 1] = last
            last = item
            cnts = 1
            i += 1
    res[i, 0] = cnts
    res[i, 1] = last
    i += 1
    return res[:i]
```
```python
assert [tuple(inner) for inner in counts(x).tolist()] == list(crle(x.tolist()))

%timeit list(crle(x.tolist()))  # 10 loops, best of 3: 89.9 ms per loop
%timeit list(irle(x.tolist()))  # 10 loops, best of 3: 71.8 ms per loop
%timeit counts(x)               # 1000 loops, best of 3: 854 µs per loop
```
So this can speed up the function by a factor of 100.
I'm not sure what the takeaway message of this comment should be: I think `partitionby` is really useful because it's much easier to work with than `groupby`. However, while `groupby` has its weird quirks, it can be faster than `partitionby`. But: if you're working with numeric, homogeneous data and you need performance, then you should probably stay in the `numpy` domain (maybe `pandas`).
> hm, you're only timing how long it takes to create the generator, not how long it takes to exhaust it.
A flaw in the design of my test case; thanks for working through it. The observation stems from a software system that does exhaust the iterator. It looks like my observation holds under tighter scrutiny, although the gap has narrowed.
> But: if you're working with numeric, homogeneous data and you need performance, then you should probably stay in the numpy domain (maybe pandas).
I've built a prototype of a soft-realtime data-processing system in Python. I'll rewrite it in Rust when I have nothing better to do, but surprisingly, Python is more than "fast enough."
> I'm not sure what the takeaway message of this comment should be
Besides that y'all are awesome? I think "check out pypy / numba / cython" is about what I got.
Cheers!
Not surprising after reading the source and discovering that `partitionby` merely wraps `groupby`, but my use of `groupby` is a bottleneck for run-length encoding and I was hoping to see gains by leveraging `cytoolz`. I welcome suggestions and would be glad to provide sample code when I'm at my desk.
Thanks!