pytoolz / toolz

A functional standard library for Python.
http://toolz.readthedocs.org/

Cython implementation of toolz #155

Closed eriknw closed 10 years ago

eriknw commented 10 years ago

What do you think about having a Cython implementation of toolz that can be used as a regular C extension in CPython, or be cimport-ed by other Cython code?

I've been messing around with Cython lately, and I became curious how much performance could be gained by implementing toolz in Cython. I am almost finished with a first-pass implementation (it goes quickly when one doesn't try to fine-tune everything), and just have half of itertoolz left to do.

Performance increases of x2-x4 are common. Some perform even better (like x10), and a few are virtually the same. There is also less overhead when calling functions defined in Cython, which at times can be significant regardless of how things scale.

However, performance when called from Python isn't the only consideration. A common strategy used by the scientific, mobile, and game communities to increase performance of their applications is to convert Python code that is frequently run to Cython. Developing in Cython also tends to be very imperative. A Cython version of toolz will allow fast implementations to be used in other Cython code (via cimport) while facilitating a more functional style of programming.

Looking ahead, cython.parallel exposes OpenMP at a low level, which should allow for more efficient parallel processing.

Thoughts? Any ideas for a name? I am thinking coolz, because ctoolz and cytoolz sound like they are utilities for C or Cython code. I can push what I currently have to a repo once it has a name. Should this be part of pytoolz?

mrocklin commented 10 years ago

This is amazing. I'd love to play with it, please push!

Particular notes:

  1. Yes, I would like this to be part of PyToolz
  2. What about toolz importing coolz functions instead if they exist?
  3. It would be interesting to see timings for functions that deal with a fair amount of data but that don't necessarily depend on a Python function input. Pluck and frequencies come to mind.
  4. The thought of threading is interesting. Let's consider a parallel threaded map for a moment. Presumably we'll still block on the input Python function. What if this were a Cython function? We would need to call coolz.threaded.map from Cython, not from Python, to get significant parallelism; is that correct? Again, functions like frequencies and pluck would get past this.
  5. How effectively does Cython handle laziness? (I haven't gotten my feet wet with Cython yet.)
  6. Naming is hard. I think that you can change the name of repos on github. Most of the pain is probably in internal grepping/replacing. I agree with you about ctoolz and cytoolz although to me coolz also sounds odd. I don't have any suggestions at the moment though so I say go with it. toolc? toolzc (is zc a common sound in some language? Hungarian?) The ever creative cythontoolz?
eriknw commented 10 years ago

I'd love to play with it, please push!

I'll try to get to it this evening!

2 What about toolz importing coolz functions instead if they exist?

Not sure. You lose introspective capabilities, including descriptive tracebacks that show the lines around the exception, and the code isn't easily viewed (such as with ?? in IPython).
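For reference, the kind of fallback import being discussed could look roughly like the sketch below. The helper name `first_importable` is hypothetical, and standard-library module names stand in for coolz and toolz so that the snippet runs without either package installed.

```python
import importlib


def first_importable(*names):
    """Return the first module in `names` that can be imported.

    Hypothetical helper: a real package would list its preferred
    (compiled) backend first and the pure-Python one last.
    """
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of %r could be imported" % (names,))


# Stand-ins: the first name does not exist, so the second one is used.
backend = first_importable("no_such_compiled_backend_xyz", "itertools")
print(backend.__name__)  # itertools
```

The trade-off described above still applies to a scheme like this: whenever the compiled backend wins, its functions no longer have Python source for tracebacks or `??` to display.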

3 It would be interesting to see timings for functions that deal with a fair amount of data but that don't necessarily depend on a Python function input. Pluck and frequencies come to mind.

I haven't done pluck yet, but I have two competing implementations of frequencies:

Impl.     All the same    All different
toolz     2.22 ms         7.02 ms
coolz1    1.03 ms         1.21 ms
coolz2    1.33 ms         1.10 ms

I plan to make benchmark comparisons easier by using the timeit module, which will also make it easier to test various Cython implementations.
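A minimal sketch of such a timeit-based comparison follows. The two `frequencies_*` functions are stand-ins written for this example (neither is taken from toolz or coolz), and the "all the same" / "all different" inputs mirror the two columns of the table above; the input size and repeat counts are arbitrary assumptions.

```python
import timeit
from collections import Counter


def frequencies_loop(seq):
    """Count occurrences with a plain dict and an explicit loop."""
    counts = {}
    for item in seq:
        counts[item] = counts.get(item, 0) + 1
    return counts


def frequencies_counter(seq):
    """Same result, delegating the counting loop to collections.Counter."""
    return dict(Counter(seq))


def best_time(func, data, repeat=3, number=10):
    """Best per-call time in seconds over `repeat` runs of `number` calls."""
    timings = timeit.repeat(lambda: func(data), repeat=repeat, number=number)
    return min(timings) / number


all_same = [1] * 10000
all_different = list(range(10000))

for name, func in [("loop", frequencies_loop), ("counter", frequencies_counter)]:
    print(name, best_time(func, all_same), best_time(func, all_different))
```

Taking the minimum over several repeats is the usual way to reduce noise from other processes; the absolute numbers will differ from machine to machine.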

4 The thought of threading is interesting

I looked into this a bit more. Although Cython has a nogil keyword for functions and a with nogil context manager, they appear to be of limited use. It seems releasing the GIL is only useful for calling into C code that doesn't touch Python, and for Cython functions that don't handle Python objects. I don't know all the rules yet. It appears one issue is with reference counting, which requires the GIL. It may be possible to get around this limitation for some functions by using pointers to objects, but this gets pretty ugly. Oh well.

5 How effectively does Cython handle laziness? (I haven't gotten my feet wet with Cython yet.)

Cython can do everything that Python does. However, the C-equivalent code is pretty limited:

  1. Closures are not supported. This has not been an issue for coolz. Extension types (cdef class) are used instead, which may include fast C-level access.
  2. yield is not supported. Extension types that use the iterator protocol are used instead (i.e., __iter__ returns self and __next__ returns the next value). The iterator protocol is a Python construct, and I haven't determined whether there is a good way for C-level access. Even if it doesn't match the speed of in-memory structures, the performance is still pretty good, and may be the best way to handle laziness.
  3. Variadic arguments as done in Python aren't supported. My approach for variadic functions is to have a non-variadic C version that the Python version calls (and that a user may call directly from Cython or C).
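The three workarounds in the list above can be illustrated in plain Python. This is only a sketch of the patterns, not the coolz source: in Cython the two classes would be extension types (cdef class) and the non-variadic core would be a C-level function. All names here (`Complement`, `Iterate`, `_merge_dicts`) are chosen for the example.

```python
from itertools import islice


class Complement:
    """Callable object standing in for a closure over `func`."""

    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        return not self.func(*args, **kwargs)


class Iterate:
    """Iterator object standing in for a generator: produces
    x, func(x), func(func(x)), ... without using `yield`."""

    def __init__(self, func, x):
        self.func = func
        self.x = x
        self.started = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.started:
            self.x = self.func(self.x)
        else:
            self.started = True
        return self.x


def _merge_dicts(dicts):
    """Non-variadic core: takes a single iterable of dicts."""
    result = {}
    for d in dicts:
        result.update(d)
    return result


def merge(*dicts):
    """Variadic, Python-facing wrapper around the non-variadic core."""
    return _merge_dicts(dicts)


is_odd = Complement(lambda n: n % 2 == 0)
print(is_odd(3))                                     # True
print(list(islice(Iterate(lambda n: n * 2, 1), 5)))  # [1, 2, 4, 8, 16]
print(merge({"a": 1}, {"b": 2}))                     # {'a': 1, 'b': 2}
```

State that a closure or generator would keep implicitly (the wrapped function, the current value) becomes explicit attributes on the object, which is what makes it accessible from C-level code.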

6 Naming is hard.

Yeah, naming is hard. cythontoolz is a contender, but I'll stick with coolz for now, because it's shorter and is what I've been using. I'm not against changing it to something better though.

mrocklin commented 10 years ago

I'm definitely looking forward to playing around with this.

eriknw commented 10 years ago

Here it is: https://github.com/eriknw/coolz

The "TODO" file shows what is not yet implemented. All the tests pass, except for the doctests for do, which require curried (which doesn't exist).

Heh, and just to reiterate, this is a first pass implementation to see if such a project is feasible and worthwhile, so nothing is uber-tweaked for performance (although I did try a few variations of each), and some things may be uglier than needed. Also, I am not (yet) an expert in Cython.

You currently need Cython to build coolz. Typically, *.c files would be distributed with the package too, so all a user would need is a C compiler. I haven't gotten that far along yet, but other packages do this.

Feedback is greatly appreciated.

Oh, and another name idea: fasttoolz.

Enjoy!

mrocklin commented 10 years ago

FYI

In [1]: import toolz
In [2]: import coolz
In [3]: import pandas

In [4]: import random
In [5]: data = [random.randint(0, 100) for i in range(100000)]

In [6]: timeit toolz.frequencies(data)
100 loops, best of 3: 8.42 ms per loop

In [7]: timeit coolz.frequencies(data)
100 loops, best of 3: 3.46 ms per loop

In [8]: df = pandas.DataFrame(data)
In [9]: timeit df.groupby(0).size()
100 loops, best of 3: 4.21 ms per loop
mrocklin commented 10 years ago

Can you recommend Cython getting-started resources? What is your development workflow like here? I ended up running setup.py install to get coolz working. Is this standard?

eriknw commented 10 years ago

Awesome!

mrocklin commented 10 years ago

Projects like coolz are a good reason to keep the API as small as possible. I wouldn't mind shrinking things down to an actual core plus a lot of convenient peripherals.

eriknw commented 10 years ago

python setup.py build_ext --inplace is a common practice that allows you to run things from the same directory.

Resources that I have used include:

http://docs.cython.org/ https://github.com/cython/cython/wiki https://github.com/cython/cython/wiki/FAQ

I also participated in the Cython tutorial at the SciPy Conference in 2013:

http://public.enthought.com/~ksmith/scipy2013_cython/ https://github.com/kwmsmith/scipy2013-cython-tutorial http://conference.scipy.org/scipy2013/tutorial_detail.php?id=105

Although the documentation is pretty good and things for the most part just make sense, at times it seems like there is a needed but missing manual.

Projects like coolz are a good reason to keep the API as small as possible.

I absolutely (and obviously) agree!

I wouldn't mind shrinking things down to an actual core plus a lot of convenient peripherals.

What do you have in mind?

mrocklin commented 10 years ago

I'd like eventually to collapse down the directory structure so that we have e.g. itertoolz.py rather than itertoolz/core.py. Then maybe we have a straight toolz/core.py with the more serious functions. Everything is still in the flat namespace of toolz but projects like coolz might choose only to implement toolz.core. In this way users could use everything at first

from toolz import *

Then, if they wanted to use other backends they might ensure that their projects work with just the core

from toolz.core import *

And then they would be reassured that their project would work with coolz

# from toolz.core import *
from coolz import *

I think that the batteries-included philosophy works well for general toolz (e.g., let's go ahead and add pluck, juxt, etc.). We can balance that philosophy with supporting other projects by having an optional namespace.

mrocklin commented 10 years ago

Something that we might think about is how to run the same test suite on different projects. This is a more general problem that extends beyond c/toolz.

mrocklin commented 10 years ago

In particular, I'd like to run exactly the toolz test suite against coolz. I'd like to continue to do this even as the two projects evolve.

mrocklin commented 10 years ago

I guess that all closures needed to turn into explicit classes?

mrocklin commented 10 years ago

What's the logic behind something like coolz.merge being faster than toolz.merge? They're calling exactly the same functions. Is all of the cost savings that I'm seeing purely in dispatching overhead?

In [21]: data = [{i:i**2} for i in range(100)]

In [22]: timeit t.merge(data)
10000 loops, best of 3: 28.7 µs per loop

In [23]: timeit c.merge(data)
100000 loops, best of 3: 5.3 µs per loop
mrocklin commented 10 years ago

What is the Python 3 / Cython story?

eriknw commented 10 years ago

In particular, I'd like to run exactly the toolz test suite against coolz. I'd like to continue to do this even as the two projects evolve.

Agreed, although I don't know the best way to do it. We may want coolz to maintain a local clone of toolz, then call a script to copy over the tests. There may be other ways to do it too. This should be posted as an Issue for coolz.

I guess that all closures needed to turn into explicit classes?

Yup, except I didn't do it for this one: https://github.com/eriknw/coolz/blob/master/coolz/functoolz/core.pyx#L441

What is the Python 3 / Cython story?

The C code created by Cython supports both Python 2 and Python 3 using the same code base. I think there is a Python 3 only (or first) mode, but I haven't looked into it. In other words, syntax-wise, things should "just work". On Cython's master branch, they just recently dropped Python 2.5 support.

mrocklin commented 10 years ago

I noticed that you were using dict.iteritems. So Cython is a Python variant that looks a lot like Python 2 but that creates extensions that are valid for either?

eriknw commented 10 years ago

What's the logic behind something like coolz.merge being faster than toolz.merge? They're calling exactly the same functions.

That's just the thing: they're not calling the exact same functions. PyDict_Update is lower level than dict.update. dict.update can also accept keyword arguments.

PyDict_Merge is also occasionally useful, because you can specify whether or not to override values if the key already exists.
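To make the override distinction concrete, here is a pure-Python sketch of the two behaviors. It does not call the C API; the `override` keyword on `merge` is a hypothetical parameter added for illustration, mirroring the flag that PyDict_Merge takes.

```python
def merge(dicts, override=True):
    """Merge an iterable of dicts into a new dict.

    override=True:  later values win (what dict.update does).
    override=False: the first value seen for a key is kept.
    """
    result = {}
    for d in dicts:
        if override:
            result.update(d)
        else:
            for key, value in d.items():
                result.setdefault(key, value)
    return result


data = [{"a": 1, "b": 2}, {"b": 20, "c": 30}]
print(merge(data))                  # {'a': 1, 'b': 20, 'c': 30}
print(merge(data, override=False))  # {'a': 1, 'b': 2, 'c': 30}
```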

Still, a speedup from 28.7 µs to 5.3 µs is pretty impressive!

mrocklin commented 10 years ago

Assoc is similarly impressive.

In [29]: timeit t.assoc(d, 3, 3)
1000000 loops, best of 3: 1.4 µs per loop

In [30]: timeit c.assoc(d, 3, 3)
1000000 loops, best of 3: 232 ns per loop

Honestly this starts making this sort of thing significantly more attractive. Replacing toolz with coolz in logpy (an odd project of mine) decreases the runtime of the test suite by 10%.

eriknw commented 10 years ago

I noticed that you were using dict.iteritems. So Cython is a Python variant that looks a lot like Python 2 but that creates extensions that are valid for either?

Exactly. I actually looked at the Cython code that handles this. I don't recall whether I actually tested this in Python 3 though. I think I did.

mrocklin commented 10 years ago

Pinging @microamp on this issue. He has shown interest in performance in the past.

eriknw commented 10 years ago

@mrocklin, thanks for playing around with coolz, and I'm glad you are finding it more and more attractive! Should it be cloned into PyToolz where we can continue to work on it? It still needs plenty of polish.

Although I don't want to rush it, I think this would make for a good blog post on Planet SciPy (and Planet Python), which may get more people (hopefully Cython experts!) involved too.

mrocklin commented 10 years ago

Yes, we should definitely fork it into pytoolz, and yes, we should definitely write about it. Probably before these things happen we should settle on a name. Is coolz final? cythontoolz? Should we think about it a bit more?

I'm happy to help with the writing to any extent (ranging from 0-100%). My guess is that it'd be good both for you and for PyToolz as a project to have your name more strongly attached to this work than mine. Do you have a blog set up with links to the planet pages? That might be a good idea at some point. We could also publish a "guest post" from my blog.

eriknw commented 10 years ago

I think we should think about the name a bit more. tulz? :-P

I don't have a blog anywhere, and I'm undecided whether I should start one. I've never felt a strong need to have a significant online presence. At this point, a "guest post" sounds perfect. Thanks for offering to help with the writing (and affirming that my name would be prominent). If you feel compelled to write a blog post, I say go for it. Worst case (best case?) scenario is that we both begin writing a post, in which case we will either have two blog posts, or we'll merge them into a better post. I'll contribute to a post however it begins. For the near-term, though, my focus will be to improve coolz.

eriknw commented 10 years ago

For testing purposes, what do you think of having a separate package that tests implementations of the toolz API?

eriknw commented 10 years ago

I'd like eventually to collapse down the directory structure so that we have e.g. itertoolz.py rather than itertoolz/core.py. Then maybe we have a straight toolz/core.py with the more serious functions. Everything is still in the flat namespace of toolz but projects like coolz might choose only to implement toolz.core. In this way users could use everything at first ...

This sounds good.

I think we should also consider having the tests built into and distributed with toolz, as is common for other scientific Python packages. This way, one could run something like from toolz.tests import run_tests ; run_tests(), which gives users a "warm fuzzy" to know that the software works on their machine, and it will provide a way to run the tests on other packages like coolz. Hence, toolz remains the reference implementation with included tests, the tests remain centrally located, and there is no need to try to work with git submodules or additional Python packages, which would be messy and difficult to work with.

mrocklin commented 10 years ago

and it will provide a way to run the tests on other packages like coolz

How do we do this? Really I want something like

from toolz.tests import run_tests
import coolz
run_tests(toolz=coolz)
eriknw commented 10 years ago

Yeah, that's what I want too. I'm sure it's doable, and worth doing if we continue to develop coolz. An advantage of doing it this way is that toolz and coolz can be self-contained packages with no dependencies (except coolz needs a C compiler, and it will need toolz for testing).
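One way this could work is for every test to receive the implementation under test as an argument. The sketch below assumes that design; `run_tests`, `check_frequencies`, and the two stand-in namespaces are hypothetical and exist only so the example runs without toolz or coolz installed.

```python
from collections import Counter
from types import SimpleNamespace


def _frequencies_reference(seq):
    counts = {}
    for item in seq:
        counts[item] = counts.get(item, 0) + 1
    return counts


def _frequencies_alternative(seq):
    return dict(Counter(seq))


# Stand-ins for the toolz and coolz modules.
reference_impl = SimpleNamespace(frequencies=_frequencies_reference)
alternative_impl = SimpleNamespace(frequencies=_frequencies_alternative)


def check_frequencies(toolz):
    """A test written against whichever implementation is passed in."""
    assert toolz.frequencies([]) == {}
    assert toolz.frequencies("aabc") == {"a": 2, "b": 1, "c": 1}


def run_tests(toolz=reference_impl):
    """Run every check against `toolz`; return how many checks ran."""
    checks = [check_frequencies]
    for check in checks:
        check(toolz)
    return len(checks)


print(run_tests())                        # 1
print(run_tests(toolz=alternative_impl))  # 1
```

Because the checks never import an implementation themselves, the same suite can be pointed at any module that exposes the same names.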

coolz is clearly feasible. Is it worthwhile, and should it be developed? I vote yes. It adds to the value proposition of toolz. I'll continue to push ahead with it, but any help is appreciated.

As for the name, I prefer coolz over cythontoolz (mostly because it's shorter, and it's easier to pattern match coolz or toolz), and we haven't come up with a name I like better. I also have coolz reserved on PyPI. So, I say if we don't come up with another name by Monday, then we stick with coolz.

mrocklin commented 10 years ago

I think that this idea is worthwhile. The cost is that it expands what needs to be maintained/codeveloped along with toolz. Generally speaking though our maintenance load is very light, so I think that we're good. I would probably not start/support this if you weren't around though.

Short names are good. Another thought is that, rather than swap out the t for a c, we could swap out the z for a c and get toolc (I pronounce this as tool-see or just tools with a less buzzy s at the end). The z was already a fancy letter and both c and z kind of sound like the s in itertools/functools. This has the advantage of sounding less "cool" :)

I'm happy to help. Do you have suggestions on some issues that I could tackle to get my feet wet?

mrocklin commented 10 years ago

Hrm, I'm curious if pluck might be faster than [item[ind] for item in seq]. That would be motivating.
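For comparison, here is a simplified stand-in for pluck next to the list comprehension. It is a sketch built on operator.itemgetter and is not claimed to be the toolz or coolz implementation; the argument order `pluck(ind, seq)` is an assumption made for this example.

```python
from operator import itemgetter


def pluck(ind, seq):
    """Lazily produce item[ind] for each item in seq."""
    return map(itemgetter(ind), seq)


rows = [(1, "a"), (2, "b"), (3, "c")]
print(list(pluck(1, rows)))      # ['a', 'b', 'c']
print([row[1] for row in rows])  # ['a', 'b', 'c']
```

Unlike the comprehension, this version is lazy in Python 3, so nothing is indexed until the result is consumed.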

eriknw commented 10 years ago

Sounds like we're on the same page regarding package development and support, and, yeah, I'm taking primary responsibility for the development of coolz. It has already been a valuable learning experience.

coolz vs toolc. I've actually gone back and forth between these two for a while. Do you have a preference?

Issues I have with toolc are:

  1. Ambiguous pronunciation (hence, more difficult to talk about and for people to google)
  2. Not visually distinct enough from toolz
  3. Looks like an odd abbreviation of toolchain

Issues I have with coolz are:

  1. It sounds... hokey?
  2. It's not a very descriptive name (if one doesn't know "c" is for Cython and "oolz" is for toolz)

Positives for toolc:

  1. "toolc is toolz in Cython"
  2. keeps the "tool" moniker.
  3. It makes a little more sense than coolz

Positives for coolz:

  1. Rhymes with toolz
  2. Visually distinct from toolz
  3. More memorable than toolc

I slightly favor coolz, but if you slightly favor toolc, then I could probably be convinced to use it instead.

I'm happy to help. Do you have suggestions on some issues that I could tackle to get my feet wet?

It's a do-ocracy, so do what you prefer! My next focus will be to create a roadmap of issues that need to be done to get coolz out of alpha.

Feel free to tackle anything from the TODO list (and note that I recently knocked off a handful of itertoolz functions). When implementing new functions, I typically write a few variations in a separate file, then compile just that file. pyximport can be used to automatically compile a module, but I don't have a habit of using it yet.

Cython conveniently wraps the standard Python C API--used as from cpython.dict cimport ...--and it's worth checking it out:

https://github.com/cython/cython/tree/master/Cython/Includes/cpython

A lot of package-related stuff needs to be done too. A license. README. TravisCI. Building/packaging/distributing. Documentation. Once out of alpha, I think coolz/toolc should be in the toolz documentation too.

I think I'll start keeping track of "tips and tricks" that I pick up, and if you find yourself learning lessons while programming in Cython, I might suggest you do the same.

mrocklin commented 10 years ago

I slightly favor toolc. My main issue with coolz is the hokeyness. I think that having cool in the name will make people take it slightly less seriously. This could also be said for the z in toolz though. I might just favor toolc because I'm biased towards things that look like toolz :)

When I talk about toolz in the flesh I accentuate the z with a strong buzz. I often say "tool-zuh, with a zee", often with a zorro-like hand motion. Presumably one could develop a similar trick with toolc, "tool-suh, with a see".

In the end, though, coolz isn't my project, it's yours. I do agree with you about the visual distinction being helpful.

mrocklin commented 10 years ago

Although, toolc looking like toolz isn't necessarily a very bad thing. Ideally we're building up a brand that is recognized and inspires trust. Visual cohesion is probably valuable to some limited extent.

eriknw commented 10 years ago

Thanks for the feedback. Your input is important, and I want your approval for inclusion in PyToolz.

Yeah, the hokeyness of coolz is significant. It also helps make it memorable, even if it doesn't instill trust.

I'll start referring to the project as toolc to see if it takes hold, since part of my comfort with coolz is that I'm familiar with it. Like I said before, I'm happy to wait until Monday to settle on a name.

Upon further consideration, cytoolz may not be such a bad name after all. Shorter than cythontoolz, easily pronounced, and properly toolz-branded. Heh, and if it's mistaken as "tools for Cython" in a way that dicttoolz is "tools for dicts", well, I can live with that.

mrocklin commented 10 years ago

My hokeyness objection doesn't carry over to cytoolz. Seems fine to me. The coolz name works too; we're probably overthinking this.

mrocklin commented 10 years ago

CyToolz also aligns with PyToolz.

eriknw commented 10 years ago

Heh, yeah, but--like you said--naming is hard. Shall we go with cytoolz then? I'll change the name of my repo, then we can copy it over to PyToolz.

mrocklin commented 10 years ago

Here is a draft of a blogpost.

http://matthewrocklin.com/blog/work/2014/04/07/CyToolz/

It has a decent sales-pitch scaffolding but could use more meat about CyToolz itself. At the moment it's saying "PyToolz is actually pretty fast", and "CyToolz is even faster" about equally. It would be nice to tip the favor to be more about CyToolz. Maybe another section between CyToolz and Conclusion? Maybe a discussion about when using CyToolz pays off over PyToolz? I feel like I've given a sales pitch for the two projects. This is good and will get the 90% involved. It'd be nice also to have material for experts to ponder/discuss.

Repo: git@github.com:mrocklin/blog.git File: _posts/work/2014-04-07-CyToolz.md

mrocklin commented 10 years ago

Also, I don't think that the post is live anywhere except by the URL so no worries about jumping the gun.

eriknw commented 10 years ago

Wow, you whipped that up quickly. Thanks! It's great that this is in markdown and I'm able to send PRs, which I will certainly do.

I also agree with your summary above. We want to show that cytoolz is fast without showing that toolz is slow (instead, toolz is "pretty fast", and "fast enough" most of the time, which it is--and it works in any Python interpreter). Your tone is interesting, and it leaves plenty of room for me to write a blog post too.

I'd like to get cytoolz out of alpha before advertising it though. This means fully implementing the toolz API (it's almost there), and making it pip-installable using Cython, or using just a C compiler if Cython is unavailable (at least on Linux; I don't know about Windows or OS X).

Out for a run. Will change repo and package names when I return.

mrocklin commented 10 years ago

I'm unemployed this week and so have lots of time. (it was a good week to tell me about cytoolz)

I used jekyll bootstrap for my blog. I've been pretty happy with it.

We could do two blogposts "Introducing CyToolz - Overview" (what I just wrote) and "Introducing CyToolz - Nuts and Bolts" (something more detailed, probably with significantly more input from your experience). I actually think that this double hit on the planet pages might be a really good approach. I often try to appeal both to the 90% masses and the 10% experts in the same post but find it frustrating. Splitting the discussion explicitly sounds good. Getting twice the vertical coverage on the page is probably also good for us.

mrocklin commented 10 years ago

Regarding an incomplete API: could we, as a stopgap, just import toolz.* into cytoolz and then overwrite with Cython implementations?

eriknw commented 10 years ago

...and so have lots of time. (it was a good week to tell me about cytoolz)

Excellent, what luck!

Yeah, we should each do a blog post. I think another example that compares against pandas would be good in your post. Head-to-head comparisons need to be done with in-memory data structures, but I think the other example should be something that would normally be done lazily. Actually, the example with frequencies works well with streaming input; you should point out that difference: pandas needs to be in-memory, but toolz/cytoolz may be a lazy generator or a data structure.
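The streaming point can be shown with a small sketch: a frequencies-style function only needs to iterate once, so it accepts a generator as readily as a list. The `frequencies` below is a stand-in written for this example rather than the toolz or cytoolz code, and `stream_of_values` is a hypothetical data source.

```python
def frequencies(seq):
    """Count occurrences of each item in a single pass over `seq`."""
    counts = {}
    for item in seq:
        counts[item] = counts.get(item, 0) + 1
    return counts


def stream_of_values(n):
    """Generator standing in for a source too large to hold in memory."""
    for i in range(n):
        yield i % 3


in_memory = [i % 3 for i in range(10)]
print(frequencies(in_memory))             # {0: 4, 1: 3, 2: 3}
print(frequencies(stream_of_values(10)))  # {0: 4, 1: 3, 2: 3}
```

Only the dictionary of counts is ever held in memory in the second call, never the full sequence.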

By the way, I renamed my repo to cytoolz. Also, I either don't know how to fork it into PyToolz, or I don't have permission to. Can you do it?

eriknw commented 10 years ago

Regarding an incomplete API: could we, as a stopgap, just import toolz.* into cytoolz and then overwrite with Cython implementations?

On the Python-facing side, yeah. This does introduce toolz as a dependency, and weird (or at least unexpected) things might happen if different versions of toolz and cytoolz are used.

In other words, I don't know if this is a good idea. It sounds reasonable, but there is a certain charm in having no dependencies and not needing to worry about versions.

By stopgap, do you mean you want to do this until cytoolz matches the API, which will let us "release early, release often" (which is generally regarded as a good idea)?

mrocklin commented 10 years ago

I'm a little bit concerned about synchronizing two projects. I suspect that we'll have more contributors to toolz, people will start using those new functions, and then they'll switch to cytoolz and get an import error.
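A small check along these lines could flag the drift before users hit an import error. This is a sketch under assumptions: `missing_from` and `public_names` are hypothetical helpers, and SimpleNamespace objects with placeholder values stand in for the toolz and cytoolz modules.

```python
from types import SimpleNamespace


def public_names(namespace):
    """Names in a module-like object that do not start with an underscore."""
    return {name for name in vars(namespace) if not name.startswith("_")}


def missing_from(reference, candidate):
    """Sorted public names defined by `reference` but absent from `candidate`."""
    return sorted(public_names(reference) - public_names(candidate))


# Placeholder attributes; only the names matter for this check.
reference = SimpleNamespace(merge=dict, pluck=map, frequencies=dict)
candidate = SimpleNamespace(merge=dict, frequencies=dict)
print(missing_from(reference, candidate))  # ['pluck']
```

Since vars() also works on real modules, the same two functions could be pointed at imported packages in a continuous-integration job.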

mrocklin commented 10 years ago

See https://github.com/pytoolz/cytoolz

mrocklin commented 10 years ago

In other words, I don't know if this is a good idea. It sounds reasonable, but there is a certain charm in having no dependencies and not needing to worry about versions.

I agree. Perhaps this is just something to keep in mind if this sort of thing keeps happening.

A fortunate side effect of not doing this, though, is that we'll be less willing to accept new functions into toolz and so constrain the size of the API.

eriknw commented 10 years ago

Great, thanks! Shall we close this Issue, and continue our discussions via Issues in the "pytoolz/cytoolz" repo?

Right now I don't like having a permanent stopgap solution of flooding the cytoolz namespace with from toolz import *. Explicit is better than implicit. If a function isn't implemented in cytoolz, then the user should discover this. It is not hard to have from toolz import ... and from cytoolz import ....

I agree that if this sort of thing keeps happening, then we should revisit the stopgap solution.

Is there a way to be automatically notified when a new version of toolz is uploaded to PyPI?

A fortunate side effect of not doing this, though, is that we'll be less willing to accept new functions into toolz and so constrain the size of the API.

I'm glad you view this as a good thing! Your reluctance and paranoia have already paid off: if toolz were larger, I probably wouldn't have begun cytoolz.

mrocklin commented 10 years ago

Let's wait to close this until cytoolz is up on PyPI. Further discussion should probably happen in pytoolz/cytoolz issues though.

eriknw commented 10 years ago

@mrocklin, I know how much you like get, and I have a new implementation of it for you to play with. I spent more time optimizing it than I typically do, and I learned a few more tricks/lessons in the process. I think the result is pretty good--in some cases we're talking about orders of magnitude!