To summarize my results with get:
Just to be clear "x1" would be the same execution time, and "x2" would be half the execution time.
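For reference, a minimal sketch of how a micro-benchmark like this could be run; it is illustrative only and not the script that produced the numbers summarized above:

import timeit

# Hypothetical harness comparing toolz and cytoolz `get`; not the original measurements.
setup = "from {mod} import get; data = list(range(10))"
for mod in ("toolz", "cytoolz"):
    elapsed = timeit.timeit("get(5, data)", setup=setup.format(mod=mod), number=1000000)
    print(mod, elapsed)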
Awesome. Though note that curry(get)(0) is only about 1.5 times faster than curry(get)(0) in toolz. The use of partial pays off.
Awesome. Though note that curry(get)(0) is only about 1.5 times faster than curry(get)(0) in toolz. The use of partial pays off.
I'm not ready to concede the point of using partial in toolz.curry when the function may have keyword arguments. curry is not the same as partial. When there are only positional arguments, curry has well-defined semantics: f(x, y, z) can be incrementally evaluated as f(1)(2)(3). If there are both positional arguments and keyword arguments, the semantics become less obvious. I would expect curry to allow incremental additions (and changes) to keyword arguments as follows: f(x, y=None, z=None) can be incrementally evaluated as f(y=1)(z=2)(y=3)(4). For example:
from toolz import memoize

# we plan to use `memoize` with this key function many places;
# because `memoize` is curried, keyword arguments can be supplied incrementally
memoize_onfirst = memoize(key=lambda *args, **kwargs: args[0])

f_cache = {}

@memoize_onfirst(cache=f_cache)
def f(x, y, z):
    ...
Such usage is incompatible with returning a partial object for get (because get has a keyword argument). Moreover, if you want almost all of the convenience of using and typing get and the speed of partial, you can use this one-liner: pget = lambda *args, **kwargs: partial(get, *args, **kwargs).
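A small sketch of the incremental keyword behavior described above, assuming curry merges keyword arguments across calls as outlined (the function f and the values are purely illustrative):

from functools import partial
from toolz import curry, get

@curry
def f(x, y=None, z=None):
    return (x, y, z)

# keyword arguments accumulate (and may be overridden) across calls;
# the wrapped function runs once the required positional argument arrives
assert f(y=1)(z=2)(y=3)(4) == (4, 3, 2)

# the `pget` one-liner above: a partial-based shortcut for `get`
pget = lambda *args, **kwargs: partial(get, *args, **kwargs)
assert pget(0)([10, 20]) == 10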
By the way, I had to make cytoolz.curry a little slower for a while in order to pass tests (have I mentioned how much I love having pre-existing tests? Well, I do!). I now understand the issue I was having, and cytoolz.curry is once again nearly as fast as partial. For me, curry(get)(0) in cytoolz is now 1.9 times faster than curry(get)(0) in toolz. Somehow, I think this looks good for both toolz and cytoolz (even though toolz needed to make a compromise to gain performance)!
I agree with you about the issue with partial; we do sacrifice correctness here. I've optimized pretty heavily for performance in the past. My only defense for this strategy is that correctness issues haven't come up.
Maybe with cytoolz around, though, this can change. Maybe instead of rarely sacrificing correctness we now optionally sacrifice the pure-Python bit. This is nice because the user can make this choice himself (by importing cytoolz rather than toolz) rather than us making the curry/partial decision for him.
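A sketch of that user-side choice, using the common prefer-the-compiled-version import pattern (the function names are just examples):

# Prefer the compiled implementation when installed; otherwise fall back to pure Python.
try:
    from cytoolz import curry, get, frequencies
except ImportError:
    from toolz import curry, get, frequencies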
I plan to talk about functional programming and core data structures at PyData Silicon Valley, May 2-4 and I'd like to show off CyToolz. It would be nice to release a blogpost or two about it beforehand. Are you working on a timeline of any sort?
Oh, cool. I'll be interested to hear how it goes.
I expect to be very busy the last two weeks of April, so I plan to get cytoolz into a beta state within a week. I think this is doable. "Beta" to me means full coverage of the toolz API, pip-installable on Linux (and possibly OS X) with and without Cython, and at least a little documentation. Your engagement is appreciated. I was initially thinking early May for my blog post. Hopefully I can whip it out sooner, but reaching beta status is my main priority.
@mrocklin, you may be interested that I ran CyToolz versions of the "Text Benchmarks" from here:
http://matthewrocklin.com/blog/work/2014/01/13/Text-Benchmarks/
I made one modification for the CyToolz versions: I used imap (actually cytoolz.compatibility.map), which turned out to be faster than using map.
The straight CyToolz version from Python is about 45% faster than with Toolz (i.e., I just changed the import from toolz to cytoolz). For kicks, I ran the same thing in a Cython-compiled module that exposes a function that does the processing. This ran about 65% faster than with Toolz.
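For context, a rough sketch of the word-frequency style of pipeline used in those benchmarks; the file name is a placeholder and this is not the exact benchmark script:

# Count word occurrences in a text file, streaming line by line.
from cytoolz import concat, frequencies
from cytoolz.compatibility import map  # imap on Python 2, builtin map on Python 3

with open("words.txt") as f:  # "words.txt" is a placeholder path
    counts = frequencies(concat(map(str.split, f)))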
That's awesome. Python, now beating Julia and Java on data structure processing.
I mentioned cytoolz during a Blaze meeting (I now work on Blaze). @FrancescAlted seemed pretty interested.
BTW, numpy.bincount beats both cytoolz.frequencies and pandas.groupby(...).size().
I mentioned cytoolz during a Blaze meeting (I now work on Blaze). @FrancescAlted seemed pretty interested.
Awesome! Any help, feedback, and additional expertise in Cython is appreciated.
I like what I've seen of Blaze, but I've never had a reason to learn it or a chance to use it (nor have I worked on "big data" appropriate for Blaze since it came into existence). Is it used much outside of Continuum's customer base?
I like the size and scope of toolz. Essentially, it is my sine qua non for functional data handling and analysis, which can be seamlessly integrated with other Python code.
BTW, numpy.bincount beats both cytoolz.frequencies and pandas.groupby(...).size()
Yeah, but who knows about and uses numpy.bincount?! I'm not terribly surprised by this result. There is a small performance penalty for handling generic, streaming data lazily. Can numpy.bincount be lazy? Also, how much faster is it?
It is significantly faster (50x). But no to laziness and no to generic types. Don't get me wrong, I always reach for toolz first; this is mostly because it matches my mental programming model more closely and because it is fast enough. Things like cytoolz I mostly see as enabling me to stay within that mental model even under stricter performance requirements.
The excitement with cytoolz.frequencies competing with pandas was that we thought it might be possible, in some cases, to reach all the performance that can reasonably be expected from a single core while staying within the functional model. At least for this application, numpy.bincount provides a compelling counterexample.
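A sketch of the comparison being made here, on made-up integer data (the sizes and values are arbitrary):

import numpy as np
from cytoolz import frequencies

data = np.random.randint(0, 100, size=1000000)

# dict {value: count}; works on any iterable, including lazy/streaming ones
counts_dict = frequencies(data)
# array where counts_arr[v] is the count of v; non-negative ints only, all in memory
counts_arr = np.bincount(data)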
All good points. I wonder if pandas could take advantage of numpy.bincount. If not generally, what about under certain circumstances?
cytoolz now matches the toolz API! (minus the sandbox)
Woot!
Another positive outcome of this effort: there are a few faster implementations that could be back-ported to toolz. I don't recall which ones, though; my head is in a bit of a whirlwind. Heh, the final 10 functions added to cytoolz seemed to take as long to develop as the first 40!
I'm going to spend a few days away from cytoolz, but, if you'd like, I encourage you to take charge of testing, developing, documenting, and playing around with it and reporting any problems or annoyances.
Blaze has me pretty hard right now, sorry that I haven't been as responsive. The top of my cytoolz priority list is to write a blogpost and a PyToolz doc page. Maybe we can get other people on board to help with the other things?
This is really fantastic work by the way, I know I've said that before, but it should really probably be said every day or two.
Blaze has me pretty hard right now, sorry that I haven't been as responsive.
New job, right? Obviously top priority.
The top of my cytoolz priority list is to write a blogpost and a PyToolz doc page.
Perfect!
This is really fantastic work by the way, I know I've said that before, but it should really probably be said every day or two.
Thanks, I'm glad you think so and said so!
This began as a way to learn and explore Cython with non-trivial tasks that didn't match the given examples. toolz was a great candidate for many reasons: each function is pure and mostly independent (hence, easy to implement piece-by-piece), the API is already defined, tests already exist(!), there are a wide variety of functions, and the functions don't conform to typical Cython examples. I had no idea whether it would be worth creating an entire package--cytoolz--but I was definitely curious how the performance would compare to toolz, which served as a motivator. Even though becoming more familiar with Cython was my first priority, I thought this could turn into a separate package if it were feasible to do and the performance justified it. Even if it were those things, I am certain I wouldn't have developed cytoolz if not for your enthusiasm and support. Thanks!
toolz is a great package on its own merits, and cytoolz merely adds to its value. It may seem odd that some Python users care so much about performance, but the truth is that many users do (especially in the scientific community). I think that just the existence of cytoolz--that a higher-performance alternative is available--will make some people more comfortable with using toolz. Hopefully toolz gets more coverage and traction among users this year--let's make it so! There is a niche to be filled. (Alright, I've made a decision: I'm going to try to present toolz at PyOhio this year.)
I think we can close this now that CyToolz has been released. Are there any lingering issues from our discussions here?
I'm closing this issue. cytoolz is well-established, and it has successfully been updated and maintained for a release cycle, which didn't turn out to be too painful thanks to automated tests that verify consistency between toolz and cytoolz. I also use a simple script to copy tests over from toolz, so this isn't a big deal either, so long as tests get back-ported to toolz (which was also done this release cycle).
The discussion in the PR remains an interesting read, but there isn't much to add, nor is this the proper place to continue the discussion. Closing.
What do you think about having a Cython implementation of toolz that can be used as a regular C extension in CPython, or be cimport-ed by other Cython code?

I've been messing around with Cython lately, and I became curious how much performance could be gained by implementing toolz in Cython. I am almost finished with a first-pass implementation (it goes quickly when one doesn't try to fine-tune everything), and just have half of itertoolz left to do.

Performance increases of x2-x4 are common. Some perform even better (like x10), and a few are virtually the same. There is also less overhead when calling functions defined in Cython, which at times can be significant regardless of how things scale.

However, performance when called from Python isn't the only consideration. A common strategy used by the scientific, mobile, and game communities to increase the performance of their applications is to convert frequently run Python code to Cython. Developing in Cython also tends to be very imperative. A Cython version of toolz will allow fast implementations to be used in other Cython code (via cimport) while facilitating a more functional style of programming.

Looking ahead, cython.parallel exposes OpenMP at a low level, which should allow for more efficient parallel processing.

Thoughts? Any ideas for a name? I am thinking coolz, because ctoolz and cytoolz sound like they are utilities for C or Cython code. I can push what I currently have to a repo once it has a name. Should this be part of pytoolz?