kieferk / dfply

dplyr-style piping operations for pandas dataframes
GNU General Public License v3.0

generalizing dfply? #55

Open cunningjames opened 6 years ago

cunningjames commented 6 years ago

Hello all.

I understand this isn't the most bristling-with-activity project, but -- coming at it from the perspective of a Python tinkerer who's fallen in love with R's tidyverse -- this is really clever stuff. I've been reading through the source code and it's clear, pretty, and concise.

It is, however, deeply tied to Pandas. One of the neat things about dplyr / the tidyverse in general is that it works (to a greater or lesser degree!) with other sources like DBI connections or Spark. Further, it lets the programmer bring an almost declarative paradigm to tasks unrelated to data wrangling, where that might be warranted. I think some Pythonistas would frown deeply at the suggestion, but something like dfply's pipes combined with an FP library like toolz could be really nice from my perspective.
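For illustration, here is a loose sketch of that contrast: dfply's `>>` piping next to the same steps expressed through toolz's `pipe()` over plain functions. (The DataFrame and column names are made up, and there is no actual toolz integration in dfply today.)

```python
import pandas as pd
from toolz import pipe
from dfply import X, head, mask, select

df = pd.DataFrame({"price": [3, 1, 4, 1, 5], "cut": list("aabba")})

# dfply: the >> operator and the symbolic X, pandas-only
result_dfply = df >> mask(X.price > 1) >> select(X.price) >> head(3)

# toolz: the same steps as ordinary functions threaded through pipe(),
# which would work for any object the individual functions understand
result_toolz = pipe(
    df,
    lambda d: d[d.price > 1],
    lambda d: d[["price"]],
    lambda d: d.head(3),
)
```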

All that said: would you be amenable to my taking on the task of trying to generalize dfply a little? Obviously the tidyverse would be too huge an undertaking, but I could see making the pipe code more general or, in the medium term, building something like a proof-of-concept SQL generator or SQLAlchemy backend.

If not, totally fine! Though I may fork and go at it on my lonesome.

sharpe5 commented 6 years ago

I am also a huge fan of dfply, so you have my support.

One thing I would also like is the ability to call custom aggregation functions written in either Cython or numba. The reason? I have a big split/apply/combine statement that takes 15 minutes in dfply for 1 million rows, but if I hand-convert it to Pandas it takes 16 seconds. So I am writing everything in dfply, then commenting it out and replacing it with Pandas/numba for speed.
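For concreteness, the kind of numba-jitted custom aggregation being described might look roughly like the sketch below; the column names and the aggregation itself are invented for the example.

```python
import numpy as np
import pandas as pd
from numba import njit

@njit
def clipped_mean(values):
    # toy custom aggregation: mean of the values clipped at zero
    total = 0.0
    for v in values:
        total += v if v > 0.0 else 0.0
    return total / len(values)

df = pd.DataFrame({
    "key": np.random.randint(0, 1000, 1_000_000),
    "value": np.random.randn(1_000_000),
})

# run the jitted aggregation per group on raw numpy arrays
result = df.groupby("key")["value"].apply(lambda s: clipped_mean(s.to_numpy()))
```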

So if you are refactoring, perhaps this could be part of the end goal?

kieferk commented 6 years ago

I've actually thought about this quite a bit, and I think it is a good idea: for example, having dfply work on top of "backends" like pandas, pyspark, etc.

I haven't come up with anything yet that generalizes but also maintains the functionality. The piping operator is easy to generalize, for example, but once you get into grouping and the "symbolic" X representing the data passing through, my limited attempts at that kind of abstraction have gotten too messy.
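As a rough illustration of why the piping operator is the easy half, a minimal backend-agnostic `>>` might look something like the sketch below (the names are hypothetical and this is not dfply's actual implementation). Everything hard, namely grouping and the symbolic X, is exactly what it leaves out.

```python
import pandas as pd

class Pipe:
    """Wrap a function so `data >> step` and `data >> step(args)` both work."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # data >> step: evaluate the wrapped function on whatever came through
        return self.func(data)

    def __call__(self, *args, **kwargs):
        # step(args): return a new Pipe with the extra arguments bound
        return Pipe(lambda data: self.func(data, *args, **kwargs))

@Pipe
def head(data, n=5):
    # works for any backend object that exposes a .head() method
    return data.head(n)

df = pd.DataFrame({"a": range(10)})
print(df >> head(3))  # anything with a .head() method pipes the same way
```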

All that being said, if you are able to generalize it I would obviously be thrilled. It really is the critical next step for the package.

cunningjames commented 6 years ago

I appreciate the words of encouragement! It's something I've been wanting for a while, but I have to admit that my thoughts remain a little inchoate. R is sufficiently metaprogrammable that I suspect a really thorough, really generic, out-of-the-box-works-everywhere reimplementation of dplyr/tidyr functionality would be impossible in Python. Right now my thoughts are that it may be easiest to emulate R's generic methods and try to ease the burden on those who want to hook up new APIs.

So, for example, a piped function like select would just (!) punt and call a select method which knew what to do with the object piped in. I'll have to think on something like group_by, though, which seems like it would have to be threaded through ... maybe a separate "groupable" object.
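One hypothetical way to spell that "punt to the object" idea in Python is single dispatch: a generic select that picks an implementation based on the type of whatever comes down the pipe. This is only a sketch of the dispatch mechanism, not a proposed API; `do_select` and the column handling are made up.

```python
from functools import singledispatch
import pandas as pd

@singledispatch
def do_select(data, *columns):
    # fallback: assume the backend object has its own select() method,
    # as a pyspark.sql.DataFrame does
    return data.select(*columns)

@do_select.register(pd.DataFrame)
def _(data, *columns):
    # pandas has no select() method, so register an explicit handler
    return data[list(columns)]

print(do_select(pd.DataFrame({"a": [1, 2], "b": [3, 4]}), "a"))
```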

Sorry, this is probably coming off as babbling. I'll start working on this and we'll see if it comes to anything. Again, thank you for the encouragement!

zhaoxb10 commented 3 years ago

Can dfply work in a Spark environment, the way sparklyr does?