machow / siuba

Python library for using dplyr like syntax with pandas and SQL
https://siuba.org
MIT License

Question: comparison with dfply #287

Open omrihar opened 3 years ago

omrihar commented 3 years ago

Good Morning,

I came across this project when I was searching for an up-to-date library that ports dplyr to python. I've been using dfply for a while (before starting to use dplyr directly in R), and I was looking for a library that implements pivot_wider and pivot_longer, since dfply does not implement them (and seems to be inactive at the moment).

Since this library seems to be quite close to dfply (more than to, say, dplython), I was wondering what some of the key differences between the two libraries are? It seems that every few years another library pops up that tries to port dplyr to python, which I guess is a difficult task, but it seemed to me that the dfply approach was already quite good - so maybe building on top of it would have been a good option?

Thank you for the effort of bringing some tidyverse goodness to python :)

machow commented 3 years ago

Hey @omrihar--thanks for checking out siuba! I tried to compare key differences in this key feature doc--does that cover your question?

In general, siuba goes deeper than previous ports in 3 ways:

I think all the ports to python are similar on the surface, but what siuba has tried to nail is the ability to execute on different backends (pandas, sql, down the line spark & dask). Because this was a focus from the beginning, this kind of extendability is a part of siuba's architecture :).
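To make the backend idea concrete, here is a minimal sketch of one way a single verb can run against different backends in Python, using the standard library's `functools.singledispatch`. It is an illustration under stated assumptions, not code taken from siuba: the `SqlTable` class and the `count_rows` verb are hypothetical names invented for this example.

```python
from functools import singledispatch


class SqlTable:
    """Hypothetical stand-in for a lazily evaluated SQL table."""

    def __init__(self, name):
        self.name = name


@singledispatch
def count_rows(data):
    # Fallback when no backend has been registered for this type.
    raise NotImplementedError(f"no backend registered for {type(data).__name__}")


@count_rows.register(list)
def _count_rows_local(data):
    # "Local" backend: compute the answer eagerly in Python.
    return len(data)


@count_rows.register(SqlTable)
def _count_rows_sql(data):
    # "SQL" backend: build a query string instead of computing anything.
    return f"SELECT COUNT(*) FROM {data.name}"


local_result = count_rows([1, 2, 3])
sql_result = count_rows(SqlTable("cars"))
```

The point of the pattern is that supporting a new backend only means registering another implementation; the verb's call site stays the same.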

If you're interested in helping with implementing pivot_wider and pivot_longer, @breichholf contributed a PR (#238) with the bulk of pivot_longer. I think I dropped the ball there, but if you're interested in the pivot_ functions, I don't mind helping with it again!

omrihar commented 3 years ago

Hey @machow thanks for the quick reply! I saw the docs on the key feature but since it was compared with dplython rather than dfply, I was wondering if you were aware of that (I'm asking since it seems that dfply implemented a larger subset of dplyr than dplython, so is maybe a better benchmark).

Be that as it may, I'm quite interested in anything that will make data wrangling painless in python - so I'm quite interested in following siuba :) I also really like the idea of supporting SQL, Spark and Dask in the future! That's a very nice addition...

I found siuba while searching for something that implements the pivot_* functions, but actually I decided to manually create a dataframe that fits my needs exactly directly in pandas (maybe it's also because pivoting was always a bit confusing to me...). I'm not generally interested in this specific application, rather more interested in a general framework for "grammar of data" style libraries.

Thank you for the good work :) If I find somewhere I can contribute, I will definitely try to!

machow commented 3 years ago

> I saw the docs on the key feature but since it was compared with dplython rather than dfply, I was wondering if you were aware of that

Ah, that's fair! I have a blog post draft sitting around comparing siuba to dplython, dfply, and plydata, so this is helpful to hear. I'll try to push it out in the next couple days! I think the main things missing in siuba are bind_rows/cols, row_slice, and sample.

RE pivot functions and creating what you need in pandas, there's a fairly in-depth discussion in #233 about what these would look like in pandas. From what I remember, pivot_longer is fairly straightforward to implement (just some kind of convoluted resetting of indexes). pivot_wider could be largely a wrapper around the pandas .pivot_table method. The challenge is that it's hard to do anything beyond what would be a simple values_fn arg in tidyr's pivot_wider.
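As a rough sketch of the pandas approach described above (not siuba's implementation, and not the design discussed in #233), the two functions could look something like this. The function signatures are assumptions loosely modeled on tidyr's argument names, and `DataFrame.melt` is used here in place of manual index resetting.

```python
import pandas as pd


def pivot_longer(df, cols, names_to="name", values_to="value"):
    # Every column not being pivoted is kept as an identifier column.
    id_cols = [c for c in df.columns if c not in cols]
    return df.melt(
        id_vars=id_cols, value_vars=cols, var_name=names_to, value_name=values_to
    )


def pivot_wider(df, id_cols, names_from, values_from, values_fn="first"):
    # values_fn is passed straight to pivot_table's aggfunc, which is why
    # anything beyond a simple aggregation is hard to express this way.
    out = df.pivot_table(
        index=id_cols, columns=names_from, values=values_from, aggfunc=values_fn
    )
    out.columns.name = None  # drop the leftover "names_from" axis label
    return out.reset_index()


wide = pd.DataFrame({"id": [1, 2], "x": [10, 20], "y": [30, 40]})
long = pivot_longer(wide, ["x", "y"])
round_trip = pivot_wider(long, ["id"], "name", "value")
```

Here `round_trip` should contain the same `id`, `x`, and `y` columns as `wide`, since each (id, name) pair has exactly one value and the `"first"` aggregation has nothing to combine.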