New grammar for data manipulation

Make42 commented 6 years ago

This is not an issue, but I am not sure how else to contact someone on GitHub to pass along some information. I just wanted to let you know that there is a new grammar for data manipulation: https://github.com/has2k1/plydata

A related question: how is dfply related to it? They seem rather similar, right?

kieferk commented 6 years ago

Looks similar for sure. This is the first I've heard of this one. All of these packages (dfply, dplython, plydata) are python ports of the dplyr package so they are going to be pretty similar in syntax.

Make42 commented 6 years ago

https://github.com/coursera/pandas-ply would be another one. All of them (except the new plydata) are summarized in http://fastml.com/piping-in-r-and-in-pandas/, where dfply is mentioned as well. I am not quite sure what the "graph of inspiration" is. I think your project is the newest, inspired by dplython, right?

I am a bit baffled that all these very similar projects pop up, and I do not quite understand why the respective open source programmers do not collaborate instead... can you shed some light on that?

kieferk commented 6 years ago

OK I'll give you the breakdown, from my perspective at least. All of these packages are trying to port the incredible dplyr package from R to python. Because python is a significantly different language than R, people have taken different approaches to doing this (no lazy evaluation, for example, makes the porting less than trivial).
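To make the lazy-evaluation point concrete: Python evaluates function arguments eagerly, so a bare column name cannot be passed unevaluated the way dplyr allows in R. One common workaround is a placeholder object that records an expression and applies it later. The following is a minimal sketch of that idea only; the `Deferred` class and the `X` placeholder are hypothetical names for illustration and are not taken from any of the packages discussed here.

```python
import operator


class Deferred:
    """Hypothetical placeholder that records an expression for later evaluation."""

    def __init__(self, func):
        self._func = func  # maps a row (a dict) to a value

    def __getattr__(self, name):
        # X.price builds a new Deferred that looks up "price" when evaluated.
        if name.startswith("_"):
            raise AttributeError(name)
        return Deferred(lambda row: self._func(row)[name])

    def _binary(self, other, op):
        def run(row):
            right = other._func(row) if isinstance(other, Deferred) else other
            return op(self._func(row), right)
        return Deferred(run)

    def __mul__(self, other):
        return self._binary(other, operator.mul)

    def __gt__(self, other):
        return self._binary(other, operator.gt)

    def evaluate(self, row):
        return self._func(row)


X = Deferred(lambda row: row)

# Nothing is computed here; the expression is only recorded.
expr = X.price * X.qty > 10

rows = [{"price": 2, "qty": 3}, {"price": 4, "qty": 5}]
flags = [expr.evaluate(r) for r in rows]
```

The expression is built once and evaluated per row afterwards, which is roughly the role lazy evaluation plays on the R side.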

When I first started making this package the two I was aware of were pandas-ply and dplython. The latter was close to what I was hoping for, but it did not appear to be maintained actively anymore and I didn't like the fact that you were required to first convert your pandas DataFrame into the special DplyFrame object before piping would work.
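As an illustration of how piping can work without wrapping the data in a special frame type: Python falls back to the right operand's `__rrshift__` when the left operand does not implement `>>`. The sketch below uses plain lists of dicts instead of DataFrames, and `Verb`, `sort_by` and `head` are hypothetical names; it is not meant to describe how dfply or dplython are actually implemented.

```python
class Verb:
    """Hypothetical pipeable step wrapping a function of the data."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Called for `data >> verb` when type(data) does not handle `>>` itself.
        return self.func(data)


def sort_by(key):
    return Verb(lambda rows: sorted(rows, key=lambda r: r[key]))


def head(n):
    return Verb(lambda rows: rows[:n])


rows = [{"c": 3}, {"c": 1}, {"c": 2}]

# The left operand stays an ordinary list; no conversion step is needed.
result = rows >> sort_by("c") >> head(2)
```

Because the verb objects carry the operator logic, the data on the left can stay whatever type it already is.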

Now it seems there are some even newer ones like plydata that I'm not super familiar with. Obviously I am biased, but I think that of all the options dfply is the "truest" to the dplyr syntax and is the most fleshed out. For example, doing something like >> select(starts_with('c')) is how it works in dplyr and only possible in dfply AFAIK.

As for why people aren't collaborating, I think it's mostly timing. The fact that dplython appeared dead inspired me to make my own. There are also some differences in opinion w/r/t syntax and how similar or different the syntax should be from dplyr. On one extreme would be pandas-ply, which entirely forgoes the dplyr piping syntax and naming. On the other extreme is dfply, which is as close to dplyr as possible.

Hope that helps clear things up!

Make42 commented 6 years ago

It does nicely. Thanks!

pandas-ply looks pretty dead to me, too, to be honest.

PS: Now you only have to change mask into filter ;-). This always bugs me.

kieferk commented 6 years ago

Can't change that one since filter is a standard and commonly-used function in Python!
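A small sketch of the problem, using a hypothetical row-filtering function rather than dfply's real code: once a module-level name `filter` exists (for example after a star import), ordinary calls to the built-in stop working in that namespace.

```python
import builtins

# Before shadowing, `filter` is the built-in.
evens = list(filter(lambda n: n % 2 == 0, range(6)))


def filter(rows, predicate):  # hypothetical verb that shadows the built-in
    return [r for r in rows if predicate(r)]


# The same call now reaches the new function, whose arguments are in a
# different order, so it fails.
try:
    list(filter(lambda n: n % 2 == 0, range(6)))
    shadowed = False
except TypeError:
    shadowed = True

# The built-in is still reachable, but only through the builtins module.
still_there = list(builtins.filter(lambda n: n % 2 == 0, range(6)))
```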

I'll close this issue now, cheers.

has2k1 commented 6 years ago

@Make42, thank you for making me aware of dfply, I did not know about it. I have chosen to reply here instead of at has2k1/plydata#3 since I can do a little more than just acknowledge learning about dfply.

I knew about dplython when I created plydata; in fact, I had thought about it long before and had a mock implementation with which I tried to influence the direction of dplython. It did not work, and I could not adapt to dplython.

Specifically, I did not like the conversion to a special dataframe and I felt that the manager variable X was clunky. A string evaluation based implementation helps get around both issues.
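To sketch what a string evaluation approach can look like: expressions are passed as strings and evaluated later with the column values as the namespace, so neither a special frame type nor a placeholder variable is needed. This is a simplified illustration over plain dicts with a hypothetical `mutate` function, not plydata's implementation. Note that `eval` should not be used on untrusted input.

```python
def mutate(rows, **new_columns):
    """Hypothetical verb: each keyword maps a new column name to an expression string."""
    compiled = {
        name: compile(expr, "<expr>", "eval")
        for name, expr in new_columns.items()
    }
    out = []
    for row in rows:
        new_row = dict(row)
        for name, code in compiled.items():
            # Column names in the string resolve against the current row.
            new_row[name] = eval(code, {"__builtins__": {}}, new_row)
        out.append(new_row)
    return out


rows = [{"price": 2, "qty": 3}, {"price": 4, "qty": 5}]
result = mutate(rows, total="price * qty")
```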

Nonetheless, it does pain me to see a duplication of efforts in the open-source world. Concerning dfply and plydata, the efforts will probably go on since both are mature enough and have distinct design choices. That is, dfply goes for near syntax compatibility with dplyr e.g.

>> select(starts_with('c'))

Whereas plydata (>> notwithstanding) tries to be more "pythonic" e.g.

>> select(startswith='c')

The reason for the difference is that in Python, str.startswith does not have an underscore, and using a keyword argument reduces the number of names in the global namespace.
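The two styles can be compared side by side in a rough sketch. The functions below are hypothetical stand-ins that operate on a list of column names; they are not the actual `select` implementations of dfply or plydata.

```python
def starts_with(prefix):
    """Helper-function style: one extra public name per kind of matcher."""
    return lambda name: name.startswith(prefix)


def select_helper_style(columns, matcher):
    return [c for c in columns if matcher(c)]


def select_keyword_style(columns, startswith=None):
    """Keyword style: the matcher lives in the signature, mirroring str.startswith."""
    if startswith is None:
        return list(columns)
    return [c for c in columns if c.startswith(startswith)]


columns = ["carat", "cut", "color", "price"]

by_helper = select_helper_style(columns, starts_with("c"))
by_keyword = select_keyword_style(columns, startswith="c")
```

Both return the same columns; the keyword variant simply avoids exporting a separate `starts_with` name.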

Anyhow, if I had seen this project get off the ground I would have tried (with more zeal this time) to infect @kieferk with string evaluation ideas.

kieferk / dfply

New grammar for data manipulation #36