pyjanitor-devs / pyjanitor

Clean APIs for data cleaning. Python implementation of R package Janitor
https://pyjanitor-devs.github.io/pyjanitor
MIT License
1.35k stars 169 forks

pyjanitor version of .assign() or mutate() #599

Open markfairbanks opened 5 years ago

markfairbanks commented 5 years ago

Currently there are a few different pyjanitor functions to deal with adding/modifying columns.

In general as I've been using pyjanitor I assumed there would be something along the lines of:

df.mutate('a_plus_b', lambda x: x.a + x.b)

Is there a way to either add lambda functionality to add_column() and transform_column(), or add a function like mutate() that does this?
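A minimal sketch of what the requested helper could look like, written as a plain function rather than a registered DataFrame method. The name `mutate` and its signature are hypothetical and are not part of pyjanitor:

```python
import pandas as pd


def mutate(df, column_name, func):
    """Hypothetical helper: return a copy of df with one column added or replaced.

    func receives the whole DataFrame and returns the new column's values.
    """
    out = df.copy()
    out[column_name] = func(out)
    return out


df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
result = mutate(df, "a_plus_b", lambda x: x.a + x.b)
print(result)
```

Because the helper works on a copy, the original `df` is left untouched and calls can be chained with `DataFrame.pipe`.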

ericmjl commented 5 years ago

@mtfairbanks thanks for raising these good points! This issue definitely is the first overview of the multitude of ways to add columns to a dataframe. Let me address your points one by one, while also taking the chance to identify the potential improvement opportunities to either the docs or the implementation.

I know that the mutate API is something idiomatic from the R world; I think it would be a nice alias to have in the library, and would help ease the transition for R users. As I wasn't an R user before, do you know what the behaviours of the mutate() function are in the R world? If so, would you be kind enough to list them here?

Also, would you be game for a PR or two? I have this goal to use contributing to pyjanitor as a low-barrier way to invite more people to make contributions to the open source world, and I'd definitely welcome your contributions too! :smile: We can talk about exact PR content if you're interested!

markfairbanks commented 5 years ago

@ericmjl I haven't made a PR before, but I'm definitely open to it if you have some ideas on content!

FYI - I know the pandas API well enough to get the job done, but I don't think I would be able to help on the code side unfortunately. I've used R/tidyverse extensively the past few years, but I've only used Python on the odd project.

mutate() has a lot in common with .assign(), but the main difference is that the R tidyverse always uses bare column names without the need for lambda x. In .assign(), using bare names rather than quoted strings for new column names seems odd for a Python package. (Not sure if this is a common opinion.)

mutate() in R has a few main behaviors:

Example:

library(tidyverse)

read_csv("https://tinyurl.com/quick-mtcars-csv") %>%
  select(mpg, cyl) %>%
  mutate(mpg_edit = mpg * 2,
         mpg_edit = mpg_edit - 5, # Can edit the newly created "mpg_edit" column
         mpg_edit_plus_cyl = mpg_edit + cyl)

import pandas as pd
import janitor

(pd.read_csv('https://tinyurl.com/quick-mtcars-csv')
 .select_columns(['mpg', 'cyl'])
 .assign(mpg_edit = lambda x: x.mpg * 2,
         # mpg_edit = lambda x: x.mpg_edit - 5  # Can't do this in Python: a repeated keyword argument is a SyntaxError
         mpg_edit_plus_cyl = lambda x: x.mpg_edit + x.cyl)
)

In the end I'm not positive whether it's better to just use .assign() or whether a better API is possible, but it seems like this is the package with the most headway towards making cleaner functions for pandas data frames. As far as pyjanitor is concerned, I'm also not sure it's necessary to be able to create multiple columns in one mutate(); calling mutate() multiple times seems fine.
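For comparison, here is a sketch of how the R pipeline's "edit the new column" step can be expressed with plain pandas by chaining a second `.assign()` call. The small inline frame is an illustrative stand-in for the mtcars CSV, and the behaviour relies on pandas evaluating keyword arguments in order (pandas 0.23+ on Python 3.6+):

```python
import pandas as pd

# Small stand-in for the mpg and cyl columns used above (values are illustrative).
cars = pd.DataFrame({"mpg": [21.0, 22.5, 18.0], "cyl": [6, 4, 8]})

result = (
    cars
    .assign(mpg_edit=lambda x: x.mpg * 2)
    # A second .assign() call can overwrite the column created by the first.
    .assign(
        mpg_edit=lambda x: x.mpg_edit - 5,
        # Later keyword arguments see columns set earlier in the same call.
        mpg_edit_plus_cyl=lambda x: x.mpg_edit + x.cyl,
    )
)
print(result)
```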

ericmjl commented 5 years ago

@mtfairbanks thanks for the detail up there! Let me see, I think I have some ideas.

> FYI - I know the pandas API well enough to get the job done, but I don't think I would be able to help on the code side unfortunately.

No worries, we can start with modifying the docs. It's lightweight, doesn't affect the API, and you can't break the code :smile:.

In terms of PR content, here's one that I have in mind: modify the join_apply docstring such that it explains how the function is supposed to read. Something like modifying the first line of the docstring to make it clearer.

Just that one would be a good first PR, and as you start to feel more comfortable, we can build towards more substantial documentation PRs or start code PRs.

> In the end I'm not positive if it's better to just use .assign() or if a better API is possible

The behaviours that you've described for mutate seem to have their building blocks available in pyjanitor. We might be able to make mutate, when called with the right kwargs, dispatch to the appropriate function that pyjanitor already uses, and hence give us a way to add the mutate API to pandas.

samihamdan commented 4 years ago

Background

So I hope I don't take the 'available for hacking' label too seriously, but I have some ideas/a prototype. My idea of the mutate method deviates a bit from the previously mentioned syntax, so I will try to explain why I think that this version would be great.

As mentioned before, .assign() can basically do everything the original post wished for. The only problem would be the necessity to use .assign() multiple times. Nevertheless, I guess that there is potential for creating a .mutate() that is closer to R's version and might help people migrate, without actually changing this necessity. It might even provide a simpler/maybe cleaner syntax to pythonistas (without the lambda boilerplate).

Proposed syntax

(pd.read_csv('https://tinyurl.com/quick-mtcars-csv')
 .select_columns(['mpg', 'cyl'])
 .mutate(mpg_edit = "mpg * 2 - 5",
         mpg_edit_plus_cyl = "mpg_edit + cyl")
)

Basically, I suggest assigning a string (containing an expression) to a keyword (the name of the new column). Every column name inside the string is replaced by dataframe.column_name, and the assignment to the new column name plus the lambda boilerplate is put in front of it ("column_name = lambda dataframe: ...."). This new string is then evaluated using Python's .eval() (later, I will discuss why I think that I can use .eval() here).

Of course, the syntax inside of the string should consider some problems (which can be solved by a pandas.eval() and pandas.query() like syntax):

Now, the syntax can look like this:

df = pd.DataFrame({'a':[1,2,3,4], 'b': [5,6,7,8], 'c O l': ["a","b", "c", "d"]})
(df
   .mutate(c = 'a+b',  # easy syntax still works
           d = '`c O l` + "d"*2', # column_name with white space 
           mean = 'a.mean() + b.mean() + c.mean()', 
           e = 'a.@mean()') # because we assigned a mean column we have to use @
)

In other words, using this implementation we can use normal Python, still work with non-tidy column names, and use objects (e.g. functions) with the same name as a column in our df. Note that this implementation also cannot assign to the same column twice inside one .mutate() call. I would argue that it is OK to use multiple .mutate() calls for such an occasion (but I am open to suggestions). Also, note that I decided against just wrapping pd.eval(), because it seems to be (only) optimized for numeric operations and does not handle string operations like the ones in the syntax above.
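The basic idea can be sketched as follows. This is not the prototype described above: it skips the string rewriting into lambdas and instead hands the columns to Python's built-in `eval()` as local names. It does not handle the backtick or @ notations, and the function name `mutate` is hypothetical:

```python
import pandas as pd


def mutate(df, **expressions):
    """Hypothetical sketch: evaluate each string with column names as variables."""
    out = df.copy()
    for new_name, expr in expressions.items():
        # Rebuild the namespace on every step so later expressions can see
        # columns created by earlier ones. Hiding builtins only narrows name
        # lookup to the columns; it is NOT a security sandbox.
        namespace = {str(col): out[col] for col in out.columns}
        out[new_name] = eval(expr, {"__builtins__": {}}, namespace)
    return out


df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "label": ["a", "b", "c", "d"]})
result = mutate(
    df,
    c="a + b",                  # plain arithmetic on columns
    d='label + "d" * 2',        # string operations work because this is ordinary Python
    total="a.sum() + c.mean()",  # c was created earlier in the same call
)
print(result)
```

Since the expressions are ordinary Python, Series methods and string operators behave exactly as they would outside the string.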

Problems?

  1. This implementation only works with ordered **kwargs, but as I understand it pyjanitor only supports Python 3.6 and above, all of which preserve keyword-argument order (as far as I know).
  2. Using the evil .eval(): using .eval() is usually not very pythonic and can lead to security problems.
    • security: First, I want to stress that I am not an expert on this topic and only describe how I see it. From my understanding, .eval() is only a security problem when exposed to third parties, because it allows the injection of malicious code. To be honest, I don't see how this is a problem for a tool aimed at (data) scientists etc. who enter the evaluated strings themselves. Still, it might be good to add a user warning for people considering piping other users' input into such a .mutate(), which would be a security problem and is not the intended use case.
    • not pythonic: Here, I would just refer to Raymond Hettinger - Beyond PEP 8 (https://youtu.be/wf-BqAjZb8M?t=2920). Sometimes .eval() is the best/only option, but I am open for trying other things I might have missed.
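To make the security point concrete, here is a small, harmless illustration of why strings from other users should not reach `eval()`: anything that is a valid Python expression gets executed, not just column arithmetic. The variable names are made up for the example:

```python
import os

# A string that could be passed where a column expression is expected,
# but is in fact arbitrary Python.
untrusted = "__import__('os').getcwd()"

# eval() runs it like any other expression and returns the result.
leaked = eval(untrusted)
print(leaked == os.getcwd())
```

Here the expression only reads the working directory, but the same mechanism would run any other call an attacker chose to supply.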

If you agree that such a .mutate() method might be helpful I can adjust my code to the PR rules and then PR it.

PS: If it is really important to be able to use one column name twice inside the same .mutate(), I can change the **kwargs to *args, where the user passes tuples of ('col_name', 'assignment_string').

The resulting syntax would look like this:

import pandas as pd
import janitor

(pd.read_csv('https://tinyurl.com/quick-mtcars-csv')
 .select_columns(['mpg', 'cyl'])
 .mutate(('mpg_edit', 'mpg * 2'),
         ('mpg_edit' , 'mpg_edit - 5'),  # this would be new
         ('mpg_edit_plus_cyl',  'mpg_edit + cyl'))
)

Everything else would be as previously mentioned (including the @ and ` notations). Personally, I prefer the non-tuple version, because the syntax seems clearer/cleaner to me.
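A rough sketch of the tuple variant, for illustration only. It swaps in pandas' own `DataFrame.eval` for the string evaluation (the approach decided against above for string operations), so it only covers numeric expressions and column names that are valid identifiers. The name `mutate_pairs` is hypothetical:

```python
import pandas as pd


def mutate_pairs(df, *pairs):
    """Hypothetical sketch: apply ('col_name', 'expression') tuples in order."""
    out = df
    for column_name, expr in pairs:
        # DataFrame.eval returns a new DataFrame when given "name = expression",
        # so the input frame is never modified.
        out = out.eval(f"{column_name} = {expr}")
    return out


# Illustrative stand-in for the mpg and cyl columns of the mtcars CSV.
cars = pd.DataFrame({"mpg": [21.0, 22.5], "cyl": [6, 4]})
result = mutate_pairs(
    cars,
    ("mpg_edit", "mpg * 2"),
    ("mpg_edit", "mpg_edit - 5"),  # the same column name can appear twice
    ("mpg_edit_plus_cyl", "mpg_edit + cyl"),
)
print(result)
```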

markfairbanks commented 4 years ago

@samihamdan The original request came about because, as an R user/developer, I didn't understand why .assign() didn't use a string to name a new column.

For example, this is how I thought .assign() should work:

df.assign('mpg_edit' =  lambda x: x.mpg * 2)

As opposed to the actual syntax of:

df.assign(mpg_edit =  lambda x: x.mpg * 2)

It seemed "unpythonic" to use an unquoted name. As I've been using python more, I now realize this is a result of using **kwargs, much like using ... in R.
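For completeness, a quoted column name is still possible with plain `.assign()` by unpacking a dict into the keyword arguments. A short sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"mpg": [21.0, 22.5]})

# Keyword form: the column name must be a valid Python identifier.
by_keyword = df.assign(mpg_edit=lambda x: x.mpg * 2)

# Dict-unpacking form: the name is an ordinary string, so it can contain spaces.
by_string = df.assign(**{"mpg edit": lambda x: x.mpg * 2})

print(by_keyword.columns.tolist())
print(by_string.columns.tolist())
```

The second form is also handy when the new column's name is only known at runtime.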

I think using .eval(), though pretty slick, could lead to some unnecessary confusion from the user.

My thought to @ericmjl would be to close this issue and let users just use .assign(), and just chalk this up to an R user misunderstanding the internals of python!