pyjanitor version of .assign() or mutate()

Currently there are a few different pyjanitor functions to deal with adding/modifying columns.

add_column() allows you to add a column, but doesn't accept lambda functions like join_apply()
transform_column() allows you to transform an existing column, but also doesn't accept lambda functions.
join_apply() accepts lambda functions, but doesn't allow you to transform an existing column. It also has an API that is slightly different from all other column editing functions. join_apply() uses (func, column_name) as opposed to (column_name, func).

In general as I've been using pyjanitor I assumed there would be something along the lines of:

df.mutate('a_plus_b', lambda x: x.a + x.b)

Is there a way to either add lambda functionality to add_column() and transform_column(), or add a function like mutate() that does this?

@mtfairbanks thanks for raising these good points! This issue definitely is the first overview of the multitude of ways to add columns to a dataframe. Let me address your points one by one, while also taking the chance to identify the potential improvement opportunities to either the docs or the implementation.

add_column's intent is to add a pre-computed value to a new column in the dataframe. This should probably be documented. For example, adding in one value across all rows, or adding in an iterable of values of length equal to the dataframe. (I think we should add a "cycle" option too, if it's not there!)
transform_column's intent is to transform a single column by a function. Because lambda functions are functions, I think it will accept lambda functions that take in a single input. This, however, should be better documented.
join_apply's intent is to do what I think you want for mutate. We can do df.join_apply(lambda x: x["a"] + x["b"], "a_plus_b"). (It's supposed to be read as "join the application of this function on the dataframe into a new column", and that reading probably should be documented as well.)
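To make the three intents concrete, here is a rough sketch of what each one amounts to in plain pandas. These are only equivalents for illustration, not pyjanitor's actual implementation, and the frame and column names are made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# add_column's intent: attach a pre-computed value (here one value for all rows).
with_constant = df.assign(source="survey")

# transform_column's intent: run a single-input function over one column.
doubled = df.assign(a=df["a"].apply(lambda value: value * 2))

# join_apply's intent: apply a row-wise function and join the result as a new column.
joined = df.assign(a_plus_b=df.apply(lambda row: row["a"] + row["b"], axis=1))

print(joined)
```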

I know that the mutate API is something idiomatic from the R world; I think it would be a nice alias to have in the library, and would help ease the transition for R users. As I wasn't an R user before, do you know what the behaviours of the mutate() function are in the R world? If so, would you be kind enough to list them here?

Also, would you be game for a PR or two? I have this goal to use contributing to pyjanitor as a low-barrier way to invite more people to make contributions to the open source world, and I'd definitely welcome your contributions too! :smile: We can talk about exact PR content if you're interested!

@ericmjl I haven't done a PR before, but I'm definitely open to it if you have some ideas on content!

FYI - I know the pandas API well enough to get the job done, but I don't think I would be able to help on the code side unfortunately. I've used R/tidyverse extensively the past few years, but I've only used Python on the odd project.

mutate() has a lot in common with .assign(), but the main difference is that R-tidyverse always uses bare column names without the need for lambda x. In .assign(), the use of bare names for new column names as opposed to quotes seems odd for a Python package. (Not sure if this is a common opinion.)

mutate() in R has a few main behaviors:

Create a new column (which is always added to the end)
Edit an existing column
Can make multiple columns in one mutate() call
Can refer to a newly made column within the same mutate() call
Can redo a newly made column within the same mutate() call (which you can't do in .assign())

Example:

library(tidyverse)

read_csv("https://tinyurl.com/quick-mtcars-csv") %>%
  select(mpg, cyl) %>%
  mutate(mpg_edit = mpg * 2,
         mpg_edit = mpg_edit - 5, # Can edit the newly created "mpg_edit" column
         mpg_edit_plus_cyl = mpg_edit + cyl)

import pandas as pd
import janitor

(pd.read_csv('https://tinyurl.com/quick-mtcars-csv')
 .select_columns(['mpg', 'cyl'])
 .assign(mpg_edit = lambda x: x.mpg * 2,
         # mpg_edit = lambda x: x.mpg_edit - 5 # Can't do this in python
         mpg_edit_plus_cyl = lambda x: x.mpg_edit + x.cyl)
)
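For what it's worth, the "redo a newly made column" behaviour can be approximated by chaining .assign() calls, since each call sees the result of the previous one. A minimal sketch, using a small made-up frame in place of the CSV above:

```python
import pandas as pd

# Stand-in for the mtcars CSV (values are illustrative only).
cars = pd.DataFrame({"mpg": [21.0, 22.5], "cyl": [6, 4]})

result = (
    cars
    .assign(mpg_edit=lambda x: x.mpg * 2)
    # A second .assign() call can overwrite the column created by the first one.
    .assign(mpg_edit=lambda x: x.mpg_edit - 5)
    .assign(mpg_edit_plus_cyl=lambda x: x.mpg_edit + x.cyl)
)

print(result)
```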

In the end I'm not positive if it's better to just use .assign() or if a better API is possible, but it seems like this is the package with the most headway towards trying to make cleaner functions for pandas data frames. As far as pyjanitor is concerned, I'm also not sure if it's necessary to be able to create multiple columns in one mutate(), it seems like calling mutate() multiple times would be fine.

@mtfairbanks thanks for the detail up there! Let me see, I think I have some ideas.

FYI - I know the pandas API well enough to get the job done, but I don't think I would be able to help on the code side unfortunately.

No worries, we can start with modifying the docs. It's lightweight, doesn't affect the API, and you can't break the code :smile:.

In terms of PR content, here's one that I have in mind - to modify the join_apply docstring such it explains how the function is supposed to read. Something like modifying the 1st line in the docstring to make it clearer.

Just that one would be a good first PR, and as you start to feel more comfortable, we can build towards more substantial documentation PRs or start code PRs.

In the end I'm not positive if it's better to just use .assign() or if a better API is possible

The behaviours that you've described for mutate seem to have the building blocks available in pyjanitor. We might be able to do this thing where calling mutate with the right kwargs will dispatch to the appropriate function that pyjanitor uses, and hence give us a way to add the mutate API to pandas.

Background

So I hope I don't take the 'available for hacking' label too seriously, but I have some ideas/a prototype. My idea of the mutate method deviates a bit from the previously mentioned syntax, so I will try to explain why I think that this version would be great.

As mentioned before, .assign() can basically do everything wished for in the original post. The only problem would be the necessity to use .assign() multiple times. Nevertheless, I guess that there is potential for creating a .mutate() which is closer to R's version and might help people migrate, without actually changing this necessity. It might even provide a simpler/maybe cleaner syntax to pythonistas (without the lambda boilerplate).

Proposed syntax

(pd.read_csv('https://tinyurl.com/quick-mtcars-csv')
 .select_columns(['mpg', 'cyl'])
 .mutate(mpg_edit = "mpg * 2 - 5",
         mpg_edit_plus_cyl = "mpg_edit + cyl")
)

Basically, I suggest just assigning a string (containing an expression) to a variable (the name of the new column). All column names inside the string are replaced by dataframe.column_name, and the string is prefixed with the assignment to the new column name and the lambda boilerplate ("column_name = lambda dataframe: ...."). This new string is then evaluated using Python's .eval() (later, I will discuss why I think I can use .eval() here).

Of course, the syntax inside of the string should consider some problems (which can be solved by a pandas.eval() and pandas.query() like syntax):

What if we have column names with special characters or white space? <= To solve this, I added that words wrapped in ` ` (and not inside of substrings) are always interpreted as column names.
What if we want to use a function/variable/method whose name is also a column_name in our df? <= To solve this, I added an @object/func/var notation (and not inside of substrings), which is never interpreted as a column_name.

Now, the syntax can look like this:

df = pd.DataFrame({'a':[1,2,3,4], 'b': [5,6,7,8], 'c O l': ["a","b", "c", "d"]})
(df
   .mutate(c = 'a+b',  # easy syntax still works
           d = '`c O l` + "d"*2', # column_name with white space 
           mean = 'a.mean() + b.mean() + c.mean()', 
           e = 'a.@mean()') # because we assigned a mean column we have to use @
)

In other words, with this implementation we can use normal Python, still work with non-tidy column names, and use objects (e.g. functions) with the same name as a column in our df. Note that this implementation also cannot assign to the same column twice inside one .mutate() call. I would argue that it is OK to use multiple .mutate() calls for such an occasion (but I am open to suggestions). Also, note that I decided against just wrapping pd.eval(), because it seems to be optimized (only) for numeric operations and does not handle string operations like those in the syntax above.
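To give a feel for the rewriting step, here is a minimal sketch of the idea, not the actual prototype. The names rewrite_expression and mutate are hypothetical, the column lookup is written as df['name'] rather than dataframe.column_name, and string literals with escaped quotes are not handled.

```python
import re

import pandas as pd

# One pass over the expression: string literals, `backticked names`, then
# identifiers that may carry an @ prefix to opt out of column lookup.
TOKEN = re.compile(r"""('[^']*'|"[^"]*")|`([^`]+)`|(@?)\b([A-Za-z_]\w*)""")


def rewrite_expression(expression, columns):
    """Turn column references into df[...] lookups (hypothetical helper)."""

    def replace(match):
        literal, quoted, escaped, name = match.groups()
        if literal is not None:
            return literal              # text inside quotes is left alone
        if quoted is not None:
            return f"df[{quoted!r}]"    # `c O l` always means a column
        if escaped:
            return name                 # @mean is never treated as a column
        if name in columns:
            return f"df[{name!r}]"      # bare name that matches a column
        return name                     # anything else stays as written

    return TOKEN.sub(replace, expression)


def mutate(df, **kwargs):
    """Add columns one by one, in keyword order (hypothetical sketch)."""
    for new_name, expression in kwargs.items():
        code = rewrite_expression(expression, set(df.columns))
        # eval() only ever sees strings typed by the person calling mutate().
        value = eval(code, {}, {"df": df})
        df = df.assign(**{new_name: value})
    return df


frame = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8],
                      "c O l": ["a", "b", "c", "d"]})
out = mutate(frame,
             c="a+b",
             d='`c O l` + "d"*2',
             mean="a.mean() + b.mean() + c.mean()",
             e="a.@mean()")
print(out)
```

Because the columns are added one at a time, a later expression can refer to a column created earlier in the same call, which relies on keyword arguments keeping their order.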

Problems?

This implementation only works with ordered **kwargs, but from how I understand it pyjanitor only supports 3.6 and above, which all have them (as far as I know).
Using the evil .eval(): using .eval() is usually not very pythonic and can lead to security problems.
- security: First I want to stress that I am not an expert in this topic and only describe how I see it. From my understanding, .eval() is only a security problem when exposed to third parties, because it allows the injection of malicious code. To be honest, I don't see how this is a problem for a tool aimed at (Data) Scientists etc. who enter the evaluated strings themselves. Still, it might be good to add a user warning for people considering piping other users' input into such a .mutate(), which can be considered a security problem and is not the intended use case.
- not pythonic: Here, I would just refer to Raymond Hettinger - Beyond PEP 8 (https://youtu.be/wf-BqAjZb8M?t=2920). Sometimes .eval() is the best/only option, but I am open for trying other things I might have missed.
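As a small illustration of the security point, the snippet below shows that a string handed to Python's built-in eval() runs as ordinary code. The string here is harmless and only stands in for input that comes from someone else.

```python
import os

# Imagine this string arriving from a web form or a shared config file.
untrusted = "__import__('os').getcwd()"

# eval() executes it like any other Python expression, so it can reach the
# operating system; a hostile string could just as well delete files.
leaked = eval(untrusted)

print(leaked == os.getcwd())
```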

If you agree that such a .mutate() method might be helpful I can adjust my code to the PR rules and then PR it.

PS: If it is really important to be able to use one column name twice inside the same .mutate(), I can change the **kwargs to *args, where the user inputs tuples of ('col_name', 'assignment_string').

The resulting syntax would look like this:

import pandas as pd
import janitor

(pd.read_csv('https://tinyurl.com/quick-mtcars-csv')
 .select_columns(['mpg', 'cyl'])
 .mutate(('mpg_edit', 'mpg * 2'),
         ('mpg_edit' , 'mpg_edit - 5'),  # this would be new
         ('mpg_edit_plus_cyl',  'mpg_edit + cyl'))
)
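A rough sketch of how the tuple version could behave. The helper name mutate_pairs is hypothetical, and for brevity it swaps in pandas' own DataFrame.eval in place of the custom string rewriting, so it only covers expressions that DataFrame.eval understands. The small frame stands in for the CSV.

```python
import pandas as pd


def mutate_pairs(df, *pairs):
    """Apply (column_name, expression) pairs in order (hypothetical helper)."""
    for column_name, expression in pairs:
        # With an assignment string, DataFrame.eval returns a new frame
        # that includes the assigned column.
        df = df.eval(f"{column_name} = {expression}")
    return df


# Stand-in for the mtcars CSV (values are illustrative only).
cars = pd.DataFrame({"mpg": [21.0, 22.5], "cyl": [6, 4]})

edited = mutate_pairs(
    cars,
    ("mpg_edit", "mpg * 2"),
    ("mpg_edit", "mpg_edit - 5"),  # same name twice is fine with tuples
    ("mpg_edit_plus_cyl", "mpg_edit + cyl"),
)
print(edited)
```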

Everything else would be as previously mentioned (including the @ and ` notations). Personally, I prefer the non-tuple version, because the syntax seems clearer/cleaner to me.

@samihamdan The original request came about because, as an R user/developer, I didn't understand why .assign() didn't use a string to name a new column.

For example, this is how I thought .assign() should work:

df.assign('mpg_edit' =  lambda x: x.mpg * 2)

As opposed to the actual syntax of:

df.assign(mpg_edit =  lambda x: x.mpg * 2)

It seemed "unpythonic" to use an unquoted name. As I've been using python more, I now realize this is a result of using **kwargs, much like using ... in R.
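For anyone else coming from R, here is a short sketch of the **kwargs point: a keyword argument is just a string key under the hood, so a quoted name can still be passed by unpacking a dict. The frame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"mpg": [21.0, 22.5]})

# The usual form: the new column name is a bare keyword.
by_keyword = df.assign(mpg_edit=lambda x: x.mpg * 2)

# The same call with a quoted name, passed through dict unpacking.
by_string = df.assign(**{"mpg_edit": lambda x: x.mpg * 2})

# Unpacking also allows names that are not valid Python identifiers.
spaced = df.assign(**{"mpg edit": lambda x: x.mpg * 2})

print(by_keyword.equals(by_string))
```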

I think using .eval(), though pretty slick, could lead to some unnecessary confusion from the user.

My thought to @ericmjl would be to close this issue and let users just use .assign(), and just chalk this up to an R user misunderstanding the internals of python!
