Open data-stepper opened 9 months ago
We have a solution for the lambdas within assign in the pipeline, so mutate is unlikely to get added
cc @MarcoGorelli
Seems interesting, but unless I am misunderstanding, these two things contradict one another:

> We could easily add this functionality by providing a context manager (perhaps `pd.mutate`, to follow R's naming here) which temporarily moves all columns of a dataframe into the caller's locals, allows the caller to create new `pd.Series` inside it, and then (upon the context manager's exit) forms all those new `pd.Series` (or the modified old ones) into a data frame again.

> ```python
> # Set the entire column to one value
> c = 10
> ```
So any constants assigned in scope of the context manager become columns also? Or is it only assignments where the RHS is a `pandas.Series` instance? Your example reads like the former, which I think is less than ideal, because then you can't use intermediate variables within the context manager unless you explicitly call `del` on them to remove them from the scope. Given that explicit use of `del` is pretty rare, and some may consider it unpythonic, I think that's going to hold this back greatly.
I don't think they necessarily contradict each other, because:

- Users could define locals they do not want to be added as columns via `_some_constant` (i.e. by prefixing them with an underscore); this also fits in nicely with the notion of "private" functions in Python.
- We could add keyword arguments to the context manager, e.g. `pd.mutate(df, reset_locals=False, convert_non_series_to_column=True)`, where the latter would allow constants to be assigned as series holding the constant value (as in `.assign`).
- People could either define constants before the context manager or within functions that are called from within the context manager and return the desired `pd.Series`. Remember that the scope the context manager uses does not include locals defined in functions which are called from within that scope. As an example:

```python
import pandas as pd

df: pd.DataFrame = <some df>
some_constant: int = 42

def make_new_col(df) -> pd.Series:
    # This will of course not be assigned to the dataframe
    locally_defined_constant: int = 101
    return df['new_col'] * locally_defined_constant

with pd.mutate(df):
    # Assigning a pd.Series directly adds it to the columns
    new_col = pd.Series(["a", "b", ...])
    constant_series = 42  # This will become a pd.Series having the constant value 42
    other_col = make_new_col(df)
```
And @phofl, I think this context manager is not meant to replace `.assign`, but rather to provide a new, more easily debuggable and cleaner way of creating new columns in dataframes. Further, I think this would fit in nicely with pandas' notion of series and dataframes, and users could more natively modify a dataframe's "scope".
Also, I think a kwarg like `reset_locals` would make sense to users, as they then actually enter the "dataframe's scope", but we might want to set it to `False` by default. If it is set, the previous `locals()` would be removed so that the scope cleanly contains only the dataframe's columns (and `df`, of course).
We would just have to make sure to clean up the newly assigned columns from the `locals()` scope so users don't mix up their scope when using this context manager.
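The clean-up and restore steps described here could be sketched as a pair of helpers operating on the caller's locals mapping. These names (`swap_scope`, `restore_scope`) are hypothetical, and the demo uses a plain dict standing in for a frame's `f_locals`, since writing to real frame locals is only reliable at module scope on CPython before PEP 667 (Python 3.13):

```python
def swap_scope(scope: dict, keep: set) -> dict:
    """Snapshot `scope`, then drop every name not in `keep`,
    so only the dataframe's columns (and `df`) stay visible."""
    cached = dict(scope)          # cache the previous locals
    for name in list(scope):
        if name not in keep:
            del scope[name]       # hide everything else
    return cached

def restore_scope(scope: dict, cached: dict) -> None:
    """Put the caller's original locals back on context-manager exit."""
    scope.clear()
    scope.update(cached)

# Demo with a plain dict standing in for the caller's f_locals:
scope = {"x": 1, "col_a": "series-a", "df": "frame"}
cached = swap_scope(scope, keep={"col_a", "df"})
print(sorted(scope))    # ['col_a', 'df']
restore_scope(scope, cached)
print(sorted(scope))    # ['col_a', 'df', 'x']
```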
Ah, thank you for the clarification. I for one find `.assign` to be verbose at times, particularly when I want to make a new column that depends on another one and would involve doing an `.apply` (nested lambdas, yuck!). My other concern was the magicalness of it, but after reading the implementation details of `.query`'s support for `@` to inject locals, I don't think it's that big of a deal.
I like the proposal more now.
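For reference, the `@` mechanism in `.query` mentioned above is pandas' existing way of injecting the caller's locals into an expression string:

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 5, 10]})
threshold = 4                        # an ordinary local variable
out = df.query("col > @threshold")   # '@' injects `threshold` into the expression
print(out["col"].tolist())           # [5, 10]
```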
Yes, I agree this context manager would be slightly 'hacky', but I think the gains in productivity and code readability are more important. Another great advantage would be debugging.
Imagine a user sets a breakpoint on the first line within the context manager. If we were to clean up the previous `locals()`, that user would get a clear overview, seeing all columns as `pd.Series` in the debugger's variable explorer.
This makes column assignment much more obvious and explicit to follow in detail than (1) chained `.assign` statements (which are basically impossible to debug) or (2) any function that repeatedly calls `df['new_col'] = ...`, which is easier to debug, but even then the user only sees the created columns by inspecting `df`'s columns explicitly rather than seeing them directly in the debugger's variable explorer.
Hey, so is there any chance this feature will be added to pandas?
I think it would be a great benefit for the package as a whole, as it makes code more readable and easier to understand. It would also make pandas a more attractive choice compared to R for feature engineering, which usually involves many (chained) assigns.
Would be great to get some feedback!
Feature Type
[X] Adding new functionality to pandas
[ ] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas
Problem Description
I wish feature engineering (i.e. creating new columns from old ones) could be more efficient and convenient in pandas. Mainly, common ways of adding features to dataframes in pandas include:

- chained `.assign` statements, which are hard to debug and contain many hard-to-read lambda expressions, or
- calling `df['new_column'] = ...` repeatedly in some `add_features` function, which is better for debugging purposes but still hard to read and inconvenient, as the user always has to type quotes and the word `df`.

In R's `mutate` function, the series are accessible directly from the scope, which makes code much more readable (debugging in R is something else to discuss).

Feature Description
We could easily add this functionality by providing a context manager (perhaps `pd.mutate`, to follow R's naming here) which temporarily moves all columns of a dataframe into the caller's locals, allows the caller to create new `pd.Series` inside it, and then (upon the context manager's exit) forms all those new `pd.Series` (or the modified old ones) into a data frame again. This makes feature engineering much more convenient, efficient and likely also more debuggable than using chained `.assign` statements (in the debugger, one could directly access all the `pd.Series` in that scope). A minimal example implementation could look like the following:
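As a rough, hypothetical sketch of what such a context manager could look like (`mutate` is not a real pandas API here, and writing to a frame's locals only persists reliably at module scope on CPython before PEP 667 / Python 3.13):

```python
import sys
import pandas as pd

class mutate:
    """Hypothetical sketch: expose a DataFrame's columns as variables in the
    caller's scope and collect any pd.Series left behind into columns on exit."""

    def __init__(self, df: pd.DataFrame):
        self.df = df

    def __enter__(self):
        self.frame = sys._getframe(1)           # the caller's frame
        self.before = set(self.frame.f_locals)  # names that already existed
        for name in self.df.columns:            # inject each column as a variable
            self.frame.f_locals[name] = self.df[name]
        return self.df

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            for name, value in list(self.frame.f_locals.items()):
                is_new = name not in self.before
                if isinstance(value, pd.Series) and (is_new or name in self.df.columns):
                    self.df[name] = value          # gather new/updated series
                if is_new:
                    del self.frame.f_locals[name]  # clean up the caller's scope
        del self.frame
        return False


# Demo (module scope, where frame-locals writes persist):
df = pd.DataFrame({"a": [1, 2, 3]})
with mutate(df):
    b = a * 2        # 'a' was injected from df; 'b' becomes a new column
print(df["b"].tolist())   # [2, 4, 6] when run at module scope
```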
The drawback of this feature is that we are fiddling with the caller's locals, which is not the most elegant. However, I believe that feature engineering like this is much easier to debug and makes the code more readable (than using chained `.assign`s or repeatedly calling `df['new_feature'] = 2 * df['old_feature'] ** 2`). Therefore I think this feature would make life easier and pandas more useful (and users faster) in data science tasks.
Alternative Solutions
One might want to handle the locals better here to make the usage of this feature less error-prone. Perhaps one would want to cache the previous locals and then have only the dataframe's columns as the locals in the caller's scope. This would make debugging even cleaner, because if a user sets a breakpoint inside such a `with pd.mutate` block, that user sees all the columns clearly in the scope's locals instead of having to inspect the dataframe's column values in the debugger.

Additional Context
No response