pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: pandas mutate, add R's mutate functionality to enable users to easily create new columns in data frames #56499

Open data-stepper opened 9 months ago

data-stepper commented 9 months ago

Problem Description

I wish feature engineering (i.e. creating new columns from old ones) were more efficient and convenient in pandas. The most common ways of adding features to dataframes are

  1. chained .assign statements, which are hard to debug and tend to accumulate hard-to-read lambda expressions, or
  2. calling df['new_column'] = ... repeatedly in some add_features function, which is better for debugging but still noisy to read and inconvenient, since the user always has to type the quotes and the name df.
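For comparison, the two styles described above look roughly like this (column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"col_a": [1, 2, 3], "col_b": [10, 20, 30]})

# Style 1: chained .assign with lambdas
df1 = df.assign(
    col_c=lambda d: d["col_a"] * 20,
    col_b_cumsum=lambda d: d["col_b"].cumsum(),
)

# Style 2: repeated item assignment inside a helper function
def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    frame = frame.copy()
    frame["col_c"] = frame["col_a"] * 20
    frame["col_b_cumsum"] = frame["col_b"].cumsum()
    return frame

df2 = add_features(df)
```

Both produce the same result; the difference is purely ergonomic.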

In R's mutate function, the columns are accessible directly in the scope, which makes the code much more readable (debugging in R is another discussion).

Feature Description

We could add this functionality by providing a context manager (perhaps pd.mutate, following R's naming) that temporarily exposes all columns of a dataframe as locals in the caller's scope, lets the caller create new pd.Series there, and then, upon the context manager's exit, collects those new (or modified) pd.Series back into the data frame.

This makes feature engineering much more convenient and efficient, and likely also more debuggable than chained .assign statements (in the debugger, one can directly inspect all the pd.Series in that scope). A minimal example implementation could look like the following:


# %%

import pandas as pd

df = pd.DataFrame(
    {
        "col_a": [1, 2, 3, 4, 5],
        "col_b": [10, 20, 30, 40, 50],
    }
)

df

# %%

import inspect

class mutate_df:
    def __init__(self, df: pd.DataFrame):
        self.df = df

    def __enter__(self):
        # NOTE: writing to frame.f_locals is only reliable at module
        # scope in CPython before PEP 667 (Python 3.13).
        frame = inspect.currentframe().f_back
        self.scope_keys = self._extract_locals_keys_from_frame(frame)

        for col in self.df.columns:
            if col in self.scope_keys:
                # Maybe give a warning here?
                pass

            frame.f_locals[col] = self.df[col]

    def _extract_locals_keys_from_frame(self, frame):
        s = {
            str(key)
            for key in frame.f_locals.keys()
            if not key.startswith("_")
        }
        return s

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type:
            # Returning a falsy value lets the original exception propagate
            return False

        frame = inspect.currentframe().f_back
        current_keys = self._extract_locals_keys_from_frame(frame)

        added_keys = current_keys - self.scope_keys
        added_keys = set.union(added_keys, set(self.df.columns))

        for key in added_keys:
            try:
                self.df[key] = frame.f_locals[key]
            except KeyError:
                # The name was deleted inside the block; skip it
                continue

        return True

with mutate_df(df):
    # All of df's columns are available in the scope
    # as pd.Series objects

    # Set the entire column to one value
    c = 10
    # Use columns defined previously
    col_c = col_a * 20

    # Create new columns
    rolling_mean = col_b.rolling(2).mean()
    col_b_cumsum = col_b.cumsum()

df

# %%

The drawback of this feature is that we are fiddling with the caller's locals, which is not the most elegant approach. However, I believe that feature engineering like this is much easier to debug and makes the code more readable than chained .assigns or repeated df['new_feature'] = 2 * df['old_feature'] ** 2 calls.

Therefore I think this feature would make life easier and pandas more useful (and users faster) in data science tasks.

Alternative Solutions

One might want to handle the locals more carefully here to make this feature less error-prone. Perhaps one would want to cache the previous locals and then expose only the dataframe's columns as locals in the caller's scope.

This would make debugging even cleaner: if a user sets a breakpoint inside such a with pd.mutate block, they would clearly see all the columns in the scope's locals instead of having to inspect the dataframe's column values in the debugger.

Additional Context

No response

phofl commented 9 months ago

We have a solution for the lambdas within assign in the pipeline, so mutate is unlikely to get added.

cc @MarcoGorelli

Delengowski commented 9 months ago

Seems interesting, but unless I am misunderstanding, these two things contradict one another:

We could easily add this functionality by providing a context manager (perhaps pd.mutate, to follow R's naming here) which temporarily moves all columns of a dataframe into the caller's locals, allows the caller to create new pd.Series while calling and then (upon the context manager's exit) all those new pd.Series (or the modified old ones) could be formed to a data frame again.

    # Set the entire column to one value
    c = 10

So any constants assigned in the scope of the context manager become columns too? Or only assignments where the RHS is a pandas.Series instance? Your example reads like the former, which I think is less than ideal, because then we can't use intermediate variables within the context manager unless we explicitly call del on them to remove them from the scope. Given that explicit use of del is pretty rare, and some may consider it unpythonic, I think that's going to hold this back greatly.
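For reference, pandas' existing `.assign` already broadcasts a scalar to a whole column, which is presumably the behaviour the example's `c = 10` line is meant to mimic (small self-contained example, not from the thread):

```python
import pandas as pd

df = pd.DataFrame({"col_a": [1, 2, 3]})

# A scalar passed to .assign is broadcast to every row of the new column
out = df.assign(c=10)

print(out["c"].tolist())  # [10, 10, 10]
```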

data-stepper commented 9 months ago

I don't think they necessarily contradict each other, because:

  1. Users could define locals they do not want to be added as columns via _some_constant (i.e. by prefixing it with an underscore), this also fits in nicely with the notion of "private" functions in python.
  2. We could add a keyword argument to the context manager, e.g. pd.mutate(df, reset_locals=False, convert_non_series_to_column=True), where the latter would allow constants to be broadcast to series holding that constant value (as in .assign).
  3. People could either define constants before the context manager or within functions that could be called from within the context manager and return the desired pd.Series. Remember that the scope the context manager uses does not include locals defined in functions which are called from within that scope. As an example:
import pandas as pd

df: pd.DataFrame = <some df>
some_constant: int = 42

def make_new_col(df) -> pd.Series:
    # This will of course not be assigned to the dataframe
    locally_defined_constant: int = 101

    return df['new_col'] * locally_defined_constant

with pd.mutate(df):
    # Assigning a pd.Series directly adds it to the columns
    new_col = pd.Series(["a", "b", ...])

    constant_series = 42 # This will become a pd.Series having the constant value 42

    other_col = make_new_col(df)

And @phofl , I think this context manager is not meant to replace .assign but rather provide a new, more easily debuggable and clean way of creating new columns in dataframes.

Further, I think this would also fit nicely in with pandas' notion of series and dataframes and users could more natively modify a dataframe's "scope". Also, I think a kwarg like the reset_locals would make sense to users as they then actually enter the "dataframes scope", but we might want to set it to False by default. If this is set, previous locals() would be removed to cleanly have only the dataframe's columns, and df of course, in the locals().

We would just have to make sure to clean up the newly assigned columns from the locals() scope so users don't mix up their scope when using this context manager.

Delengowski commented 9 months ago

I don't think they necessarily contradict each other, because: […]

Ah, thank you for the clarification. I for one find .assign to be verbose at times, particularly when I want to make a new column that depends on another one, which would involve doing an .apply (nested lambda, yuck!). My other concern was the magicalness of it, but after reading the implementation details of .query's support for @ to inject locals, I don't think it's that big of a deal.
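For context, the `@` injection in `.query` mentioned above looks like this (a small self-contained example, not from the thread):

```python
import pandas as pd

df = pd.DataFrame({"col_a": [1, 2, 3, 4, 5]})

threshold = 3
# `@threshold` pulls the local variable into the query expression
filtered = df.query("col_a > @threshold")

print(filtered["col_a"].tolist())  # [4, 5]
```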

I like the proposal more now.

data-stepper commented 9 months ago

Yes, I agree this context manager would be slightly 'hacky', but I think it would boost productivity and improve code readability, which matters more in my view. Another great advantage would be debugging.

Imagine a user sets a breakpoint on the first line within the context manager. If we were to clean up the previous locals(), that user would get a clear overview, seeing all columns as pd.Series in the debugger's variable explorer.

This makes column assignment much more obvious and explicit than (1) chained .assign statements (which are basically impossible to debug) or (2) a function that repeatedly calls df['new_col'] = ..., which is easier to debug, but even then the user only sees created columns by inspecting the df's columns explicitly rather than seeing them directly in the debugger's variable explorer.

data-stepper commented 3 months ago

Hey, so is there any chance this feature will be added to pandas?

I think it would be a great benefit for the package as a whole, as it makes code more readable and easier to understand. It would also make pandas a more attractive choice compared to R for feature engineering, which usually involves many (chained) assigns.

Would be great to get some feedback!