machow / siuba

Python library for using dplyr like syntax with pandas and SQL
https://siuba.org
MIT License

rename should be able to take and apply a function to each column name #297

Open machow opened 3 years ago

machow commented 3 years ago

E.g.

from siuba.data import mtcars
from siuba import _, rename
mtcars >> rename(_.upper())

# equivalent to
mtcars.rename(columns = lambda _: _.upper())

The one challenge is deciding whether it should use built-in string methods, or pandas DataFrame.columns.str accessor methods.

Thinking about the whole tidyselect situation now, it could be nice to have select and rename use built in string methods. This would nicely correspond to the lambda above AND remove the dependency on pandas from siuba.sql.

dpavlic commented 3 years ago

Quick question on this. I was trying to get around this by doing something like so:

def clean_cols(df):
    return dict(
        zip(df.columns.str.upper().str.strip().str.replace(" ", "_"), df.columns)
    )

df >> rename(**clean_cols(_))

This, however, will not work because what's being passed is a Symbol that I have no idea how to actually evaluate. Is there a way to make this work or is this approach doomed to failure?

My very naive thinking was that the pipe works somewhat similarly to magrittr in that y(x, x.z) could be represented via x >> y(_.z); I was then thinking ok I actually have to evaluate the symbol in the function call, but it doesn't look like there's anything there to evaluate; is rename not set up to take "_" as an argument, or am I 1000% out to lunch in my thinking start to finish?

machow commented 3 years ago

is rename not set up to take "_" as an argument, or am I 1000% out to lunch in my thinking start to finish?

Hey @dpavlic! Thanks for resurfacing this--rename currently uses a very simple approach (with 4 lines of code!), and doesn't accept functions (i.e. it only allows rename(a = _.b)).

Another issue is that Python evaluates arguments eagerly, so calling clean_cols on _:

from siuba import _

def clean_cols(df):
    return dict(
        zip(df.columns.str.upper().str.strip().str.replace(" ", "_"), df.columns)
    )

clean_cols(_)

Fails, since the _ can't be iterated over.

# fails, for same reason zip(..., _) fails
for ii in _:
    pass

There is a breakdown of how _ works here in the docs, including how to build custom functions, similar to n() in dplyr.
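
To make the eager-evaluation point concrete, here is a minimal sketch of a deferred-expression placeholder. Deferred and replay are hypothetical names used only for illustration; this is not siuba's Symbol implementation, just a toy showing the general idea that attribute access and calls are recorded first and only run later, once a real value (here, a column name) is available.

```python
class Deferred:
    """Toy placeholder that records attribute access and calls.

    Hypothetical illustration only, not siuba's Symbol.
    """

    def __init__(self, steps=()):
        self._steps = tuple(steps)

    def __getattr__(self, name):
        # Only reached when normal lookup fails; record the access instead of evaluating it.
        if name.startswith("__"):
            raise AttributeError(name)
        return Deferred(self._steps + (("attr", name),))

    def __call__(self, *args, **kwargs):
        # Record the call and its arguments for later.
        return Deferred(self._steps + (("call", args, kwargs),))


def replay(expr, value):
    """Run the recorded steps against a real value."""
    for step in expr._steps:
        if step[0] == "attr":
            value = getattr(value, step[1])
        else:
            _, args, kwargs = step
            value = value(*args, **kwargs)
    return value


X = Deferred()
expr = X.upper().strip().replace(" ", "_")  # nothing is evaluated here

# Evaluation happens only when each column name is supplied.
renamed = [replay(expr, name) for name in [" miles per gallon", "cyl "]]
```

Because the placeholder defines no iteration protocol, iter(X) raises a TypeError, which is the same kind of failure zip(..., _) runs into above.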

What should rename functionality be?

I think it could be expanded pretty easily to include passing a function, but have been chewing on whether selection verbs should operate using built-in string methods, or the df.columns.str accessor methods.

Moreover, it looks like there's dplyr's rename_with() and select(where()) syntax to consider...

library(dplyr)

# rename_with takes a function that operates on a character vector
mtcars %>% rename_with(toupper)

# where takes function that operates on data
mtcars %>% select(where(is.numeric))

How to rename all right now?

Rename is a tough one, because it doesn't do much. So at the moment siuba is limited to this unsatisfying approach :/

import pandas as pd
from siuba import rename, pipe, _
from siuba.data import mtcars

mtcars >> rename(**{k.upper().strip().replace(" ", "_"): k  for k in mtcars.columns})

# or this kind of crazy approach
# note that pandas rename method must receive a callable or a dictionary
# so we can't use the df.columns.str approach
mtcars >> pipe(_.rename(columns = lambda s: s.upper().strip().replace(" ", "_")))

# proposed siuba syntax
# need to decide what the transform function uses under the hood, and what should be
# passed if transform function is a lambda (or other callable).
mtcars >> rename(  
          _.upper().strip().replace(" ", "_"),      # transform function
          _.startswith("m")                        # tidyselection
)
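
One way to read the proposed call is "apply the transform to every column name that the selection matches". A minimal sketch under that assumption, working on a plain list of names and using built-in string methods; rename_names is a hypothetical helper, not part of siuba:

```python
def rename_names(names, transform, selector=None):
    """Return {old_name: new_name} for the names matched by selector.

    Hypothetical helper for illustration; not a siuba function.
    """
    if selector is None:
        selector = lambda name: True  # default: rename every column
    return {name: transform(name) for name in names if selector(name)}


columns = ["mpg", "cyl", "miles per gallon ", "hp"]

mapping = rename_names(
    columns,
    lambda s: s.upper().strip().replace(" ", "_"),  # transform function
    lambda s: s.startswith("m"),                    # selection predicate
)
```

The resulting dict maps old names to new names, which is the orientation pandas' DataFrame.rename(columns=...) accepts. siuba's current rename keywords go the other way (new = old).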

Implementing functionality like rename_with

If you have thoughts on what might be comparable in siuba to rename_with and select(where()), would love to hear them! I've been a little careful with fleshing this out, since dplyr has done a lot recently in this area, but it seems like its work is pretty stable now :).

dpavlic commented 3 years ago

Thanks for the very in-depth response. Now that you explain it, of course you can't escape eager evaluation in Python. It ultimately comes back to the same issue that plagues the entire enterprise: it can't be as ergonomic as what's possible in R :/

My completely uneducated $0.02 on future implementation is that it seems to me that if 'cloning' dplyr in R as closely as possible is your aim, then you should have a rename which can be pretty much as simple as it is now, and rename_with should function on built-in string methods. Technically, rename() in dplyr now accepts tidy selects but to be honest, I find this pattern more confusing than useful:

mtcars %>% dplyr::rename("NAME" = everything())

# resulting columns
# NAME1 NAME2 NAME3 NAME4 NAME5 NAME6 NAME7 NAME8 NAME9 NAME10 NAME11
# Looks more like a footgun than something anyone wants?

Going back to doing things with rename_with, given a random df, what you propose in your post seems highly comparable:

# R
df %>%
  rename_with(~ toupper(.x) %>% trimws() %>% str_replace_all(" ", "_"))

# Siuba
df >> rename_with(lambda x: x.upper().strip().replace(" ", "_"))

# And indeed this transformation looks lovely!
df >> rename_with(_.upper().strip().replace(" ", "_"))

Again, please feel free to ignore and thank you for your work.

machow commented 3 years ago

Ah, thanks for pointing out how tidyselection works in rename! Agreed it's likely not important!

I think I'm 99% on board. The one challenge is that df.columns.str has a contains method that's super useful, and siuba can't do...

from siuba import _

# uses __contains__ method, which must return a boolean
"a" in _

I think pandas str accessor api aims to be faithful to existing built in string methods though, so there may be some wiggle room there...
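
The constraint comes from Python itself: whatever __contains__ returns, the in operator converts the result to a bool. A small sketch with a hypothetical Recorder class (not siuba code) shows a deferred result being lost through the operator but surviving through an ordinary method:

```python
class Recorder:
    """Hypothetical lazy object that tries to return a deferred expression from `in`."""

    def __contains__(self, item):
        # Attempt to hand back a description of the operation instead of a boolean.
        return ("contains", item)

    def contains(self, item):
        # A regular method (or function) can return whatever it likes.
        return ("contains", item)


r = Recorder()

via_operator = "a" in r       # the non-empty tuple is truthy, so this collapses to True
via_method = r.contains("a")  # the deferred description survives
```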

In any event, I appreciate the discussion--it's a small but haunting issue, so if you have any other thoughts, would love to hear them!

(Note that nothing would stop siuba from allowing people to import a contains function! Most dplyr vector functions live in siuba.dply.vector, but it could be weird to have mostly methods, and then just this one function 🤔)
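
As a rough sketch of what mixing one function with built-in string methods could look like, here is a standalone contains applied to plain column names. The function is hypothetical, not an existing siuba import, and it assumes regex matching because pandas' str.contains treats its pattern as a regex by default:

```python
import re


def contains(name, pattern):
    """Hypothetical helper: True if the regex pattern matches anywhere in name."""
    return re.search(pattern, name) is not None


columns = ["mpg", "cyl", "disp", "hp", "carb"]

# Mix a function with built-in string methods when picking column names.
selected = [c for c in columns if contains(c, "b") or c.startswith("m")]
```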

dpavlic commented 3 years ago

Personally, I'm not remotely bothered by having most things be string methods and then this one thing be a function, especially since there may be more string convenience functions you want to add in the future.

machow commented 3 years ago

That's a good point, and maybe related to that, once I merge #308, siuba will have an ops module that is comprised of functions for each pandas series method. So that'd open up room for something like..

# or from siuba.ops.str import contains
from siuba.ops import str
from siuba import *

# equivalent to df.some_col.str.contains("b")
str.contains(df.some_col, "b")

# could be implemented
df >> select(str.contains(_, "b"))