mwaskom / seaborn

Statistical data visualization in Python
https://seaborn.pydata.org
BSD 3-Clause "New" or "Revised" License

RFC: variable expressions in Plot #3107

Open mwaskom opened 1 year ago

mwaskom commented 1 year ago

I would like to add functionality to assign variables using an expression that has the source data as a namespace. This will provide functionality similar to R's nonstandard evaluation. Because Python does not have this concept, the space of possible approaches is less magical than what you can get in R (although it also avoids all of the complications that nonstandard evaluation introduces).

There are a few options here, and I'm undecided as to which would be best.

As a motivating example, say we want to use the tip rate (tip / total_bill) in the tips dataset.

1. Function that accepts a dataframe and returns a series

so.Plot(tips, x=lambda d: d["tip"] / d["total_bill"])

(or in some cases)

so.Plot(tips, x=lambda d: d.tip / d.total_bill)

(+) Explicit and relatively easy to explain
(+) Can pass a closure over other objects in the outer scope
(+) Doesn't assume that data is a pandas.DataFrame
(–) Repetitive and a little clumsy
(–) Hard to get a nicely-formatted name (i.e. for axis labels)
(–) Would be impossible to serialize

2. Custom object that wraps an expression passed to DataFrame.eval

so.Plot(tips, x=so.Expr("tip / total_bill"))

(+) Less repetitive (data is implicit)
(+) Can get a nice name
(+) Could be serialized
(–) Introduces a new type of seaborn object that's a little hard to explain
(–) Somewhat verbose
(–) Programming in strings means linters won't work

3. Lambda that returns an expression passed to DataFrame.eval

so.Plot(tips, x=lambda: "tip / total_bill")

(+) Least verbose
(+) Can get a nice name
(+) Could support serialization with some extra internal handling
(–) Abuses the purpose of lambdas and may be confusing
(–) Programming in strings means linters won't work
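
For concreteness, here is a minimal sketch of how the three forms might be resolved against a dataframe. It uses only pandas; `Expr` and `resolve` are hypothetical names chosen for illustration rather than seaborn API, and the small dataframe is a made-up stand-in for tips.

```python
import inspect

import pandas as pd


class Expr(str):
    """Hypothetical marker (option 2): a string meant for DataFrame.eval."""


def resolve(data, spec):
    """Hypothetical resolver: return a Series for each style of variable spec."""
    if isinstance(spec, Expr):
        # Option 2: evaluate the wrapped string and reuse it as the name.
        return data.eval(str(spec)).rename(str(spec))
    if callable(spec):
        if not inspect.signature(spec).parameters:
            # Option 3: a zero-argument lambda that returns an expression string.
            expr = spec()
            return data.eval(expr).rename(expr)
        # Option 1: a function of the dataframe; the name is whatever pandas assigns.
        return spec(data)
    # Otherwise, treat the spec as an ordinary column name.
    return data[spec]


# Made-up values standing in for the tips dataset.
tips = pd.DataFrame({"total_bill": [10.0, 20.0, 40.0], "tip": [1.0, 4.0, 6.0]})

by_function = resolve(tips, lambda d: d["tip"] / d["total_bill"])
by_expr = resolve(tips, Expr("tip / total_bill"))
by_lambda_str = resolve(tips, lambda: "tip / total_bill")

print(by_function.name, "|", by_expr.name, "|", by_lambda_str.name)
```

All three calls compute the same values; the difference is in the label. The string-based forms can reuse the expression text as a name, while the function form has only the name pandas gives the result.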

mwaskom commented 1 year ago

xref #3053, as this may also bear on potential solutions for that use case

BrianLandry commented 1 year ago

Another possible approach is to use lazy evaluation like siuba does.

That would result in code that looks like this:

so.Plot(tips, x=_.tip / _.total_bill)

It's basically the lambda function approach, but with some magic sprinkled in to make it less repetitive and clumsy.

In general, I'm not a fan of the string-based approaches due to the lack of linting, although I would prefer 2 over 3. I feel like an explicit new concept is better than the implicit new concept that the lambda abuse introduces.

mwaskom commented 1 year ago

Eh, I don’t think that underscore trick makes sense here. You need to explicitly import it (and not overwrite it), and while that maybe makes sense if you’re going to build a whole framework around it, it’s not a good fit for an occasional feature like this one.

philsheard commented 1 year ago

From a usability perspective I'd favour either 1 or 3. But 1 is verbose enough that I'm more likely to do that transform directly on the DF and then pass it to the plotting func. So I'd say 3 offers something uniquely appealing in its brevity and 'magic'.

mwaskom commented 1 year ago

Also my inclination on the linting question is that the large majority of seaborn usage happens interactively so it's less of an issue — if you're going to get a syntax error, you'll get it immediately instead of having to "wait till runtime".

jcmkk3 commented 1 year ago

This is a tricky one. Dataframe APIs run into the same quandary about how to do this. I was hoping that the dataframe-api would come up with a standardized column expression API. Maybe it will still happen, but I haven't seen any discussion around it yet.

Many of the newer dataframe APIs (the ones that don't mimic pandas), have column expressions that can be accepted almost anywhere like polars and ibis. These feel ideal to me for use cases like with seaborn, but probably don't really make sense without a standard.

My choice of those given would be the first one. It is already pretty common in fluent-style pandas usage and it could work with other supported dataframe libraries like you said.

One option as an alternative/addition to the 2nd option would be to allow the columns to be listed out and then the columns could be provided as arguments to the function in the same order.

so.Plot(tips, x=so.Expr("tip / total_bill"))
so.Plot(tips, x=so.Expr(["tip", "total_bill"], lambda a, b: a / b))

mwaskom commented 1 year ago

I'd also considered

so.Plot(tips, x=lambda tip, total_bill: tip / total_bill)

Where inspect.signature could be used to identify the columns (so a less-verbose version of this new suggestion).

But when writing up this issue, I was having trouble articulating the case for this over option 1.
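
A minimal sketch of the signature-based idea, assuming that parameter names map directly to column names. `call_with_columns` is a hypothetical helper rather than seaborn API, and the dataframe is a made-up stand-in for tips.

```python
import inspect

import pandas as pd


def call_with_columns(data, func):
    """Hypothetical helper: call func with the columns its parameters name."""
    # Parameter names are read in declaration order and used as column keys.
    names = list(inspect.signature(func).parameters)
    return func(*(data[name] for name in names))


# Made-up values standing in for the tips dataset.
tips = pd.DataFrame({"total_bill": [10.0, 20.0, 40.0], "tip": [1.0, 4.0, 6.0]})

rate = call_with_columns(tips, lambda tip, total_bill: tip / total_bill)
print(rate.tolist())
```

In this sketch, a parameter that does not match a column raises a KeyError, and columns whose names are not valid Python identifiers cannot be requested at all.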

jcmkk3 commented 1 year ago

But when writing up this issue, I was having trouble articulating the case for this over option 1.

Yeah. Being able to pass in the columns as arguments to the lambda is mostly helpful if you're able to use short variable names to write what feels like more mathematical formulas. It is especially useful if you're reusing the same variable multiple times in the formula. It could always be something that could be a helper to create a function compatible with the 1st option, anyway.

An example of where something like that would come in handy is the skew calculation in the arquero example below. Mind you, that is just destructuring syntax in JavaScript and isn't something that was created specifically for arquero.

// Reshape (fold) the data to a two column layout: city, sun.
dt.fold(aq.all(), { as: ['city', 'sun'] })
  .groupby('city')
  .rollup({
    min:  d => op.min(d.sun), // functional form of op.min('sun')
    max:  d => op.max(d.sun),
    avg:  d => op.average(d.sun),
    med:  d => op.median(d.sun),
    // functional forms permit flexible table expressions
    skew: ({sun: s}) => (op.mean(s) - op.median(s)) / op.stdev(s) || 0
  })
  .objects()

jcmkk3 commented 1 year ago

There may be other ideas in some other dataframe libraries, but I think that they've mostly all been covered here.

FirefoxMetzger commented 1 year ago

I like approach 1 best; it feels very pandas-like.

I do a lot of my transformations via df.assign(foo=lambda df: <expression>), and approach 1 feels like a natural extension of that workflow. Further, some seaborn objects (like so.Est) already take a lambda df: <expression> as input, so I think it would be consistent from an API perspective, too.

It also allows us to do logic on top of things that seaborn is computing like:

(
    so.Plot(data, y="category", x="awesomeness", text="awesomeness")
    .add(so.Bar(), so.Agg())
    # assuming df contains the values transformed by so.Agg
    .add(so.Text(), so.Agg(), halign=lambda df: "right" if df.awesomeness < threshold else "left")
)

With the above, I can change the parameters of so.Agg() (say median over mean), and the alignment will update accordingly. I can't do this a-priori on the dataframe, because I don't know what the final value will be. I could precompute the value of so.Agg() in the dataframe and then apply the alignment logic on that level, but that kind of defeats the point of having so.Agg(), since I wouldn't want to keep changing code in two places.

But 1 is verbose enough that I'm more likely to do that transform directly on the DF and then pass it to the plotting func

I see this differently actually. I'd much prefer plot-specific transformations to live close to the plot; especially when adding multiple objects in the same figure. I can, of course, do all that computation in so.Plot((df.assign(...)), ...), but it's cleaner to have it right next to the object that consumes it.


Approach 3 is also interesting; however, I don't quite understand why we need the lambda: here. I doubt that any serious use-case uses tip / total_bill as an actual column name, so I would assume that something like the following would do the right thing in 99.8% of all scenarios: df[name] if name in df.columns else df.eval(name).

Losing the lambda: would make this approach complementary to approach 1, and we could have both: "<expression>" for simple transformations on the dataframe and lambda df: <expression> for more complicated operations (potentially involving variables in the current scope).
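
A minimal sketch of that fallback, using only pandas. `column_or_eval` is a hypothetical helper name, and the dataframe is a made-up stand-in for tips.

```python
import pandas as pd


def column_or_eval(df, name):
    """Hypothetical helper: prefer an existing column, else evaluate the string."""
    return df[name] if name in df.columns else df.eval(name)


# Made-up values standing in for the tips dataset.
tips = pd.DataFrame({"total_bill": [10.0, 20.0], "tip": [1.0, 4.0]})
computed = column_or_eval(tips, "tip / total_bill")

# A column whose name happens to look like an expression wins over evaluation.
odd = tips.assign(**{"tip / total_bill": [-1.0, -1.0]})
looked_up = column_or_eval(odd, "tip / total_bill")

print(computed.tolist(), looked_up.tolist())
```

The second call shows the rare ambiguous case: when a column is literally named like an expression, the lookup takes precedence and nothing is evaluated.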

mwaskom commented 1 year ago

With the above, I can change the parameters of so.Agg() (say median over mean), and the alignment will update accordingly. I can't do this a-priori on the dataframe, because I don't know what the final value will be.

This probably can't work the way that you're hoping because it's ambiguous as to when you're going to get the column out of the dataframe. What tells seaborn that you want the column after the aggregation and not before it? And if you have multiple transforms, it gets more complicated. I have some ideas here, but it's tricky, and I don't think that any of option 1/2/3 necessarily solve it better.

I'd much prefer plot-specific transformations to live close to the plot; especially when adding multiple objects in the same figure.

My recommended approach here is to pipeline

(
    df
    .assign(...)
    .pipe(so.Plot, ...)
    .add(...)
)

so I would assume that something like the following would do the right thing in 99.8% of all scenarios: df[name] if name in df.columns else df.eval(name).

Perhaps, although the error handling gets more complicated, I think. It also feels a little dangerous since DataFrame.eval is just calling eval behind the scenes and I don't think it has any way of avoiding the risks of arbitrary code execution. My guess is that you want it to be a little bit more obvious to people when a string is going to get evaluated as code.

FirefoxMetzger commented 1 year ago

This probably can't work the way that you're hoping because it's ambiguous as to when you're going to get the column out of the dataframe. What tells seaborn that you want the column after the aggregation and not before it? And if you have multiple transforms, it gets more complicated. I have some ideas here, but it's tricky, and I don't think that any of option 1/2/3 necessarily solve it better.

I'm not sure I follow; do you mean that it is ambiguous when to apply the lambda as in what should happen if I define something in a .Plot vs a .add?

(
    so.Plot(
        data,
        y="category",
        x="awesomeness",
        text="awesomeness",
        halign=lambda df: "right" if df.awesomeness < threshold else "left"
    )
    .add(so.Bar(), so.Agg())
    # assuming df contains the values transformed by so.Agg
    .add(so.Text(), so.Agg())
)
# vs.
(
    so.Plot(data, y="category", x="awesomeness", text="awesomeness")
    .add(so.Bar(), so.Agg())
    # assuming df contains the values transformed by so.Agg
    .add(so.Text(), so.Agg(), halign=lambda df: "right" if df.awesomeness < threshold else "left")
)

In this case, I would have expected that setting halign in so.Plot uses data and that setting halign in Plot.add uses the result of so.Agg() (and any other transformations that are applied).

This could become strange, if I can use a kwarg set in Plot.add to modify a transformation, but I don't think this is possible right now? Also, my mental model "resolves" seaborn objects left-to-right, but considering that the type (eg. so.Bar()) comes first, maybe we should think of this as right-to-left, in which case it becomes strange?

My recommended approach here is to pipeline [code]

Yes, this is very similar to what I am doing now:

(
    so.Plot(df.assign(...), ...)
    .add(...)
)

(I do prefer your way of writing it though 😎)

Doing so works, so being able to pull the computation into .add is more of a nice-to-have than an absolute must. In my mind, it does reduce cognitive load because it makes data sources more explicit. It's the same as with something like

(
    so.Plot(df, x="time")
    .add(so.Line(), y="factor1")
    .add(so.Line(), y="factor2")
)
# vs. the implicit
(
    so.Plot(df, x="time", y="factor2")
    .add(so.Line(), y="factor1")
    .add(so.Line())
)

It also feels a little dangerous since DataFrame.eval is just calling eval behind the scenes and I don't think it has any way of avoiding the risks of arbitrary code execution.

That is very true, though - unless I am mistaken - you would have the same problem in approach 3 when you use x=lambda: "tip / total_bill". If we can resolve it there, then the same logic could apply to the non-lambda scenario.

DataFrame.query has some nice logic implemented. I don't know if it calls eval internally as well, but it could be a source of inspiration on how to do this in a good way?

mwaskom commented 1 year ago

In this case, I would have expected that setting halign in so.Plot uses data and that setting halign in Plot.add uses the result of so.Agg() (and any other transformations that are applied).

That's definitely not the way things currently work; layer-specific variables are resolved prior to any transformations (and define groups for these transformations).

Also, my mental model "resolves" seaborn objects left-to-right

Yeah, that's reasonable but not quite correct: the transforms are applied left-to-right, but the entire signature of Plot.add() isn't. It's not really possible since it uses *transforms and **variables.

That is very true, though - unless I am mistaken - you would have the same problem in approach 3 when you use x=lambda: "tip / total_bill". If we can resolve it there, then the same logic could apply to the non-lambda scenario.

The difference is that it's less likely you'd be calling eval without meaning to or knowing what you're doing. First, you have to say "wtf is this lambda thing" and then you can read some docs with a warning. But more importantly, if you have an app where you're accepting user input (as strings) and passing it to Plot, you would need to worry about sanitizing the strings if they could trigger an eval, which isn't the case with the lambda construct.
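
A minimal sketch of that difference, assuming two hypothetical resolvers: one that evaluates any string that is not a column name, and one that evaluates only strings explicitly wrapped in a marker type. The wrapper stands in for any explicit opt-in, including the lambda construct. None of these names are seaborn API.

```python
import pandas as pd


class Expr(str):
    """Hypothetical marker: only strings wrapped in Expr are ever evaluated."""


def implicit(df, spec):
    # Any string that is not a column name is evaluated as code.
    return df[spec] if spec in df.columns else df.eval(spec)


def explicit(df, spec):
    # Evaluation requires an explicit opt-in; plain strings are column lookups.
    return df.eval(str(spec)) if isinstance(spec, Expr) else df[spec]


# Made-up values standing in for the tips dataset.
tips = pd.DataFrame({"total_bill": [10.0, 20.0], "tip": [1.0, 4.0]})
user_input = "tip * 2"  # pretend this string came from a form field

evaluated = implicit(tips, user_input)  # silently evaluated
try:
    explicit(tips, user_input)
    outcome = "evaluated"
except KeyError:
    outcome = "rejected"  # plain strings are never passed to eval

print(evaluated.tolist(), outcome)
```

With the explicit form, an app that forwards user strings only ever performs column lookups, so evaluation happens only where the developer wrote the wrapper.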

mahlzahn commented 6 months ago

It would be great to have this included in some way. Here is an ad-hoc wrapper for the first and second options:

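import types

import pandas as pd


class Expr(str):
    """Marker type: a string to be evaluated with DataFrame.eval."""


def evalize(f, **kwargs):
    """Resolve Expr and function arguments against `data`, then call f."""
    if 'data' in kwargs:
        # Work on a copy so the caller's dataframe is not modified.
        kwargs['data'] = kwargs['data'].copy()
        for k in kwargs:
            if isinstance(kwargs[k], Expr):
                data_eval = kwargs['data'].eval(kwargs[k])
                if isinstance(data_eval, pd.Series):
                    # Plain expression: store the result in a column named after it.
                    kwargs['data'][kwargs[k]] = data_eval
                else:
                    # Assignment ("C = A / B"): eval returns a new dataframe.
                    kwargs['data'] = data_eval
                    kwargs[k] = kwargs[k].split('=', 1)[0].strip()
            elif isinstance(kwargs[k], types.FunctionType):
                # Option 1: call the function with the (copied) dataframe.
                kwargs[k] = kwargs[k](kwargs['data'])
    return f(**kwargs)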

which can then be used, e.g., like this:

import seaborn as sns
df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
evalize(sns.scatterplot, data=df, x=Expr('A + B'), y=lambda d: d.eval('A - B'), hue=Expr('C = A / B'))

with both x and hue having nice labels of 'A + B' and 'C', respectively.

It has some issues that a proper implementation could fix. For example, Expr('A + B') and a column named 'A + B' cannot be used at the same time, because the latter gets overwritten before plotting. It also breaks when other keywords take callables as inputs.