pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.27k stars 17.8k forks source link

ENH: Method for selecting columns from DataFrameGroupBy (and maybe DataFrame) #40322

Open mwaskom opened 3 years ago

mwaskom commented 3 years ago

Is your feature request related to a problem?

When using a "fluent"/"method chaining" style of programming with pandas, a common hiccup is the lack of a method for selecting columns from a DataFrameGroupBy object. For example, take this DataFrame

import pandas as pd
df = pd.DataFrame(dict(
    x=[1, 2, 3, 4, 5, 6],
    y=[2, 4, 6, 1, 2, 3],
    a=["a", "b", "c", "a", "b", "c"],
    b=["x", "x", "y", "z", "z", "z"],
))

To group by one column and then return the mean of another column within each group, one has to use multiple syntactical constructs:

(
    df
    .groupby("a")
    ["x"]
    .mean()
)

There is also dot-attribute access columns:

(
    df
    .groupby("a")
    .x
    .mean()
)

but that is not fully general: it won't work if your column name collides with an existing method, and it won't work if the column is defined by a variable.

Another option is to select the column first then groupby using a series, not a name:

(
    df["x"]
    .groupby(df["a"])
    .mean()
)

This is not so bad. Well, ideally each step in the pipeline would be on a new line, and it requires some duplicated typing. But the bigger issue is that it fails when you want to group by more than one column:

(
    df["x"]
    .groupby(df[["a", "b"]])
    .mean()
)
Traceback ```python-traceback --------------------------------------------------------------------------- ValueError Traceback (most recent call last) in 1 ( ----> 2 df["x"] 3 .groupby(df[["a", "b"]]) 4 .mean() 5 ) ~/miniconda3/envs/py39/lib/python3.9/site-packages/pandas/core/series.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna) 1689 axis = self._get_axis_number(axis) 1690 -> 1691 return SeriesGroupBy( 1692 obj=self, 1693 keys=by, ~/miniconda3/envs/py39/lib/python3.9/site-packages/pandas/core/groupby/groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna) 558 from pandas.core.groupby.grouper import get_grouper 559 --> 560 grouper, exclusions, obj = get_grouper( 561 obj, 562 keys, ~/miniconda3/envs/py39/lib/python3.9/site-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna) 826 # allow us to passing the actual Grouping as the gpr 827 ping = ( --> 828 Grouping( 829 group_axis, 830 gpr, ~/miniconda3/envs/py39/lib/python3.9/site-packages/pandas/core/groupby/grouper.py in __init__(self, index, grouper, obj, name, level, sort, observed, in_axis, dropna) 541 if getattr(self.grouper, "ndim", 1) != 1: 542 t = self.name or str(type(self.grouper)) --> 543 raise ValueError(f"Grouper for '{t}' not 1-dimensional") 544 self.grouper = self.index.map(self.grouper) 545 if not ( ValueError: Grouper for '' not 1-dimensional ```

(I actually feel like this really ought to work, independently of whether it's the best solution to this Feature Request, but let's leave it aside for now).

So in summary: There's no fully-general, non-awkward method for selecting a column from DataFrameGroupBy. While certainly not fatal, it does make fluent pandas harder to read[1] and write[2]. And this is not an esoteric application: groupby/select/apply must be one of the most common pandas workflows.

[1] The great advantage of this method chaining approach for understanding code is that you can read off the sequence of verbs that comprise the operation. Introducing the [column] syntax requires your brain to shift from comprehension mode to production mode — you need to generate a verb for what's happening — which is a costly operation that likely (on the margin) impairs comprehension.

[2] I really like fluent pandas but it currently can be a little bit annoying to fight with autoindent/autoformat tools when writing it. It likely would be easier to improve that tooling if one could assume each pipeline step begins with a dotted method access.

Describe the solution you'd like

The DataFrameGroupBy object could have a column selection method.

Possible names:

DataFrameGroupBy.get

This exists in DataFrame as a thin wrapper around self[arg]. Possibly adds confusion with DataFrameGroupBy.get_group

DataFrameGroupBy.select

This is the method I have reached for more than once. I'm broadly aware of the history of NDFrame.select; this name should now be available, though it could cause some confusion to reintroduce it with different semantics (but xref https://github.com/pandas-dev/pandas/issues/26642).

It probably wouldn't make sense to add .select only to DataFrameGroupBy and not to DataFrame/NDFrame. So this would be more work (but also more benefit?)

In code, that might look like

(
    df
    .groupby("a")
    .select("x")
    .mean()
)

API breaking implications

No breakage, unless someone has very old code using DataFrame.select that completely missed the original deprecation cycle and gets resurrected now.

Adding new methods (especially new ways of indexing pandas objects) definitely has an API complexity cost. But I would argue that, by converging on "one way to do it" in method chains, it could provide a a net reduction in complexity from the user perspective.

Additional context

This originally got some discussion on twitter, note especially @TomAugspurger's response.

Other relevant context would be the select verb in dplyr.

TomAugspurger commented 3 years ago

I think I like the idea of ripping off dply'rs select (targeted to selecting ~rows~ columns). It feels a bit more natural than NDFrame.filter, which seems like it should operate on the rows (feels like a SQL where clause).

So overall I think +1 to adding select to NDFrameGroupBy and DataFrame. I think I like dplyr's select's API where the first argument can be a scalar, sequence, or callable, more than NDFrame.filter which spreads these out. I think that callables would be passed each scalar in the .columns and would be expected to return a bool indicated whether that key is selected.

cbrnr commented 3 years ago

Just to clarify, dplyr's select selects columns - what do you mean by "(targeted to selecting rows)" in your first sentence?

TomAugspurger commented 3 years ago

By "targeted to selecting rows" I obviously meant "targeted to selecting columns" :) Fixed.

jreback commented 3 years ago

why is

(
    df
    .groupby("a")
    ["x"]
    .mean()
)

just not the answer here? why is 'let's just add api' the answer?

I am -1 on adding almost anything at this point to an already very complicate interface.

mwaskom commented 3 years ago

Because either pandas supports method chaining as an interface or it doesn't.

If it does, then it should be possible to define a sequence of operations using a single syntactical construct. Needing to switch back and forth has a cost when learning pandas, when writing code and, and when reading code.

Now obviously there will be some edge cases that aren't possible through the pandas API, which is where .pipe comes in handy.

But this usecase (group by then select) is likely one of the most common operations in pandas.

I get that there is a maintenance cost to increasing the API surface, but users accrue real benefits from being able to do things in a consistent, predictable way.

jreback commented 3 years ago

Because either pandas supports method chaining as an interface or it doesn't.

again what is wrong with the first construct itself. This does support method chaining. What exactly do you want to do that is not possible?

mwaskom commented 3 years ago

It supports method chaining, then, oops, wait, gotta use the indexing operator, then ok some more method chaining.

That's not exactly the same thing.

I think it would be a mistake to blanket reject API additions that enhance consistency just because they don't add sufficient novel functionality.

(That said, some people feel excited about using a callable to select, which isn't possible in the current API).

Batalex commented 3 years ago

Hi,

the current pandas workflow incites to "extract" the bare minimum data, then apply transformations. This proposal would make it easier to go the other way and avoid those types of method chaining:

df.query("sex == 'male'").groupby("species").apply(
    lambda x: x.filter(regex="_mm").mean()
)

That being said, this ⬆️ got the job done so far so I totally understand the point on the maintenance burden.

mwaskom commented 3 years ago

One thing I really like about Python is that usually, when you find yourself doing something that feels really awkward, it's a clue that there's a better way to do it.

That intuition has trapped me here many times. I write a pipeline that mixes method and index chaining, think "no way, that's too ugly, there must be a better syntax for this standard task" and then sink time trying to figure out what the right method is called.

jreback commented 3 years ago

It supports method chaining, then, oops, wait, gotta use the indexing operator, then ok some more method chaining.

so you are saying that python itself should not have [] operators and instead have .getitem() or whatever.

but we have these operators. I am not sure what you are actually suggesting, e.g. .select() sure, let's introduce a completely new function that has now precedence.

and please don't put words in my statement

I think it would be a mistake to blanket reject API additions that enhance consistency just because they don't add sufficient novel functionality.

I don't reject things off hand, instead they must have a really high bar. It doesn't need to be novel, just reduce complexity. The proposal may reduce complexity a small amount, at the enormous expense of adding non-standard usages.

mwaskom commented 3 years ago

When someone takes a while to write an informative feature request, with an explanation, and then just gets a "but why add this" response that doesn't engage at all with what was written, it suggests there is a kneejerk policy of rejection. (Which is fine too, you're the project lead, but it doesn't really incentivize thoughtful feature requests).

I don't think everything should be a method call. But I do think that when everything but one step in a chain of operations is a method call, it's ugly and confusing.

If pandas default were to operate in place, with an option for returning a copy, it would be reasonable to say "well we don't really support method-chaining as a first-class interface". But your design decisions have pushed users in that direction.

And the success of pandas — and dplyr pipelines in R — has proven that it's a really useful idiom for data transforms. So it's' painful when there's one (really common) step where it breaks down.

attack68 commented 3 years ago

I don't think everything should be a method call. But I do think that when everything but one step in a chain of operations is a method call, it's ugly and confusing.

I disagree. I appreciate its subjective but I see no issue with the current syntax, in fact I prefer the simplicity of the operators to another method, don't find it ugly nor painful. I think it would be more confusing to have and maintain multiple syntax achieving the same result (there are other areas I would like to see focus/consistency on selecting data). However, its is a useful comment to raise discussion, its just not for me.

mwaskom commented 3 years ago

in fact I prefer the simplicity of the operators to another method

Why are operators "simpler" than a method call, in the context of a sequence of method calls?

Bear in mind that pandas overloads the behavior of [ ] in various ways (i.e. sometimes that's row selection, sometimes it's column selection).

mwaskom commented 3 years ago

Note that [ ] on a DataFrameGroupBy behaves differently from [ ] on a DataFrame:

# Works
df[df["b"] == "x"].mean()

# Errors
df.groupby("a")[df["b"] == "x"].mean()

And a downside of the operators is that they don't have a docstring you can quickly check to see how they work. You need to dig up a description of the behavior from ... somewhere in the docs.

An advantage of adding a .select to both is that it could have the same semantics in both contexts, and it could be easy to discover what it does.

jreback commented 3 years ago

When someone takes a while to write an informative feature request, with an explanation, and then just gets a "but why add this" response that doesn't engage at all with what was written, it suggests there is a kneejerk policy of rejection. (Which is fine too, you're the project lead, but it doesn't really incentivize thoughtful feature requests).

for being the author of a nice package this is a really disrespectful response. This is a very valid response. Pandas api is WAY too big. I think we can all agree on that. The bar for adding anything is REALLY high. The initial response to virtually anything is NO.

YOU have to have a really convincing argument to persuade. Feel free to get support on this issue.

mwaskom commented 3 years ago

I don't mean disrespect, but I spent my whole morning writing up an issue that had clear examples, considered various alternatives, acknowledged the cost of an API expansion, and gave reasons for why I think the existing syntax is a problem. To get a response that demanded justification for the proposal without engaging on anything I wrote in the initial post was frustrating. This isn't the first time I've had this experience, and when I have ideas for features in pandas I often just don't even bother opening an issue: not because the bar to acceptance is high, but because the effort of writing a clear proposal is not respected.

But we're getting distracted.

I guess i'll copy what I see as the key point

The great advantage of this method chaining approach for understanding code is that you can read off the sequence of verbs that comprise the operation. Introducing the [column] syntax requires your brain to shift from comprehension mode to production mode — you need to generate a verb for what's happening — which is a costly operation that likely (on the margin) impairs comprehension.

This is a cognitive explanation for the (yes, subjective) reaction to the "ugly" code. It's ugly because it requires you to do too much work to understand it.

IMO it is totally valid to consider consistency as a justification for increasing API. I reacted negatively to your question of "What exactly do you want to do that is not possible?" because I think I was pretty clear that the problem is not what can be done but how it is to be done.

jreback commented 3 years ago

@mwaskom

I don't mean disrespect, but I spent my whole morning writing up an issue that had clear examples, considered various alternatives, acknowledged the cost of an API expansion, and gave reasons for why I think the existing syntax is a problem. To get a response that demanded justification for the proposal without engaging on anything I wrote in the initial post was frustrating. This isn't the first time I've had this experience, and when I have ideas for features in pandas I often just don't even bother opening an issue: not because the bar to acceptance is high, but because the effort of writing a clear proposal is not respected.

my response had almost nothing to do with the what you wrote / or the time you spent. It actually was reading it from a glance. I couldn't tell why the first example was not correct. Your narrative may have looked clear to you since you spent time on it. But my goal is to read these w/o a lot of context and try to determine what exactly is the problem and what the proposed solution is. Then I ask clarifying questions.

This interaction clearly failed. If someone spends a lot of time writing something up, but its not clear at a glance what its about., then it needs revision.

Most constributors / reviewers do not have the time to real a long extended treatise on things (not that your is, but just saying). Some like to write really long things. To be honest this is a burden on all, you can take your time to write whatever you want and expect that everyone else has the time to read it. Again I am not saying that is the case here.

So my point is to be even more clear and edit / clarify the writeup. I asked several times 'why' is this not fluent and all I got defensive arguments

mwaskom commented 3 years ago

Thanks for clarifying your approach. I'll consider this resolved.

erfannariman commented 3 years ago
(
    df["x"]
    .groupby(df[["a", "b"]])
    .mean()
)

Note: this can still be achieved:

(
    df["x"]
    .groupby([df["a"], df["b"]])
    .mean()
)
jorisvandenbossche commented 3 years ago

@mwaskom thanks a lot for opening the issue and providing a nice, structured proposal and argumentation.

Let's not close this issue because the interaction with Jeff didn't go that well, but give time for others to join the discussion.

Personally, I am +1 on adding a select() method, both to DataFrame and DataFrameGroupby. For me, that has been a big reason in deprecating the previous version, so we can actually use it for this purpose (https://github.com/pandas-dev/pandas/issues/26642). I agree with the rationale that it would be nice that such a basic operation is fluently possible in a method chaining syntax. And I also think that a method is more suitable for advanced features compared to loc (like passing a callable, or regex pattern, or ..).

mwaskom commented 3 years ago

Alright, if this remains under discussion, I'll try to summarize the points I've made.

  1. My main claim is that an index operator in the middle of a method chain is ugly.
  1. Named methods have benefits over operator overloading
  1. The relative benefits of the index operator don't apply here
CharlesHe16 commented 3 years ago

Hi @mwaskom

Do you have an example (maybe wildy ignoring Pandas / Python rules) to illustrate what you want?

I think I understand the post and it is well written. I just don't see the counterfactual.

Also, I really agree with this:

One thing I really like about Python is that usually, when you find yourself doing something that feels really awkward, it's a clue that there's a better way to do it.

That intuition has trapped me here many times. I write a pipeline that mixes method and index chaining, think "no way, that's too ugly, there must be a better syntax for this standard task" and then sink time trying to figure out what the right method is called.

There are many workflows which feel extremely weird to execute in Pandas, even to a vivid degree.

This is sort of unrelated but one example is the use case of creating a unique "ID" per group.

Until 2018/2019 this seems to involve undocumented functionality until being addressed by ngroup(). See: https://stackoverflow.com/a/15074395/14929415

Again, off-topic but perhaps this is due to some "path dependency"/ "founder effect" of its original financial time series origin, which may then produced counter intuitive effects.

jorisvandenbossche commented 3 years ago

Do you have an example (maybe wildy ignoring Pandas / Python rules) to illustrate what you want?

Using the example of the top post (that shows different ways how it is possible right now to achieve this), with a select() method, that could become:

(
    df
    .groupby("a")
    .select("x")
    .mean()
)
CharlesHe16 commented 3 years ago

Thanks @jorisvandenbossche,

This is helpful to me, someone who had to look up "fluent" programming.

I think this example makes a lot of sense now and I think I agree with the OP's point (barring the issues of dependencies, coordination, mental energy, open source cat herding, burn out, etc. that may or may not be touched on by other discussion).

mwaskom commented 3 years ago

Thanks @jorisvandenbossche; edited the initial post to add that and make the proposal more explicit.

jbrockmendel commented 3 years ago

@mwaskom thanks for taking the time. A couple of questions to make sure i understand correctly

1) The title says "and maybe DataFrame", but the discussion looks to be all about GroupBy. Is the DataFrame bit a "and in principle this could also apply to" thing, or is there part of the discussion I missed?

2) The basic version of this is basically just asking for an aesthetically nice alias to __getitem__?

mwaskom commented 3 years ago

The title says "and maybe DataFrame", but the discussion looks to be all about GroupBy. Is the DataFrame bit a "and in principle this could also apply to" thing, or is there part of the discussion I missed?

No we haven't really discussed whether it should be a DataFrame method too, but in principle the same arguments for (and against) would apply, and IMO things are more predictable if DataFrame and DataFrameGroupBy have the same methods in situations where they make sense on both classes.

The basic version of this is basically just asking for an aesthetically nice alias to getitem?

Yes, others are excited about using a method (v.s. index operator) to allow more elaborate selection rules (possibly overlapping with DataFrame.filter, one should note), but the core proposal is very simple.

rhshadrach commented 3 years ago

One might also consider how black treats method chaining. Running with black version 21.5b2 on

df = (
    pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
    .groupby(["a", "b"])
    .min()
    .reset_index()
    [["a", "c"]]
    .groupby(["a"])
    .mean()
)

results in

df = (
    pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
    .groupby(["a", "b"])
    .min()
    .reset_index()[["a", "c"]]
    .groupby(["a"])
    .mean()
)

In my opinion this is an undesirable result - the subsetting of columns should be on its own line, otherwise it is easy to miss when glancing at the code. However perhaps this is an upstream (black) issue.

kwhkim commented 2 years ago

I may add my opinion on this.

I agree with OP's point that introducing .select() would increase readability.

I think API consistency is important for users that do not use pandas everyday.

I personally think there are so many variation for the same method for different class(ex. DataFrame, DataFrameGroupBy, etc.)

And I personally think so many methods for the same function is not a downside, (Users can select what they want and documentation can filter out what is more fundamental or basic) and I found inconsistency from the arguements "No more methods because it will confuse users", and "No more methods for less confusing syntax".

Arguments above inevitably makes pandas stuck in where it is now.

Anyhow, I myself implemented the method select and monkey-patched it to DataFrame and DataFrameGroupBy

import pandas as pd
def select(self, *args):
    #print(args, len(args))
    if len(args)==1:
        if isinstance(args[0], list):                 
            args = args[0]            
        else:            
            args = [args[0]]                        
    else:
        args = [x for x in args]
    #print(args)
    return self.__getitem__(args)

pd.DataFrame.select = select
pd.core.groupby.generic.DataFrameGroupBy.select = select

it's just a wrapper for .__getitem__() and it seems to work fine anyhow.

df = pd.DataFrame(dict(
    x=[1, 2, 3, 4, 5, 6],
    y=[2, 4, 6, 1, 2, 3],
    a=["a", "b", "c", "a", "b", "c"],
    b=["x", "x", "y", "z", "z", "z"],
))

print(df.select(['x', 'y']))
##    x  y
## 0  1  2
## 1  2  4
## 2  3  6
## 3  4  1
## 4  5  2
## 5  6  3

print(df.select('x', 'y'))
##    x  y
## 0  1  2
## 1  2  4
## 2  3  6
## 3  4  1
## 4  5  2
## 5  6  3

print(df.groupby(['a', 'b']).select('x', 'y').mean())
##        x    y
## a b          
## a x  1.0  2.0
##   z  4.0  1.0
## b x  2.0  4.0
##   z  5.0  2.0
## c y  3.0  6.0
##   z  6.0  3.0

print(df.groupby(['a', 'b']).select(['x', 'y']).mean())
##        x    y
## a b          
## a x  1.0  2.0
##   z  4.0  1.0
## b x  2.0  4.0
##   z  5.0  2.0
## c y  3.0  6.0
##   z  6.0  3.0