pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
42.82k stars 17.65k forks source link

API: capabilities of df.set_index #24046

Open h-vetinari opened 5 years ago

h-vetinari commented 5 years ago

This is coming out of a discussion that has stalled #22225 (which is about adding .set_index to Series, see #21684). The discussion has shifted away from what capabilities a putative Series.set_index should have, but what capabilities df.set_index has currently.

The main issue (for @jreback) is that df.set_index takes arrays:

@jreback: There were several attempts to have DataFrame.set_index take an array as well, but these never got off the ground.

@h-vetinari: I'm not sure when, but they certainly did get off the ground:

>>> import pandas as pd
>>> import numpy as np
>>> pd.__version__
'0.23.4'
>>>
>>> df = pd.DataFrame(np.random.randint(0, 10, (4, 4)), columns=list('abcd'))
>>> df.set_index(['a',          # label
...               df.index,     # Index
...               df.b ** 2,    # Series
...               df.b.values,  # ndarray
...               list('ABCD'), # list
...               'c'])         # label again
              b  d
a   b      c
0 0 0  2 A 1  0  2
8 1 1  4 B 4  1  4
3 2 25 5 C 8  5  5
0 3 9  7 D 2  3  7

Further on:

@jreback: @h-vetinari you are confusing the purpose of .set_axis. [...] The problem with .set_index on a DataFrame with an array is that it technically can work with an array and not keys. (meaning its not unambiguous)

I don't think I am confusing them. If I want to set the .index-attribute of a Series/DataFrame, then using .set_index is the most reasonable name by far. If anything, set_axis should be a superset of set_index (and a putative set_columns), that just switches between the two based on the axis-kwarg.

More than that, the current capabilities of df.set_index are a proper superset of df.set_axis(axis=0)*, in that it's possible to fill keys with only* Series/Index/ndarray/list etc.:

>>> df.set_index(pd.Index(df.a))  # same result as Series directly below
>>> df.set_index(df.a) 
   a  b  c  d
a
0  0  0  1  2
8  8  1  4  4
3  3  5  8  5
0  0  3  2  7
>>> df.set_index(df.a.values)  # same result as list directly below
>>> df.set_index([[0, 8, 3, 0]])
   a  b  c  d
0  0  0  1  2
8  8  1  4  4
3  3  5  8  5
0  0  3  2  7

** there is one caveat, in that lists (and only lists; out of all containers) need to be wrapped in another list, i.e. df.set_index([[0, 8, 3, 0]]) instead of df.set_index([0, 8, 3, 0]). This is the heart of the ambiguity that @jreback mentioned above (because a list is interpreted as a list of column keys).

Summing up:

Since I can't tag @pandas-dev/pandas-core, here are a few individual tags: @jreback @TomAugspurger @jorisvandenbossche @gfyoung @WillAyd @jbrockmendel @jschendel @toobaz.

EDIT: Forgot to add an xref from @jreback:

@h-vetinari we had quite some discussion about this: #14829 and never reached resolution. This is an API question.

In that issue, there's discussion largely around .rename, and how to make that method more consistent. Also discussed was potentially introducing .relabel, as well as .set_columns.

toobaz commented 5 years ago

or you mistyped...?)

Yeah, I meant set_index. You say set_index is the first thing people turn to, and I totally agree. Then you say that doing it with arrays is "reasonable"... which is different ;-), and there's already a method for that.

It would be a _super_set

No,"deprecate it" = deprecate set_index. (By the way, it was purely hypothetical reasoning)

For the rest, you make some confusion between removing functionality (the only functionality removed in my proposal would be the mixed case, which you admit is not so important) and just making the API more tidy, asking some users to replace a method name.

I'm all against 3 because I'm all against special cases (and I think I didn't even take part in the discussion mentioned in https://github.com/pandas-dev/pandas/pull/24697 !)

ghost commented 5 years ago

Edited

I should apologize for restarting this discussion. It was incredibly naive of me to think I could make a difference where so many have tried before and now I'm letting this go.

h-vetinari commented 5 years ago

@toobaz: For the rest, you make some confusion between removing functionality [...] and just making the API more tidy

Well, df.set_index currently supports arrays, and (IIUC) you want to remove that functionality.

@toobaz: Then you say that doing it with arrays is "reasonable"... which is different ;-), and there's already a method for that.

I disagree that it's unreasonable/different, and also that set_axis can serve as a direct replacement; because it's not obvious except to people who know the arbitrary idiosyncrasy that for using arrays, they have to switch to set_axis.

toobaz: I'm all against 3 because I'm all against special cases (and I think I didn't even take part in the discussion mentioned in #24697 !)

The "special-case" objection is fair enough, but it has to be weighed against the impact of other alternatives, namely ambiguity of list_of_scalar, or removing core functionality (i.e. using arrays in set_index). To me, both of the latter are worse resp. much worse than having a special case for the list, which has to be interpreted in too many places (leading to other special cases). This is very similar to how tuples are also overused (and about tuples you agree, AFAIK... #24702).

h-vetinari commented 5 years ago

@pilkibun: Anyway, I should apologize for restarting this discussion. It was incredibly naive of me to think I could make a difference where so many have tried before and now I'm letting this go.

IMO you shouldn't apologize!

  1. you couldn't have known [realistically speaking]
  2. it's important to draw attention to these deficiencies, no matter how entrenched. Just because a certain corner of the API is difficult/delicate shouldn't mean it cannot ever be rectified/improved.

So from my side: thanks for having picked this up again (however briefly). ;-)

toobaz commented 5 years ago

About terminology: let's not talk about "removing core functionality" when we mean reshaping the API by removing duplicates. The only functionality removal we are talking about here is the mixed case. Only by avoiding provocations we can try to reduce the sheer amount of noise in this issue.

To me, both of the latter are worse resp. much worse than having a special case for the list, which has to be interpreted in too many places (leading to other special cases). This is very similar to how tuples are also overused (and about tuples you agree, AFAIK... #24702).

I think you have a point if you are saying (not sure I understood correctly) that we tend to represent MultiIndex keys with tuples, and that the content of nested lists are MultiIndex keys. But I think there is a stronger point: where we accept arrays, we also accept lists; and if we accept 2D arrays, we will accept lists of lists. In other words, before data becomes the content of a MultiIndex, it is "just data".

h-vetinari commented 5 years ago

@toobaz: Only by avoiding provocations we can try to reduce the sheer amount of noise in this issue.

There's no provocation; I will however correct it if my argument is misrepresented (unintentionally).

About terminology: let's not talk about "removing core functionality" when we mean reshaping the API by removing duplicates. The only functionality removal we are talking about here is the mixed case.

I'm very sorry, but it is removing core functionality - namely "arrays in set_index". It was an enhancement when it was added, and 7 years later, it is justifiable to call it core functionality (esp. considering how pandas is all about indexed data).

Talking about set_index and set_axis as "duplicates" is flawed IMO, because they deal with the exact same objects ("axes") - they are much less duplicates, rather than two sides of the same coin. If you want to deduplicate on the implementation side, you can make set_axis([...], axis=0) use the implementation of set_index([...]).

TLDR: Having different behaviour for set_index and set_axis is IMO bad API-design.

I think you have a point if you are saying [...]

My point was mostly different: lists - like tuples - are used in too many different roles, forcing many code paths for all the different scenarios in many places. Just as I support interpreting tuples always as ~scalars~ MultiIndex keys (and not, e.g. as array-likes; cf. #24702), I would argue that lists should preferably have the role of a sequence of objects (as in: "these columns are supposed to form a MultiIndex"), and also give up the array-like interpretation (which can be trivially replaced by Series/Index/np.ndarray/etc.).

h-vetinari commented 5 years ago

I forgot to add a fifth option to the recap above: Turning the signature from

def set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False):

to

def set_index(keys=None, arrays=None, drop=True, append=False, inplace=False,
              verify_integrity=False):

as suggested further up the thread. This would also sacrifice the mixed case, but make it very explicit what's being demanded (which would also resolve the ambiguity of list_of_scalar).

PS. @pilkibun, adding materially new content to a comment through an edit makes it hard to see resp. respond to.

ghost commented 5 years ago

Sorry, I removed the new content and also most of the old.

Without arguing for one position over another, here's the state of things as seen from the user's point of view who looks to the documentation for guidance:

# grep -R '\.index' doc/source
whatsnew/v0.14.0.rst
158:    df_multi.index = tuple_ind
164:    df_multi.index = mi

whatsnew/v0.16.0.rst
98:   s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),

user_guide/indexing.rst
1706:   data.index = index

getting_started/10min.rst
653:   ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

user_guide/io.rst
1893:   dfj2.index = pd.date_range('20130101', periods=5)
2952:   df.index = df.index.set_names(['lvl1', 'lvl2'])

getting_started/basics.rst
233:   dfmi.index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'),

user_guide/timeseries.rst
2096:   ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

user_guide/sparse.rst
306:   s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),

At the same time, all example uses of set_index are for the column case and there are 0 usage examples of set_axis which appears only in the auto-generated reference.

I think it's fair to say that the docs currently advocate for always setting the index directly and gives no indication that setting a series index in a method-chain is supported. I care about this more than I care about which competing alternative wins out.

My personal preference is still to keep set_index as it is and add its equivalent to Series. And in any case to update the documentation accordingly.

toobaz commented 5 years ago

I'm very sorry, but it is removing core functionality

You're not trying to understand.

toobaz commented 5 years ago

give up the array-like interpretation

I'm pretty sure this will never pass.

h-vetinari commented 5 years ago

@h-vetinari: I'm very sorry, but it is removing core functionality

@toobaz: You're not trying to understand.

I understand that, to you, the functionality is not removed because it lives on in set_axis. We just disagree about what constitutes good API design.

@h-vetinari: give up the array-like interpretation

@toobaz: I'm pretty sure this will never pass.

It doesn't have to be everywhere or at the same time, but removing such overloaded interpretations would greatly simplify the API surface as well as code maintainability. I'm saying it could be a goal to strive for, like striving to have tuples always be MI-keys (and was an argument why "special-casing" in option 3 can be desirable).

@pilkibun: [...] and there are 0 usage examples of set_axis which appears only in the auto-generated reference.

@toobaz, this is part of the reason why I consider shifting the array-capability from set_index to set_axis as a removal of functionality - it is not nearly as intuitive or wide-spread as set_index.

toobaz commented 5 years ago

We just disagree about what constitutes good API design.

We mostly disagree on what constitutes productive interaction. And from my part, this is the last "metacomment" in this discussion.

removing such overloaded interpretations would greatly simplify the API surface as well as code maintainability

I totally agree it would simplify the code, but we don't want our users to pay the price. And it would be particularly ironic to start doing so because we want to overload the interpretation of the argument to set_index.

So among the options you listed, only 4 and 5 are feasible. To be honest I don't like 5, but this might be subjective. Do you agree that a deprecation message which tells our users "please replace set_index with set_axis" will serve our users at least as well as a deprecation message which tells our users "please replace set_index with set_index(arrays=.)"?

h-vetinari commented 5 years ago

I totally agree it would simplify the code, but we don't want our users to pay the price. And it would be particularly ironic to start doing so because we want to overload the interpretation of the argument to set_index.

I agree that users should not pay the price. I also don't want the overloaded interpretation in set_index, but would tackle it differently.

Do you agree that a deprecation message which tells our users "please replace set_index with set_axis" will serve our users at least as well as a deprecation message which tells our users "please replace set_index with set_index(arrays=.)"?

As a deprecation, yes. After the removal, not so much (because set_index will still be the more obvious choice, also for arrays).

So among the options you listed, only 4 and 5 are feasible.

I think 3 and 5 are feasible. The overlap is 5, and I'd be happy to support that.

We mostly disagree on what constitutes productive interaction.

Indeed, and we both have our failings there. Thanks for taking the time/energy to stick with it.

toobaz commented 5 years ago

because set_index will still be the more obvious choice, also for arrays

Why "obvious"? For the name "set_index" vs. "set_axis"? Or because you like everything in the same method? (either way, I think it's just a matter of having clear documentation)

I think 3 and 5 are feasible.

3 was so far ruled out by both me and @jreback - not very useful to cite it if you don't have new arguments. I guess we never discussed much 5, so I'll be happy to mention it in the next live chat. I still think we will end up choosing 4.

h-vetinari commented 5 years ago

Why "obvious"? For the name "set_index" vs. "set_axis"?

That plays a large role, yes, because we're set-ting the index-attribute, and because having an intuitive API is crucial. Documentation does not help with intuitiveness.

3 was so far ruled out by both me and @jreback - not very useful to cite it if you don't have new arguments.

I have enough arguments**, but so far you have hardly responded to them, except by appeal to authority. 4 or 5 is your opinion, 3 or 5 is mine. Of course, you're in a much better position to enforce your ideas, but that does not improve the strength of your argument.

Still, we have already found some sort of minimal common ground with 5. If your preference for 4 over 5 is not too large to be overcome at all, I'd be happy to submit a PR that implements 5.

I realised I haven't even fully articulated one of the most important arguments against 4: saying that arrays can only be used in set_axis does not solve the lack of a set_index-method for Series**! The rest of my main arguments is recapped below the fold:

* that "arrays-in-`set_index`" was a specific [enhancement](https://github.com/pandas-dev/pandas/commit/553bacc7abe341991e4d043a5c17521902c7f127) and that this is core functionality not least due to its intuitiveness and centrality of indexes to pandas * that axis/index are two sides of the same coin (and not two duplicate coins), and that having different behaviours between `set_index` and `set_axis` is further detriment to intuitiveness * that deprecating `list-as-array` (vs. `list-as-collection`) is a goal to strive for, much more than a special case
toobaz commented 5 years ago

appeal to authority

This is a way to waste the time of both of us. So far, the only effect in this discussion of me being a core dev is that I am patiently replying to your rants, instead than doing more productive things.

But since you later repeated your arguments in a tidy way, I will for the last time to assume you're trying to be constructive, and repeat my objections to your arguments. But neither your arguments nor my replies are new - and my replies are not just mine, other devs contributed to this and related discussions. If you want to just discuss the points below over and over, let's do this in a live chat so at least we wast less time.

that "arrays-in-set_index" was a specific enhancement

The fact that an ability is added later on does not mean it cannot be deprecated (actually the opposite).

and that this is core functionality not least due to its intuitiveness and centrality of indexes to pandas

"setting index with arrays" is a functionality (not so sure it is even a "core" one - setting from columns is much more frequent in the code I see), having it in set_index is just a matter of organizing the API. axis is not a strange term in pandas, it is one of the core concepts, so while index might be even more immediate, we are definitely not hiding the functionality by keeping it in set_axis.

that axis/index are two sides of the same coin (and not two duplicate coins), and that having different behaviours between set_index and set_axis is further detriment to intuitiveness

Your argument would suggest we want to make them almost-duplicates (apart from the axis= argument), and this is precisely something any good API should avoid. Vice-versa, separating the functionalities allows for less ambiguity (no, "ambiguity" is not an argument against option 5, but having a single method do widely different things based on arguments is not a favor to our users).

that deprecating list-as-array (vs. list-as-collection) is a goal to strive for, much more than a special case

As already clearly stated: you are right this would simplify our life, but we don't want our users to pay for. We always allowed users to skip the step of explicitly creating a vectorized object to feed our vectorized objects, and I see no reason why we should change our mind now. As for terminology, 2D arrays are collections of 1D arrays, which are collections of elements, so lists of lists make perfect sense.

saying that arrays can only be used in set_axis does not solve the lack of a set_index-method for Series

There is nothing to "solve" if the set_index does something which is completely useless for Series - we have set_axis that does the useful part. And we have discussed this some time ago. This said, if we do find that some sort of Series.set_index is useful for compatibility (like the MultiIndex methods which were backported to flat Indexes as idempotent), I will have no general objections. But it will be a result of, not an argument for, our decision about DataFrame.set_index, and it is OT here.

h-vetinari commented 5 years ago

So far, the only effect in this discussion of me being a core dev is that I am patiently replying to your rants, instead than doing more productive things. [...] I will for the last time to assume you're trying to be constructive, [...]

I appreciate the time you took, and thanks for that. Indeed I was trying to have a productive discussion, and what seemed like rants to you was my response to the perception that you dismissed my points out of hand (until your last reply).

Your argument would suggest we want to make them almost-duplicates (apart from the axis= argument), and this is precisely something any good API should avoid.

The point is that the concepts axis and index are "almost-duplicates" already, and so their methods should reflect that. Thinking of the methods as duplicate is having things backwards, because the concept comes before the method.

We always allowed users to skip the step of explicitly creating a vectorized object to feed our vectorized objects, and I see no reason why we should change our mind now.

Except maybe if it is the cause of the hotly-contested ambiguity of list_of_scalar that stalled this whole discussion since about a year. It's clear that there are no great choices here, but leaving users to decipher the difference between set_index([0, 1, 2]) and set_index([[0, 1, 2]]) is not amazing either.


Anyway, thanks for taking the time to respond. I'll note that you didn't react to my attempt at extending an olive branch though:

@h-vetinari: Still, we have already found some sort of minimal common ground with 5. If your preference for 4 over 5 is not too large to be overcome at all, I'd be happy to submit a PR that implements 5.

toobaz commented 5 years ago

The point is that the concepts axis and index are "almost-duplicates" already, and so their methods should reflect that

In API design, "adherence to language" is only one of the many arguments - and an argument which is not new in this discussion. And for sure the fact that we have (almost-)duplicates in language doesn't mean we want (almost-)duplicates in the API.

the hotly-contested ambiguity of list_of_scalar that stalled this whole discussion since about a year

You seem to think that this problem is important because it stalled for a year - maybe it stalled for a year because there were more important things ;-)

It's clear that there are no great choices here, but leaving users to decipher the difference between set_index([0, 1, 2]) and set_index([[0, 1, 2]]) is not amazing either.

Flat lists and nested lists should be clearly different objects to our users.

I'll note that you didn't react to my attempt at extending an olive branch though

I did. I said I don't like 5, I said why, but I also said I'm happy to discuss at our next live chat.

Now, unless there are new arguments, I suggest we wait for that.

h-vetinari commented 5 years ago

And for sure the fact that we have (almost-)duplicates in language doesn't mean we want (almost-)duplicates in the API.

Then the consequence should be deprecating set_axis, but not having very different methods for almost duplicate concepts.

[...] but I also said I'm happy to discuss at our next live chat.

Is there a date/time already?

toobaz commented 5 years ago

Then the consequence should be deprecating set_axis, but not having very different methods for almost duplicate concepts.

Setting the index from columns and from data are two operations (I guess this is what you mean by "concepts") which are different enough (we have seen) as to raise the need to avoid ambiguity. It can be done through different functions, or different args, or by defining special cases users should be aware of, but we definitely agree they are distinct. Which of these solutions we pick is mostly not a matter of linguistics.

Is there a date/time already?

No

toobaz commented 5 years ago

No

(but just write me privately if you want to set up a chat with me)

wesm commented 5 years ago

I'd like to point out that the tone of this thread makes me a bit uncomfortable. As a reminder, this project has a code of conduct

https://github.com/pandas-dev/pandas-governance/blob/master/code-of-conduct.md

In such discussions, I think we (both maintainers and contributors) need to stick to facts and technical arguments and leave feelings and editorial comments out of the process. There is a risk in technical arguments to stoop to emotive conjugation (https://en.wikipedia.org/wiki/Emotive_conjugation) in describing others' actions.

In general my understanding is that this project operates on the basis of consensus-based decision making -- when there is no consensus about a change, the default option is probably to do nothing. In theory as the BDFL I can help settle disagreements, but I would prefer not to except in truly exceptional circumstances.

I question whether GitHub issues was the appropriate venue for this discussion compared with some form of RFC / design document.

h-vetinari commented 5 years ago

@wesm Thanks for taking the time to respond here, although I regret that the reason was due to discomfort.

I have striven to avoid any emotionally charged words, but don't claim that I always succeeded. I believe all participants truly want the best for the combination of user- & maintainer-base, but such impassioned arguments take a lot of time and energy (which, I presume is the reason why many participants have not joined the discussion anymore).

I do object to the way some things were handled in this whole episode, but will not dwell on that. My main reasons for not resigning from this discussion are that I don't want the case dismissed without a fair hearing/counter-argument (even if I'm not core-dev), and that I feel the picture is much less one-sided even on the dev-side, as the impression that the last few comments might give.

I question whether GitHub issues was the appropriate venue for this discussion compared with some form of RFC / design document.

I'd be happy to participate in another format, but didn't know a better way than through an issue here.

PS. Thanks for the link about emotive conjugation. "How would I describe myself in their shoes" will be an excellent self-check before speaking/posting.

toobaz commented 4 years ago

In the last dev chat, which was held just few minutes ago, this issue was discussed and there was clear consensus for option 4, that is, "deprecate using set_index with arrays, and point to set_axis instead".

(Related to the "tone" of the discussion, I definitely try to keep emotionally charged language away from my comments, but whenever should I fail to do so, I welcome being explicitly reprimanded - ideally in private. I'm not a native speaker, and in any case I definitely won't be offended by any such rebuke. On the other hand, language aside, I think that when a discussion takes more of our time and energy than it is worth, there is nothing wrong in stating it, even if it is not a "fact or technical argument" on pandas itself.)

jorisvandenbossche commented 4 years ago

@toobaz can you give some reasoning why this is the clear preference? I would need to go through this long thread the understand it more, but maybe some arguments were summarized on the call?

h-vetinari commented 4 years ago

@toobaz Thanks for seeing this through. Although I would have liked to participate in the dev chat about this, and although I find the decision suboptimal, I guess any decision is better than no decision at this point. Time permitting, I'll look into a PR that deprecates arrays from set_index and outputs a nice warning to use set_axis.

@jorisvandenbossche The lack of documentation and transparency (note: not an accusation, that's just the way it has been so far) in such cases is why I'm thinking about a pandas version of PEPs/NEPs (#28568). I have been swamped recently and couldn't respond on that issue, but I will pick it up again.

toobaz commented 4 years ago

@toobaz can you give some reasoning why this is the clear preference?

To be honest there was more a recall of the reasoning already exposed here than any new argument. Some devs (e.g., @WillAyd ) were already clearly in favour of option 4.

The only new thing was a proposal (by @TomAugspurger if I recall correctly) to deprecate set_index entirely, replacing with two methods set_index_keys and set_index_values, a clean, but more disruptive, solution. But in the end, consensus on 4 was reached pretty quickly.

@h-vetinari you probably already know, but just in case: the call was publicly announced on the [pydata] mailing list. In any case, again, if there had been new arguments I would have written them here. It is hard to see the 76 comments here + other in related issues a "lack of documentation and transparency".

I'm thinking about a pandas version of PEPs/NEPs (#28568).

Wlll reply there.

h-vetinari commented 4 years ago

@toobaz I didn't know the announcement, thanks for the info.

@toobaz: It is hard to see the 76 comments here + other in related issues a "lack of documentation and transparency".

I do not consider dispersed discussion in several threads and comments as appropriate documentation (again: not as a criticism of you or the other devs, but rather of the current process). I added a comment about this in #28568.

TomAugspurger commented 4 years ago

@h-vetinari The calls and meeting notes are public.

to deprecate set_index entirely

Not quite: I was just choosing different names to highlight the different behavior. Clearly set_index should stay :)

toobaz commented 4 years ago

Not quite: I was just choosing different names to highlight the different behavior. Clearly set_index should stay :)

OK, thanks for the clarification ;-)