Open h-vetinari opened 5 years ago
or you mistyped...?)
Yeah, I meant set_index
. You say set_index
is the first thing people turn to, and I totally agree. Then you say that doing it with arrays is "reasonable"... which is different ;-), and there's already a method for that.
It would be a _super_set
No,"deprecate it" = deprecate set_index
. (By the way, it was purely hypothetical reasoning)
For the rest, you make some confusion between removing functionality (the only functionality removed in my proposal would be the mixed case, which you admit is not so important) and just making the API more tidy, asking some users to replace a method name.
I'm all against 3 because I'm all against special cases (and I think I didn't even take part in the discussion mentioned in https://github.com/pandas-dev/pandas/pull/24697 !)
Edited
I should apologize for restarting this discussion. It was incredibly naive of me to think I could make a difference where so many have tried before and now I'm letting this go.
@toobaz: For the rest, you make some confusion between removing functionality [...] and just making the API more tidy
Well, df.set_index
currently supports arrays, and (IIUC) you want to remove that functionality.
@toobaz: Then you say that doing it with arrays is "reasonable"... which is different ;-), and there's already a method for that.
I disagree that it's unreasonable/different, and also that set_axis
can serve as a direct replacement; because it's not obvious except to people who know the arbitrary idiosyncrasy that for using arrays, they have to switch to set_axis
.
toobaz: I'm all against 3 because I'm all against special cases (and I think I didn't even take part in the discussion mentioned in #24697 !)
The "special-case" objection is fair enough, but it has to be weighed against the impact of other alternatives, namely ambiguity of list_of_scalar
, or removing core functionality (i.e. using arrays in set_index
). To me, both of the latter are worse resp. much worse than having a special case for the list, which has to be interpreted in too many places (leading to other special cases). This is very similar to how tuples are also overused (and about tuples you agree, AFAIK... #24702).
@pilkibun: Anyway, I should apologize for restarting this discussion. It was incredibly naive of me to think I could make a difference where so many have tried before and now I'm letting this go.
IMO you shouldn't apologize!
So from my side: thanks for having picked this up again (however briefly). ;-)
About terminology: let's not talk about "removing core functionality" when we mean reshaping the API by removing duplicates. The only functionality removal we are talking about here is the mixed case. Only by avoiding provocations we can try to reduce the sheer amount of noise in this issue.
To me, both of the latter are worse resp. much worse than having a special case for the list, which has to be interpreted in too many places (leading to other special cases). This is very similar to how tuples are also overused (and about tuples you agree, AFAIK... #24702).
I think you have a point if you are saying (not sure I understood correctly) that we tend to represent MultiIndex
keys with tuples, and that the content of nested lists are MultiIndex
keys. But I think there is a stronger point: where we accept arrays, we also accept lists; and if we accept 2D arrays, we will accept lists of lists. In other words, before data becomes the content of a MultiIndex
, it is "just data".
@toobaz: Only by avoiding provocations we can try to reduce the sheer amount of noise in this issue.
There's no provocation; I will however correct it if my argument is misrepresented (unintentionally).
About terminology: let's not talk about "removing core functionality" when we mean reshaping the API by removing duplicates. The only functionality removal we are talking about here is the mixed case.
I'm very sorry, but it is removing core functionality - namely "arrays in set_index
". It was an enhancement when it was added, and 7 years later, it is justifiable to call it core functionality (esp. considering how pandas is all about indexed data).
Talking about set_index
and set_axis
as "duplicates" is flawed IMO, because they deal with the exact same objects ("axes") - they are much less duplicates, rather than two sides of the same coin. If you want to deduplicate on the implementation side, you can make set_axis([...], axis=0)
use the implementation of set_index([...])
.
TLDR: Having different behaviour for set_index
and set_axis
is IMO bad API-design.
I think you have a point if you are saying [...]
My point was mostly different: lists - like tuples - are used in too many different roles, forcing many code paths for all the different scenarios in many places. Just as I support interpreting tuples always as ~scalars~ MultiIndex keys (and not, e.g. as array-likes; cf. #24702), I would argue that lists should preferably have the role of a sequence of objects (as in: "these columns are supposed to form a MultiIndex"), and also give up the array-like interpretation (which can be trivially replaced by Series/Index/np.ndarray/etc.).
I forgot to add a fifth option to the recap above: Turning the signature from
def set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False):
to
def set_index(keys=None, arrays=None, drop=True, append=False, inplace=False,
verify_integrity=False):
as suggested further up the thread. This would also sacrifice the mixed case, but make it very explicit what's being demanded (which would also resolve the ambiguity of list_of_scalar
).
PS. @pilkibun, adding materially new content to a comment through an edit makes it hard to see resp. respond to.
Sorry, I removed the new content and also most of the old.
Without arguing for one position over another, here's the state of things as seen from the user's point of view who looks to the documentation for guidance:
# grep -R '\.index' doc/source
whatsnew/v0.14.0.rst
158: df_multi.index = tuple_ind
164: df_multi.index = mi
whatsnew/v0.16.0.rst
98: s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
user_guide/indexing.rst
1706: data.index = index
getting_started/10min.rst
653: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
user_guide/io.rst
1893: dfj2.index = pd.date_range('20130101', periods=5)
2952: df.index = df.index.set_names(['lvl1', 'lvl2'])
getting_started/basics.rst
233: dfmi.index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'),
user_guide/timeseries.rst
2096: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
user_guide/sparse.rst
306: s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
At the same time, all example uses of set_index
are for the column case and there are 0 usage examples of set_axis
which appears only in the auto-generated reference.
I think it's fair to say that the docs currently advocate for always setting the index directly and gives no indication that setting a series index in a method-chain is supported. I care about this more than I care about which competing alternative wins out.
My personal preference is still to keep set_index
as it is and add its equivalent to Series
. And in any case to update the documentation accordingly.
I'm very sorry, but it is removing core functionality
You're not trying to understand.
give up the array-like interpretation
I'm pretty sure this will never pass.
@h-vetinari: I'm very sorry, but it is removing core functionality
@toobaz: You're not trying to understand.
I understand that, to you, the functionality is not removed because it lives on in set_axis
. We just disagree about what constitutes good API design.
@h-vetinari: give up the array-like interpretation
@toobaz: I'm pretty sure this will never pass.
It doesn't have to be everywhere or at the same time, but removing such overloaded interpretations would greatly simplify the API surface as well as code maintainability. I'm saying it could be a goal to strive for, like striving to have tuples always be MI-keys (and was an argument why "special-casing" in option 3 can be desirable).
@pilkibun: [...] and there are 0 usage examples of
set_axis
which appears only in the auto-generated reference.
@toobaz, this is part of the reason why I consider shifting the array-capability from set_index
to set_axis
as a removal of functionality - it is not nearly as intuitive or wide-spread as set_index
.
We just disagree about what constitutes good API design.
We mostly disagree on what constitutes productive interaction. And from my part, this is the last "metacomment" in this discussion.
removing such overloaded interpretations would greatly simplify the API surface as well as code maintainability
I totally agree it would simplify the code, but we don't want our users to pay the price. And it would be particularly ironic to start doing so because we want to overload the interpretation of the argument to set_index
.
So among the options you listed, only 4 and 5 are feasible. To be honest I don't like 5, but this might be subjective. Do you agree that a deprecation message which tells our users "please replace set_index with set_axis" will serve our users at least as well as a deprecation message which tells our users "please replace set_index with set_index(arrays=.)"?
I totally agree it would simplify the code, but we don't want our users to pay the price. And it would be particularly ironic to start doing so because we want to overload the interpretation of the argument to set_index.
I agree that users should not pay the price. I also don't want the overloaded interpretation in set_index
, but would tackle it differently.
Do you agree that a deprecation message which tells our users "please replace set_index with set_axis" will serve our users at least as well as a deprecation message which tells our users "please replace set_index with set_index(arrays=.)"?
As a deprecation, yes. After the removal, not so much (because set_index
will still be the more obvious choice, also for arrays).
So among the options you listed, only 4 and 5 are feasible.
I think 3 and 5 are feasible. The overlap is 5, and I'd be happy to support that.
We mostly disagree on what constitutes productive interaction.
Indeed, and we both have our failings there. Thanks for taking the time/energy to stick with it.
because
set_index
will still be the more obvious choice, also for arrays
Why "obvious"? For the name "set_index
" vs. "set_axis
"? Or because you like everything in the same method?
(either way, I think it's just a matter of having clear documentation)
I think 3 and 5 are feasible.
3 was so far ruled out by both me and @jreback - not very useful to cite it if you don't have new arguments. I guess we never discussed much 5, so I'll be happy to mention it in the next live chat. I still think we will end up choosing 4.
Why "obvious"? For the name "
set_index
" vs. "set_axis
"?
That plays a large role, yes, because we're set
-ting the index
-attribute, and because having an intuitive API is crucial. Documentation does not help with intuitiveness.
3 was so far ruled out by both me and @jreback - not very useful to cite it if you don't have new arguments.
I have enough arguments**, but so far you have hardly responded to them, except by appeal to authority. 4 or 5 is your opinion, 3 or 5 is mine. Of course, you're in a much better position to enforce your ideas, but that does not improve the strength of your argument.
Still, we have already found some sort of minimal common ground with 5. If your preference for 4 over 5 is not too large to be overcome at all, I'd be happy to submit a PR that implements 5.
I realised I haven't even fully articulated one of the most important arguments against 4: saying that arrays can only be used in set_axis
does not solve the lack of a set_index
-method for Series**! The rest of my main arguments is recapped below the fold:
appeal to authority
This is a way to waste the time of both of us. So far, the only effect in this discussion of me being a core dev is that I am patiently replying to your rants, instead than doing more productive things.
But since you later repeated your arguments in a tidy way, I will for the last time to assume you're trying to be constructive, and repeat my objections to your arguments. But neither your arguments nor my replies are new - and my replies are not just mine, other devs contributed to this and related discussions. If you want to just discuss the points below over and over, let's do this in a live chat so at least we wast less time.
that "arrays-in-
set_index
" was a specific enhancement
The fact that an ability is added later on does not mean it cannot be deprecated (actually the opposite).
and that this is core functionality not least due to its intuitiveness and centrality of indexes to pandas
"setting index with arrays" is a functionality (not so sure it is even a "core" one - setting from columns is much more frequent in the code I see), having it in set_index
is just a matter of organizing the API. axis
is not a strange term in pandas, it is one of the core concepts, so while index
might be even more immediate, we are definitely not hiding the functionality by keeping it in set_axis
.
that axis/index are two sides of the same coin (and not two duplicate coins), and that having different behaviours between
set_index
andset_axis
is further detriment to intuitiveness
Your argument would suggest we want to make them almost-duplicates (apart from the axis=
argument), and this is precisely something any good API should avoid. Vice-versa, separating the functionalities allows for less ambiguity (no, "ambiguity" is not an argument against option 5, but having a single method do widely different things based on arguments is not a favor to our users).
that deprecating
list-as-array
(vs.list-as-collection
) is a goal to strive for, much more than a special case
As already clearly stated: you are right this would simplify our life, but we don't want our users to pay for. We always allowed users to skip the step of explicitly creating a vectorized object to feed our vectorized objects, and I see no reason why we should change our mind now. As for terminology, 2D arrays are collections of 1D arrays, which are collections of elements, so lists of lists make perfect sense.
saying that arrays can only be used in
set_axis
does not solve the lack of aset_index
-method for Series
There is nothing to "solve" if the set_index
does something which is completely useless for Series
- we have set_axis
that does the useful part. And we have discussed this some time ago. This said, if we do find that some sort of Series.set_index
is useful for compatibility (like the MultiIndex
methods which were backported to flat Index
es as idempotent), I will have no general objections. But it will be a result of, not an argument for, our decision about DataFrame.set_index
, and it is OT here.
So far, the only effect in this discussion of me being a core dev is that I am patiently replying to your rants, instead than doing more productive things. [...] I will for the last time to assume you're trying to be constructive, [...]
I appreciate the time you took, and thanks for that. Indeed I was trying to have a productive discussion, and what seemed like rants to you was my response to the perception that you dismissed my points out of hand (until your last reply).
Your argument would suggest we want to make them almost-duplicates (apart from the
axis=
argument), and this is precisely something any good API should avoid.
The point is that the concepts axis
and index
are "almost-duplicates" already, and so their methods should reflect that. Thinking of the methods as duplicate is having things backwards, because the concept comes before the method.
We always allowed users to skip the step of explicitly creating a vectorized object to feed our vectorized objects, and I see no reason why we should change our mind now.
Except maybe if it is the cause of the hotly-contested ambiguity of list_of_scalar
that stalled this whole discussion since about a year. It's clear that there are no great choices here, but leaving users to decipher the difference between set_index([0, 1, 2])
and set_index([[0, 1, 2]])
is not amazing either.
Anyway, thanks for taking the time to respond. I'll note that you didn't react to my attempt at extending an olive branch though:
@h-vetinari: Still, we have already found some sort of minimal common ground with 5. If your preference for 4 over 5 is not too large to be overcome at all, I'd be happy to submit a PR that implements 5.
The point is that the concepts
axis
andindex
are "almost-duplicates" already, and so their methods should reflect that
In API design, "adherence to language" is only one of the many arguments - and an argument which is not new in this discussion. And for sure the fact that we have (almost-)duplicates in language doesn't mean we want (almost-)duplicates in the API.
the hotly-contested ambiguity of
list_of_scalar
that stalled this whole discussion since about a year
You seem to think that this problem is important because it stalled for a year - maybe it stalled for a year because there were more important things ;-)
It's clear that there are no great choices here, but leaving users to decipher the difference between
set_index([0, 1, 2])
andset_index([[0, 1, 2]])
is not amazing either.
Flat lists and nested lists should be clearly different objects to our users.
I'll note that you didn't react to my attempt at extending an olive branch though
I did. I said I don't like 5, I said why, but I also said I'm happy to discuss at our next live chat.
Now, unless there are new arguments, I suggest we wait for that.
And for sure the fact that we have (almost-)duplicates in language doesn't mean we want (almost-)duplicates in the API.
Then the consequence should be deprecating set_axis
, but not having very different methods for almost duplicate concepts.
[...] but I also said I'm happy to discuss at our next live chat.
Is there a date/time already?
Then the consequence should be deprecating
set_axis
, but not having very different methods for almost duplicate concepts.
Setting the index from columns and from data are two operations (I guess this is what you mean by "concepts") which are different enough (we have seen) as to raise the need to avoid ambiguity. It can be done through different functions, or different args, or by defining special cases users should be aware of, but we definitely agree they are distinct. Which of these solutions we pick is mostly not a matter of linguistics.
Is there a date/time already?
No
No
(but just write me privately if you want to set up a chat with me)
I'd like to point out that the tone of this thread makes me a bit uncomfortable. As a reminder, this project has a code of conduct
https://github.com/pandas-dev/pandas-governance/blob/master/code-of-conduct.md
In such discussions, I think we (both maintainers and contributors) need to stick to facts and technical arguments and leave feelings and editorial comments out of the process. There is a risk in technical arguments to stoop to emotive conjugation (https://en.wikipedia.org/wiki/Emotive_conjugation) in describing others' actions.
In general my understanding is that this project operates on the basis of consensus-based decision making -- when there is no consensus about a change, the default option is probably to do nothing. In theory as the BDFL I can help settle disagreements, but I would prefer not to except in truly exceptional circumstances.
I question whether GitHub issues was the appropriate venue for this discussion compared with some form of RFC / design document.
@wesm Thanks for taking the time to respond here, although I regret that the reason was due to discomfort.
I have striven to avoid any emotionally charged words, but don't claim that I always succeeded. I believe all participants truly want the best for the combination of user- & maintainer-base, but such impassioned arguments take a lot of time and energy (which, I presume is the reason why many participants have not joined the discussion anymore).
I do object to the way some things were handled in this whole episode, but will not dwell on that. My main reasons for not resigning from this discussion are that I don't want the case dismissed without a fair hearing/counter-argument (even if I'm not core-dev), and that I feel the picture is much less one-sided even on the dev-side, as the impression that the last few comments might give.
I question whether GitHub issues was the appropriate venue for this discussion compared with some form of RFC / design document.
I'd be happy to participate in another format, but didn't know a better way than through an issue here.
PS. Thanks for the link about emotive conjugation. "How would I describe myself in their shoes" will be an excellent self-check before speaking/posting.
In the last dev chat, which was held just few minutes ago, this issue was discussed and there was clear consensus for option 4, that is, "deprecate using set_index with arrays, and point to set_axis instead".
(Related to the "tone" of the discussion, I definitely try to keep emotionally charged language away from my comments, but whenever should I fail to do so, I welcome being explicitly reprimanded - ideally in private. I'm not a native speaker, and in any case I definitely won't be offended by any such rebuke. On the other hand, language aside, I think that when a discussion takes more of our time and energy than it is worth, there is nothing wrong in stating it, even if it is not a "fact or technical argument" on pandas itself.)
@toobaz can you give some reasoning why this is the clear preference? I would need to go through this long thread the understand it more, but maybe some arguments were summarized on the call?
@toobaz Thanks for seeing this through. Although I would have liked to participate in the dev chat about this, and although I find the decision suboptimal, I guess any decision is better than no decision at this point. Time permitting, I'll look into a PR that deprecates arrays from set_index and outputs a nice warning to use set_axis.
@jorisvandenbossche The lack of documentation and transparency (note: not an accusation, that's just the way it has been so far) in such cases is why I'm thinking about a pandas version of PEPs/NEPs (#28568). I have been swamped recently and couldn't respond on that issue, but I will pick it up again.
@toobaz can you give some reasoning why this is the clear preference?
To be honest there was more a recall of the reasoning already exposed here than any new argument. Some devs (e.g., @WillAyd ) were already clearly in favour of option 4.
The only new thing was a proposal (by @TomAugspurger if I recall correctly) to deprecate set_index
entirely, replacing with two methods set_index_keys
and set_index_values
, a clean, but more disruptive, solution. But in the end, consensus on 4 was reached pretty quickly.
@h-vetinari you probably already know, but just in case: the call was publicly announced on the [pydata] mailing list. In any case, again, if there had been new arguments I would have written them here. It is hard to see the 76 comments here + other in related issues a "lack of documentation and transparency".
I'm thinking about a pandas version of PEPs/NEPs (#28568).
Wlll reply there.
@toobaz I didn't know the announcement, thanks for the info.
@toobaz: It is hard to see the 76 comments here + other in related issues a "lack of documentation and transparency".
I do not consider dispersed discussion in several threads and comments as appropriate documentation (again: not as a criticism of you or the other devs, but rather of the current process). I added a comment about this in #28568.
@h-vetinari The calls and meeting notes are public.
to deprecate set_index entirely
Not quite: I was just choosing different names to highlight the different behavior. Clearly set_index should stay :)
Not quite: I was just choosing different names to highlight the different behavior. Clearly set_index should stay :)
OK, thanks for the clarification ;-)
This is coming out of a discussion that has stalled #22225 (which is about adding
.set_index
to Series, see #21684). The discussion has shifted away from what capabilities a putativeSeries.set_index
should have, but what capabilitiesdf.set_index
has currently.The main issue (for @jreback) is that
df.set_index
takes arrays:Further on:
I don't think I am confusing them. If I want to set the
.index
-attribute of a Series/DataFrame, then using.set_index
is the most reasonable name by far. If anything,set_axis
should be a superset ofset_index
(and a putativeset_columns
), that just switches between the two based on theaxis
-kwarg.More than that, the current capabilities of
df.set_index
are a proper superset ofdf.set_axis(axis=0)
*, in that it's possible to fillkeys
with only*Series
/Index
/ndarray
/list
etc.:** there is one caveat, in that lists (and only lists; out of all containers) need to be wrapped in another list, i.e.
df.set_index([[0, 8, 3, 0]])
instead ofdf.set_index([0, 8, 3, 0])
. This is the heart of the ambiguity that @jreback mentioned above (because a list is interpreted as a list of column keys).Summing up:
set_index
is the most natural name for setting the.index
-attributedf.set_index
should be able to process list-likes (as it currently does; this is the source of the ambiguity of the list case).df.set_axis
should be able to do everything thatdf.set_index
does, and just switch between operating on index/columns based on theaxis
-kwarg (after all,index
andcolumns
are the two axes of a DF).set_columns
on aDataFrame
axis
-kwarg ofset_axis
should just switch between the behaviour ofset_index
(i.e. dealing with keys and array-likes) andset_columns
.Series.set_index
should support the same signature asdf.set_index
, with the exception of thedrop
-keyword (which only makes sense for column labels).set_index
andset_axis
methods should be exactly the same.Since I can't tag @pandas-dev/pandas-core, here are a few individual tags: @jreback @TomAugspurger @jorisvandenbossche @gfyoung @WillAyd @jbrockmendel @jschendel @toobaz.
EDIT: Forgot to add an xref from @jreback:
In that issue, there's discussion largely around
.rename
, and how to make that method more consistent. Also discussed was potentially introducing.relabel
, as well as.set_columns
.