Open h-vetinari opened 5 years ago
@h-vetinari should list('ABC') in the first example be list('ABCD')? If not then I am confused in several directions.
@jbrockmendel That was indeed an artefact from merging together several things from the other thread to make this issue...
@jreback Any comments here or in #22225?
@jreback I am honestly stunned by you closing #22225 and then locking it after I objected. So much for my motivation to work on some big PRs today.
@h-vetinari you are not listening. If you want to raise an issue or comment feel free.
I've opened this issue here for exactly this purpose (discussing your objections to existing capabilities of DataFrame.set_index) over a month ago.
@h-vetinari While I think locking #22225 was an unnecessary move from @jreback, you have to realize that the "overruling approving reviews" thing is not a good argument to raise in such a discussion. True, in pandas we look for devs consensus, but in the end we prefer to do so by arguing rather than by any form of proper vote. So that comment was not very useful, to use a euphemism.
Back to the topic of the discussion: if I understand correctly, @jreback is arguing that df.set_index is already ambiguous enough to make it a bad idea to sponsor its use when passing anything but keys (which would be the only sensible use in Series.set_index); and at the same time, you are suggesting that
[ ] For Series, the set_index and set_axis methods should be exactly the same.
If this summary of the discussion is correct, then I think I am also against introducing Series.set_index.
And actually, I'm probably in favor of deprecating df.set_index to pass actual values, if df.set_axis is able to fulfill exactly the same task (didn't check).
I understand your argument about df.set_index being what one expects to use to set the .index (to some values)... but set_axis is simply too long established to be removed, and I find duplication in the API a worse problem than the maybe sub-optimal naming. Or in other words, I think
The axis-kwarg of set_axis should just switch between the behaviour of set_index (i.e. dealing with keys and array-likes) and set_columns.
is a bad idea. "There should be one - and preferably only one - obvious way to do it."
EDIT: By the way: sorry for not reacting before to the ping - busy period.
@h-vetinari In my opinion, the locking was not the best way to handle the discussion in the PR, so sorry about that. In the meantime, Jeff has unlocked the conversation there, but let's continue the discussion here.
On the topic: given the behaviour of DataFrame.set_index (supporting setting a full array-like in addition to a list of column names), I personally don't have problems with adding a similar behaviour for Series.set_index.
I'm probably in favor of deprecating df.set_index to pass actual values, if df.set_axis is able to fulfill exactly the same task .....
.... but set_axis is simply too long established to be removed, and I find duplication in the API a worse problem than the maybe sub-optimal naming
@toobaz I personally have almost never seen someone use set_axis (and never used it myself), but I do regularly see people use set_index with a full Series/array (but that's subjective of course).
So personally, I would rather deprecate set_axis and only keep set_index (but: set_axis also has the ability to set the columns, which set_index cannot do, so it is not fully duplicative and therefore probably cannot easily be deprecated).
Agreed that locking was not appropriate.
On the issue itself, to me it's pretty clear that Series.set_index(sequence) is a limiting case of DataFrame.set_index(Sequence[sequence])
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"A": [1, 2, 3]})
In [3]: df.set_index([['a', 'b', 'c']])
Out[3]:
   A
a  1
b  2
c  3
Since In[3] works, I would expect that
In [4]: df.A.set_index([['a', 'b', 'c']])
would work as well.
I don't think continuing to post here or in the associated PR is an effective use of anyone's time. Why don't we just add it as a discussion point to the next dev chat?
The implementation is fine, and deserves to go in 0.24.0 if we can agree on the desired behavior. No need to delay I think.
There isn't agreement on the desired behavior, which is why I suggest moving to a separate forum. I don't think it's something we need to push into 0.24 at the end here either.
I really don't think this is a difficult decision though, is it? Do we want Series.set_index to accept arrays like DataFrame.set_index or not? Joris and I are I think +1, or at least ambivalent. Jeff seems to think that DataFrame.set_index doesn't accept arrays (e.g. https://github.com/pandas-dev/pandas/pull/22225#issuecomment-451758441)
DataFrame.set_index() which accepts keys (which are column names / levels). NOT an array of values.
which isn't correct, as shown in https://github.com/pandas-dev/pandas/issues/24046#issuecomment-451921548.
I agree that if it seems difficult to find an agreement, discussing this on the next dev chat can be more productive / effective. But, until now, there hasn't been much discussion on the PR, apart from between Jeff and h-vetinari (other people commented on the PR, but didn't really get involved in the API discussion, apart from Tom agreeing with h-vetinari). So I would first still be interested in hearing what other people think about it, as I personally don't really see much reason to object to it.
On the issue itself, to me it's pretty clear that Series.set_index(sequence) is a limiting case of DataFrame.set_index(Sequence[sequence])
@TomAugspurger We all agree (I think) that it makes sense for df.set_index([a_list_of_labels]) to work. I think @jreback makes a good point however that there is no obvious reason (except parameter ambiguity) for df.set_index(a_list_of_labels) not to work (since df.set_index(Series(a_list_of_labels)) does), and that this causes a potential confusion that df.set_axis doesn't. Then maybe we can live with it... but let's admit this is not ideal.
One alternative (which I don't particularly like) is what (I think) .groupby(a_list) does, i.e., trying to find elements of a_list in the axis, and falling back to considering them as values otherwise.
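To make the fallback idea concrete, here is a minimal pure-Python sketch. It is a toy model only: resolve_list_of_scalars, columns and n_rows are hypothetical names invented for illustration, and this is not how pandas implements set_index or groupby.

```python
def resolve_list_of_scalars(columns, n_rows, a_list):
    """Toy rule: prefer column keys, fall back to treating the list as values.

    `columns` stands in for a frame's column labels and `n_rows` for its
    length; both are illustrative assumptions, not pandas internals.
    """
    # First try: every element names an existing column -> treat as keys.
    if all(item in columns for item in a_list):
        return "keys"
    # Fallback: a list as long as the frame can serve as the new index values.
    if len(a_list) == n_rows:
        return "values"
    raise ValueError("neither valid column keys nor a full-length sequence of values")


print(resolve_list_of_scalars(["A", "B"], 3, ["A", "B"]))       # keys
print(resolve_list_of_scalars(["A", "B"], 3, ["x", "y", "z"]))  # values
# The trap: with integer column labels, a list meant as values is read as keys.
print(resolve_list_of_scalars([0, 1, 2], 3, [1, 0, 2]))         # keys
```

The last call shows why such a rule stays surprising: the outcome depends on what the column labels happen to be, not on what the caller meant.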
I am -1 due to ambiguity. I don't know what the desired behavior of the following is:
df = pd.DataFrame(np.ones((3,3)))
df.set_index([1, 0, 2])
@toobaz can you show an example of df.set_index(a_list_of_labels) vs. [a_list_of_labels]? I don't think that #22225 is changing that at all.
@WillAyd that ambiguity exists today, and is unchanged by #22225. I don't think anyone has proposed deprecating that behavior.
I am -1 due to ambiguity. I don't know what the desired behavior of the following is:
That is not fully the discussion. As that is about a DataFrame, and that behaviour is already defined (it first prefers column names).
The question is rather what pd.Series([0, 0, 0]).set_index([1, 0, 2]) should do, which is much less ambiguous.
Given the confusion and talking next to each other, it might be good if someone attempts to make a good illustrated and complete summary of the actual discussion.
Is https://github.com/pandas-dev/pandas/issues/24046#issuecomment-451921548 a good summary? Make Series.set_index the limiting case of DataFrame.set_index? Any confusion points there?
@toobaz can you show an example of df.set_index(a_list_of_labels) vs. [a_list_of_labels]? I don't think that #22225 is changing that at all.
No, it's not. But as already stated, if df.set_index(values_rather_than_keys) is a regrettable legacy causing ambiguity in the API, we'd rather not enhance its usage by paralleling it with Series.set_index, which would do only that (which is already done by Series.set_axis). I actually suggested deprecating it... which might not be our final decision, but is certainly related to #22225.
@toobaz my apologies, I missed the paragraph where you suggested deprecating non-label values in DataFrame.set_index. Indeed, if we want to deprecate that then we should not go forward in #22225.
On deprecating passing values, rather than column labels, to DataFrame.set_index: I don't think we should deprecate that. While there is ambiguity, as noted in @WillAyd's example in https://github.com/pandas-dev/pandas/issues/24046#issuecomment-451991224, I think it's quite useful to pass a mix of labels and keys.
In [11]: df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
In [12]: df.set_index(["A", [1, 2, 3]])
Out[12]:
     B
A
1 1  4
2 2  5
3 3  6
Without that, I think you'd have some ugly
In [18]: df.set_axis(pd.MultiIndex.from_arrays([df.A, [1, 2, 3]]), inplace=False).drop(['A'], axis=1)
Out[18]:
     B
A
1 1  4
2 2  5
3 3  6
you could just raise / warn on ambiguity
this needs coupling with possibly deprecating set_axis as well because passing values is not documented in any way
this needs coupling with possibly deprecating set_axis as well
I would personally be happy to get rid of set_axis, but I think the main problem there is that we don't have an alternative for df.set_axis(['a', 'b'], axis=1) (setting column names)?
this needs coupling with possibly deprecating set_axis as well
Deprecating the one method that works as expected?!
I would personally be happy to get rid of set_axis, but I think the main problem there is that we don't have an alternative for df.set_axis(['a', 'b'], axis=1) (setting column names)?
The problem is also for axis=0... (unless you know about the nested list trick)
I think it's quite useful to pass a mix of labels and keys.
I would never do this in my code (I would rather add a column, use set_index(append=True) and possibly reorder levels). But sure, in this case set_index is strictly more powerful...
@toobaz: True, in pandas we look for devs consensus, but in the end we prefer to do so by argumenting rather than by any form of proper vote. [my emphasis]
That was exactly my point - locking an on-going discussion is not an argument. But enough of that, thanks for reverting the lock @jreback and the other voices who expressed support.
The other comments are quite divergent, so I won't address them one-by-one.
@jorisvandenbossche: Given the confusion and talking next to each other, it might be good if someone attempts to make a good illustrated and complete summary of the actual discussion.
I attempted as much in the OP (under the assumption that set_axis would not be deprecated), but I'll try again, with the following list of statements (used as abbreviations in the table below):
Series.set_index should exist, and accept array-likes
DataFrame.set_index should accept array-likes
DF.set_index(list_of_scalars) must be solved (orthogonal to #22225)
DataFrame.set_index should allow a mix of array-likes and column keys.
set_axis could be deprecated in favour of set_index
DataFrame.set_columns should be introduced for axis=0 case (see also some discussion in #14829)
set_axis could simply switch between set_index and set_columns depending on axis-kwarg
The following table is my best attempt at summarizing the existing positions. Apologies for any mistakes, and feel free to edit "your" line in this table or comment (+ / - / ? are clear, and I use ~ as indifference):
@... thinks that ... | 1 | 2 | 3 | 3.i | 4 | 5 | 5.i | 6 |
---|---|---|---|---|---|---|---|---|
@h-vetinari | + | + | ~ | + | ~/+ | ~ | + | + |
@jorisvandenbossche | + | + | ~ | ? | ? | + | ? | ? |
@jreback | ~? | -? | + | + | ? | ~? | ? | ? |
@TomAugspurger | + | + | ~ | ? | + | ? | ? | ? |
@toobaz | - | ~ | ~ | ~ | ~ | - | - | - |
@WillAyd | - | - | + | - | - | - | - | - |
EDIT: just saw that there was a bunch of discussion on gitter that I didn't see. The content of that is not reflected in the table above.
Thanks for the summary @h-vetinari. I think you've accurately summarized my views on 1, 2, and 3. I don't have any thoughts on changing set_axis right now.
I think it's quite useful to pass a mix of labels and keys.
I would never do this in my code
I probably wouldn't either :) But I can see it being useful.
I probably wouldn't either :) But I can see it being useful.
By the way: this is the only reason I see for not deprecating/discouraging passing values to DataFrame.set_index... and it is a use case that Series.set_index would not cover.
@jreback: this needs coupling with possibly deprecating set_axis as well because passing values is not documented in any way
It is documented (albeit scarcely) in v0.23.4 and master:
keys : column label or list of column labels / arrays
This parameter bullet has no description either, so that last "/ arrays" has to be counted as documentation for sure, IMO.
@toobaz: By the way: this is the only reason I see for not deprecating/discouraging passing values to DataFrame.set_index... and it is a use case that Series.set_index would not cover.
Why not make DataFrame.set_axis(..., axis=0) equal in capabilities to DataFrame.set_index? The main part is setting the index, and yes, for DataFrames one can reasonably use existing columns instead of arrays. But how would this be contradicting the fundamental purpose of set_axis? I mean, a frame has two axes, right...? Assuming someone thinks "I want to set this axis", why not let them set it with the same capabilities as set_index?
To this end, I don't agree with your point further up the thread that having set_axis(..., axis=0) == set_index [and set_axis(..., axis=1) == set_columns] would violate the zen of python. Each of those methods has a clear and distinct purpose, but set_index is more specialised than set_axis (which is why one could easily dispatch from the latter to the former).
Why not make DataFrame.set_axis(..., axis=0) equal in capabilities to DataFrame.set_index?
OK, I think we're (I'm, at least) getting lost... so forget for a moment the current API, assume we start from scratch. We want the following:
1. DataFrame.index from single column key
2. DataFrame.index from sequence of values
3. DataFrame.index from sequence of column keys
4. DataFrame.index from sequence of sequences of values
All of these are good. Problem: a list of scalars is both a valid sequence of values (2.) and a potentially valid sequence of column keys (3.). There are two ways around the ambiguity: either (my preferred choice) we split functionalities 1. and 3. vs. 2. and 4. in two different methods, discarding 5., or we make an arbitrary choice and document it.
Now, back to the current API. We have set_index that does all of this, we have the ambiguity between 2. and 3., resolved in favor of 3. I don't like this, but other devs suggest it's not so problematic, that 5. is worth keeping, and certainly we can at least document everything better.
Then we have a method which does cleanly precisely what it is expected to: set_axis (it does 2. and 4.). Can we please at least agree on not making it ambiguous too, or deprecating it? Then we can continue the discussion on set_index ;-)
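For readers trying to keep the cases apart, the following pure-Python sketch classifies an argument purely by its shape. It is a toy model with a hypothetical helper name (classify), not pandas code; the mixed case is what the thread later calls case 5 (list-likes crossed with column labels).

```python
def classify(arg):
    """Classify a set_index-style argument by shape alone (illustrative only)."""
    # Anything that is not a list is treated as one column key.
    if not isinstance(arg, list):
        return "single column key"
    nested = [isinstance(item, list) for item in arg]
    # Every element is itself a list: sequences of values, one per index level.
    if arg and all(nested):
        return "sequence of sequences of values"
    # Some elements are lists and some are scalars: keys mixed with values.
    if any(nested):
        return "mix of column keys and sequences of values"
    # A flat list of scalars: shape alone cannot tell values from column keys.
    return "ambiguous: sequence of values or sequence of column keys"


print(classify("A"))
print(classify([1, 0, 2]))
print(classify([["a", "b", "c"]]))
print(classify(["A", [1, 2, 3]]))
```

Only the flat list of scalars cannot be decided from the argument alone, which is the ambiguity under discussion.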
@toobaz: Now, back to the current API. We have set_index that does all of this, we have the ambiguity between 2. and 3., resolved in favor of 3. I don't like this, but other devs suggest it's not so problematic, that 5. is worth keeping, and certainly we can at least document everything better.
I think it's not so problematic if documented well. As you mentioned further up, groupby does similar gymnastics. I forgot to add that option to the table above.
Then we have a method which does cleanly precisely what it is expected to: set_axis (it does 2. and 4.). Can we please at least agree on not making it ambiguous too, or deprecating it? Then we can continue the discussion on set_index ;-)
It's not central to the discussion here, but ideally, I'd like to solve the ambiguity and make set_axis have the same capabilities. Because setting an index IS setting an axis (and if you want to distinguish 1./3. from 2./4. by method, then it should be with a name that supports this distinction). Actually, my ordered preference for set_axis would be:
dispatch to set_index > deprecate set_axis entirely > keep as-is
I do think it's worth considering to keep 5, but I can't see a good way to get rid of the ambiguity while keeping it. If we're trying to get rid of the ambiguity first, then I think a good solution could be to add an arrays-kwarg. The name "keys" is not an accurate description of what's happening anyway:
Docstring could look like:
def set_index(keys=None, arrays=None, drop=True, append=False, inplace=False,
              verify_integrity=False):
    """
    Set the index (row labels) using one or more given labels or arrays.

    keys : column label or list of column labels
    arrays : array-like or list of array-likes
    drop : ...
    [...]
    """
    # exactly one of `keys` / `arrays` must be given; the comparisons are
    # parenthesised because `^` binds tighter than `is`
    if (keys is None) == (arrays is None):
        raise ValueError('must pass exactly one of `keys`/`arrays`')
    # deprecate passing arrays as keys
When I evaluate an API change I try to achieve multiple objectives:
1) simple mental model
2) consistency
The most prevalent use case of DataFrame.set_index is to set keys (from columns). This is a simple model and works well. Array values are barely documented and have a very limited set of test cases, thus they are a rarely used case.
Adding Series.set_index would shift the primary use case to be setting with values. This IMHO violates consistency (2) because now if you are working with a DataFrame you have different primary use-cases. This violates the mental model (1) as we now have 2 methods that do exactly the same thing for Series, meaning .set_axis; it also violates (2) as you now still have to know whether you are working with a Series or a DataFrame, which I gather is the impetus for having Series.set_index in the first place.
I don't see a good way of reconciling this and thus don't see a good case for adding Series.set_index. (I had suggested a possible way forward of deprecating Series.set_axis to make room for Series.set_index, which would fix my objection, but still leave us with inconsistencies between DataFrame and Series for use case.) Adding keyword args to .set_index is not a solution as this makes things more complicated.
As @toobaz notes above, if we were to start fresh we might do it differently, but back-compat here is paramount as it's not very easy to change existing behavior.
Finally I would consider removing support for allowing arrays in DataFrame.set_index. For the small use cases where this is actually useful there are quite a lot of good solutions that did not exist when pandas first existed, e.g.
DataFrame.assign(new_col=.....).set_index(['A', 'new_col'])
is quite idiomatic.
How do we know that setting arrays with DataFrame.set_index isn't common?
Would you say that, from a consistency point of view, accepting an array for Series.set_index is more consistent than not?
@jreback Thanks for taking the time for a comprehensive answer.
The most prevalent usecase DataFrame.set_index is one to set keys (from columns). This is a simple model and works well. Array values are barely documented [...]
I agree with @TomAugspurger that we can't know the prevalence a-priori. df.set_index(df2.index) is simply too intuitive, and just works (assuming the same length), without having a look in the docs at all. Same for df.set_index([df2.index, df3.some_column]).
[...] and have a very limited set of test case, thus they are a rarely used case.
It is extensively tested, already before #22236 (but even more extensively since that PR) and #22486. In particular, all list-likes (except sets) are tested, and all combinations between list-likes, also crossed with column labels (i.e. case 5. of @toobaz' list above).
As @toobaz notes above, if we were to start fresh we might do it differently, but back-compat here is paramount as its not very easy to change existing behavior.
Of course I'm thinking of back-compat as well. How about the suggestion above to turn keys and arrays into kwargs? This would have to deprecate the mixed case, but would remove the ambiguity at least.
Finally I would consider removing support for allowing arrays in DataFrame.set_index. For the small use cases where this is actually useful there are quite a lot of good solutions that did not exist when pandas first existed, e.g. DataFrame.assign(new_col=.....).set_index(['A', 'new_col']) is quite idiomatic.
Won't that cause extra copying in assign? In any case, deprecating mixes of keys and arrays might be a good idea, but I strongly believe that arrays should not be deprecated (see above).
How do we know that setting arrays with DataFrame.set_index isn't common?
easy enough, it's barely documented, and I have never seen a bug report about this.
Would you say that, from a consistency point of view, accepting an array for Series.set_index is more consistent than not?
no, this is very confusing because it is different than the primary usecase for DataFrame.set_index.
@h-vetinari
I agree with @TomAugspurger that we can't know the prevalence a-priori. df.set_index(df2.index) is simply too intuitive, and just works (assuming the same length), without having a look in the docs at all. Same for df.set_index([df2.index, df3.some_column]).
Show me where this is in the documentation
Of course I'm thinking of back-compat as well. How about the suggestion above to turn keys and arrays into kwargs? This would have to deprecate the mixed case, but would remove the ambiguity at least.
no this is strictly worse and more confusing.
Won't that cause extra copying in assign? In any case, deprecating mixes of keys and arrays might be a good idea, but I strongly believe that arrays should not be deprecated (see above).
I am not concerned about performance; being consistent is paramount. I am not strongly for deprecating, though I think it would be better, because as I say we already have a way to do this, .set_axis
@jreback: Show me where this is in the documentation
0.12 - 0.20.3 (very likely even further back; 0.12 is the oldest one online):
>>> indexed_df = df.set_index(['A', 'B'])
>>> indexed_df2 = df.set_index(['A', [0, 1, 2, 0, 1, 2]])
>>> indexed_df3 = df.set_index([[0, 1, 2, 0, 1, 2]])
0.21.0 - master (last example):
Create a multi-index using a set of values and a column:
>>> df.set_index([[1, 2, 3, 4], 'year'])
         month  sale
   year
1  2012  1      55
2  2014  4      40
3  2013  7      84
4  2014  10     31
This makes me think that even @toobaz' case 5 might be too entrenched to deprecate, but what could also be done is just deprecating list-likes that are not array-like (like for str.cat in #22264). I think this is my favourite solution so far, because it solves the ambiguity while maintaining essentially all functionality (i.e. passing keys / ndarray / Series / Index, or a list-like combination thereof).
Further options would be to just clearly document the current "list-gets-interpreted-as-keys", or add a groupby-like fall-back that a list without any keys will be used as an array. In all three cases, there is no conflict with Series.set_index supporting arrays.
@jreback: [...] I have never seen a bug report about this.
That could just as well be because it always either works or gives a clear error that the length does not match. (also see above: implemented since forever)
>>> pd.DataFrame(np.eye(2)).set_index(pd.Series([1, 2, 3]))
[...]
ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements
Let me extract that last idea a bit, lest it get lost in the larger post above:
How about deprecating list-likes that are not array-like? This would:
list_of_scalar
@jreback @jorisvandenbossche @TomAugspurger @toobaz @WillAyd
An implementation of that last suggestion is in #24697, to better be able to discuss this.
How about deprecating list-likes that are not array-like? This would:
I already replied in #24697 ... but I'm afraid more in general your proposal doesn't solve the ambiguity nicely. If in "list-likes" you include lists, then
df.set_index([['a', 'b']])
now doesn't mean anything... which is (I guess) worse than it meaning "take these values". If in "list-likes" you exclude lists (which, if I understand correctly, is what you do in #24697), then the ambiguity is unaffected.
@toobaz: If in "list-likes" you include lists, then df.set_index([['a', 'b']]) now doesn't mean anything
That is indeed the idea (and how it is implemented in #24697). I'd say it's unclear what df.set_index([['a', 'b']]) is even supposed to mean (for people who don't know the nested-list trick), and the double braces are easy to overlook.
Lists, like tuples, are used for too many things IMO (causing ambiguities, "magic" methods, and code complexity). I'd also postulate that outside of toy examples, they're hardly used as replacements for arrays either. In most real-world cases, people already have Series/Index/arrays that they're juggling.
We've had a similar deprecation in #22264 for str.cat, because it will make the code much easier to maintain, and does not sacrifice any essential functionality (it's very easy to wrap a correctly-sized list into an array/Series etc.).
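A rough sketch of what such a deprecation could look like, written as a pure-Python toy. The helper check_keys is hypothetical and is not the code from #24697; array.array from the standard library merely stands in for an ndarray/Series/Index.

```python
import warnings
from array import array


def check_keys(keys):
    """Toy check: warn when a plain list is nested inside the list of keys."""
    for item in keys:
        if isinstance(item, list):
            # Nested plain lists are the case proposed for deprecation.
            warnings.warn(
                "passing a list inside a list is deprecated; wrap it in an array",
                FutureWarning,
                stacklevel=2,
            )
    return keys


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_keys([["a", "b"]])          # nested plain list: warns
    check_keys([array("q", [1, 2])])  # array-like stand-in: silent
    check_keys(["A", "B"])            # flat list of column keys: silent
print(len(caught))  # 1
```

Under this assumption a flat list is always a list of keys, and values have to arrive as an explicit array-like, so no shape is left with two readings.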
After not getting a response in #24697, I'd like to ask again how to proceed here. Deprecating lists within lists would have IMO been the most elegant solution, and was the only complaint against adding Series.set_index. Now that the decision has been taken to live with this ambiguity, can we please reopen #22225?
I waded into this before learning how contentious it is. I would like to propose the following compromise:
Series.set_index with behavior matching current DataFrame.set_index with respect to arrays. List of lists not allowed.
set_axis instead.
set_index for column keys forever. keep set_axis for setting labels forever.
set_index
The benefits are:
if set_axis is frowned upon, the deprecation would point users towards whatever method (relabel, relabel_axis, rename, whatever) should be the final resting place of this functionality. There's not much cost to making Series compatible with DataFrame's current behavior in this, and then taking steps to (finally) remove it for both.
It is vitally important to have a method-chaining friendly way to set a series index. I'd like to help make that happen. Note that set_axis is not very well known currently, and appears in the documentation only in the API reference section. If acceptable, I can fix that as well.
@pilkibun: I waded into this before learning how contentious it is.
Haha, me too. ;-)
Actually, a lot happened after #22225, namely the discussions in #24046 and #24697 are pertinent. The principal bone of contention (from my POV) was how to deal with the ambiguity of list_of_scalars being interpreted as an array-like vs. as a list_of_column_keys. Personally, I would have deprecated list_of_scalars_as_an_array, but that ultimately got rejected after non-public discussions of the core-devs.
In any case, the issue remains a sorely lacking feature IMO, as users are (and should be) discouraged from directly manipulating attributes (e.g. s.index = <index>), and Series.set_index is doubtlessly the most intuitive method for setting the index (and also missing as an analogue for DF.set_index). I do not think that set_axis is an adequate replacement.
So far, all mentioned proposals boil down to the following:
Series.set_index as in #22225 but live with the ambiguity of list_of_scalars (resp. resolve that ambiguity towards list_of_column_keys consistently)
list_of_scalars_as_an_array as in #24697 and then add Series.set_index
set_index with arrays, and point to set_axis instead [if I understand you correctly]
keys and arrays in the signature of set_index
To me, only 2, 3 & 5 are reasonable solutions vis-à-vis user-friendliness.
EDIT: added a 5th option, which I had forgotten
I'm still in favor of 4, but I admit I got totally lost in the discussion(s)
Yes I think option 4 as well
What I'm suggesting is, for now, to ensure as much parity as possible between DataFrame and Series set_index. If a decision is later made to move some functionality elsewhere, it's no more difficult to deprecate both simultaneously. And if no agreement can be reached on that, I suggest it's acceptable to have the same functionality in both.
Note that set_index(array, append=True) and set_index([array, array]) both create MultiIndexes, which is useful and not currently possible with set_axis. We could add that later of course as prep for splitting it off. But there will still be a deprecation cycle, and in the meantime it's not much worse having arrays in DataFrame.set_index and in Series.set_index.
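Conceptually, both calls mentioned above pair up values position by position, so each row ends up labelled by a tuple with one entry per level. A minimal pure-Python sketch of that idea follows; append_level and from_arrays are hypothetical helper names and this is not pandas code.

```python
def append_level(index, values):
    """Pair each existing label with a new value, like adding an index level."""
    if len(index) != len(values):
        raise ValueError("Length mismatch")
    return [
        # Extend labels that are already tuples, otherwise start a new pair.
        (*label, value) if isinstance(label, tuple) else (label, value)
        for label, value in zip(index, values)
    ]


def from_arrays(arrays):
    """Zip several equal-length sequences into multi-level labels."""
    return list(zip(*arrays))


print(append_level([0, 1, 2], ["a", "b", "c"]))  # [(0, 'a'), (1, 'b'), (2, 'c')]
print(from_arrays([[1, 2], ["x", "y"]]))         # [(1, 'x'), (2, 'y')]
```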
I have put together #27504 that closely follows DataFrame.set_index and I'd like your support for it. +1/-1 for this suggested plan of action, please?
If a decision is later made to move some functionality elsewhere, it's no more difficult to deprecate both simultaneously
@pilkibun feel free to tell me if I misunderstood, but we certainly won't allow a feature which we plan to (probably) later deprecate
@toobaz: I'm still in favor of 4, but I admit I got totally lost in the discussion(s)
@WillAyd: Yes I think option 4 as well
I think whichever future pandas version were to remove arrays for set_index would receive massive pushback - it is one of the most intuitive parts of the API, and just works.
Also, index is an axis, so not least from the point of consistency, I'd say that set_axis should actually have the same capabilities as set_index (with the only difference being an axis-switch). It would be confusing (i.e. opposite of intuitive) to remember which of those extremely similar methods does what, if option 4 were pursued.
I think whichever future pandas version were to remove arrays for set_index would receive massive pushback - it is one of the most intuitive parts of the API, and just works.
It's trivial to replace with set_axis, which also just works. The difference (again, if I remember correctly) is just that you cannot mix labels and arrays, but I really don't expect this to be a frequent use case.
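To make the claimed equivalence concrete, here is a small sketch comparing the two spellings for a single array of new labels. It assumes a recent pandas release in which set_axis returns a new object; the data is invented for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, not taken from the thread.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
new_labels = np.array(["x", "y", "z"])

# Setting the index from an array with each of the two methods.
via_set_index = df.set_index(new_labels)
via_set_axis = df.set_axis(new_labels, axis=0)

# Raises an AssertionError if the two results differ.
pd.testing.assert_frame_equal(via_set_index, via_set_axis)
```

This only covers the single-array case; it says nothing about column keys or mixed inputs, which is where the two methods differ.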
In any case, I've never seen set_axis used with arrays in real code. I won't say it never happens, but I definitely think it is a marginal use case.
Also, index is an axis, so not least from the point of consistency, I'd say that set_axis should actually have the same capabilities as set_index (with the only difference being an axis-switch).
"consistency" is not to have duplicated methods with different names. The day that set_axis was really just a subset of set_index, we would probably just deprecate it entirely ;-)
What is definitely not consistent is that we have set_index but not set_columns. Not a huge problem, but another reason why we maybe don't want to sponsor set_index over set_axis.
@toobaz, you're still mixing up a few things (or maybe I need to write more clearly...? or you mistyped...?)
In any case, I've never seen set_axis used with arrays in real code. I won't say it never happens, but I definitely think it is a marginal use case.
The point is using arrays with set_INDEX, and keeping that use, which is certainly not marginal. Not least because it is, by a country mile, the first thing people turn to to set the index (which is very reasonable with an array of the same length as the Series/DF).
It's trivial to replace with set_axis, which also just works. The difference (again, if I remember correctly) is just that you cannot mix labels and arrays, but I really don't expect this to be a frequent use case.
Trivial, but harmful, because it would be removing the most obvious method for the job (sacrificing this usability should require something that's demonstrably better, and I'd challenge that option 4 satisfies that).
Mixing column labels and arrays might actually be a marginal use case, but it's been documented since before v0.12, so I'll reserve judgement. It would have much less impact to deprecate this aspect only, but that does not solve the ambiguity of list_of_scalar.
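For reference, the "mixing" use case under discussion looks roughly like this. It is a minimal sketch assuming a recent pandas release, with invented data and names.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, not taken from the thread.
df = pd.DataFrame({"A": ["a", "b", "c"], "B": [4, 5, 6]})
extra = np.array([10, 20, 30])

# One column label and one array in the same list of keys:
# column "A" becomes the first level, the array the second.
mixed = df.set_index(["A", extra])

print(list(mixed.index.names))
print(list(mixed.columns))
```

The level built from the column keeps the column's name, while the level built from the bare array is unnamed. With the default drop=True, column A no longer appears among the columns.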
"consistency" is not to have duplicated methods with different names. The day that
set_axis was really just a subset of set_index, we would probably just deprecate it entirely ;-)
It would be a superset (as in, encompassing both set_index and a putative set_columns, and could just be a switch between those). Although set_axis would be the first of those three that I'd get rid of, if necessary, I think that setting an index/column/axis are all eminently obvious and useful operations that deserve their own method. And since they perform the exact same task, they should also have the same interface (as far as reasonably possible; i.e. no column keys for Series).
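As a point of reference for the "switch" idea, this is how the existing set_axis already selects between the two axes of a DataFrame. It is a sketch assuming a recent pandas release; a set_columns method does not exist and is not used here, and the data is invented.

```python
import pandas as pd

# Hypothetical example data, not taken from the thread.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# axis=0 replaces the row labels (the index).
relabelled_rows = df.set_axis(["x", "y", "z"], axis=0)

# axis=1 replaces the column labels.
relabelled_cols = df.set_axis(["left", "right"], axis=1)

print(list(relabelled_rows.index), list(relabelled_rows.columns))
print(list(relabelled_cols.index), list(relabelled_cols.columns))
```

Neither call accepts column keys, which is the gap between set_axis(axis=0) and set_index that the discussion keeps returning to.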
What is definitely not consistent is that we have set_index but not set_columns.
I'd argue (both above and in the OP), that this method should be added.
TLDR: An argument for option 4 would have to demonstrate a much greater user benefit for removing such fundamental functionality as using arrays in set_index. So far, the only reason given was the (IMO much weaker argument) that list_of_scalars is ambiguous, which could be solved with much less impact by deprecating list_of_scalars_as_array (i.e. option 3).
This is coming out of a discussion that has stalled #22225 (which is about adding .set_index to Series, see #21684). The discussion has shifted away from what capabilities a putative Series.set_index should have, to what capabilities df.set_index has currently.
The main issue (for @jreback) is that df.set_index takes arrays:
Further on:
I don't think I am confusing them. If I want to set the .index-attribute of a Series/DataFrame, then using .set_index is the most reasonable name by far. If anything, set_axis should be a superset of set_index (and a putative set_columns), that just switches between the two based on the axis-kwarg.
More than that, the current capabilities of df.set_index are a proper superset of df.set_axis(axis=0)*, in that it's possible to fill keys with only Series/Index/ndarray/list etc.:
* there is one caveat, in that lists (and only lists; out of all containers) need to be wrapped in another list, i.e. df.set_index([[0, 8, 3, 0]]) instead of df.set_index([0, 8, 3, 0]). This is the heart of the ambiguity that @jreback mentioned above (because a list is interpreted as a list of column keys).
Summing up:
- set_index is the most natural name for setting the .index-attribute.
- df.set_index should be able to process list-likes (as it currently does; this is the source of the ambiguity of the list case).
- df.set_axis should be able to do everything that df.set_index does, and just switch between operating on index/columns based on the axis-kwarg (after all, index and columns are the two axes of a DF).
- There should be a set_columns on a DataFrame.
- The axis-kwarg of set_axis should just switch between the behaviour of set_index (i.e. dealing with keys and array-likes) and set_columns.
- Series.set_index should support the same signature as df.set_index, with the exception of the drop-keyword (which only makes sense for column labels).
- For Series, the set_index and set_axis methods should be exactly the same.
Since I can't tag @pandas-dev/pandas-core, here are a few individual tags: @jreback @TomAugspurger @jorisvandenbossche @gfyoung @WillAyd @jbrockmendel @jschendel @toobaz.
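The list caveat described in the issue text above can be reproduced with a short sketch. This assumes a recent pandas release, in which a flat list of scalars is looked up as column keys and raises a KeyError when none match; the frame is invented for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, not taken from the thread.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
values = [0, 8, 3, 0]

# Wrapped in another list, the inner list is treated as array-like index values.
wrapped = df.set_index([values])

# A flat list is interpreted as a list of column keys, none of which exist here.
try:
    df.set_index(values)
    flat_list_failed = False
except KeyError:
    flat_list_failed = True

# Other containers, such as an ndarray, do not need the extra wrapping.
as_array = df.set_index(np.array(values))

print(list(wrapped.index), flat_list_failed, list(as_array.index))
```

If the frame happened to have columns named 0, 8 and 3, the flat list would silently select those columns instead, which is the ambiguity being debated.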
EDIT: Forgot to add an xref from @jreback:
In that issue, there's discussion largely around .rename, and how to make that method more consistent. Also discussed was potentially introducing .relabel, as well as .set_columns.