pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

API: capabilities of df.set_index #24046

Open h-vetinari opened 5 years ago

h-vetinari commented 5 years ago

This is coming out of a discussion that has stalled #22225 (which is about adding .set_index to Series, see #21684). The discussion has shifted away from what capabilities a putative Series.set_index should have, to what capabilities df.set_index has currently.

The main issue (for @jreback) is that df.set_index takes arrays:

@jreback: There were several attempts to have DataFrame.set_index take an array as well, but these never got off the ground.

@h-vetinari: I'm not sure when, but they certainly did get off the ground:

>>> import pandas as pd
>>> import numpy as np
>>> pd.__version__
'0.23.4'
>>>
>>> df = pd.DataFrame(np.random.randint(0, 10, (4, 4)), columns=list('abcd'))
>>> df.set_index(['a',          # label
...               df.index,     # Index
...               df.b ** 2,    # Series
...               df.b.values,  # ndarray
...               list('ABCD'), # list
...               'c'])         # label again
              b  d
a   b      c
0 0 0  2 A 1  0  2
8 1 1  4 B 4  1  4
3 2 25 5 C 8  5  5
0 3 9  7 D 2  3  7

Further on:

@jreback: @h-vetinari you are confusing the purpose of .set_axis. [...] The problem with .set_index on a DataFrame with an array is that it technically can work with an array and not keys. (meaning its not unambiguous)

I don't think I am confusing them. If I want to set the .index-attribute of a Series/DataFrame, then using .set_index is the most reasonable name by far. If anything, set_axis should be a superset of set_index (and a putative set_columns), that just switches between the two based on the axis-kwarg.

More than that, the current capabilities of df.set_index are a proper superset of df.set_axis(axis=0)**, in that it's possible to fill keys with only Series/Index/ndarray/list etc.:

>>> df.set_index(pd.Index(df.a))  # same result as Series directly below
>>> df.set_index(df.a) 
   a  b  c  d
a
0  0  0  1  2
8  8  1  4  4
3  3  5  8  5
0  0  3  2  7
>>> df.set_index(df.a.values)  # same result as list directly below
>>> df.set_index([[0, 8, 3, 0]])
   a  b  c  d
0  0  0  1  2
8  8  1  4  4
3  3  5  8  5
0  0  3  2  7

** there is one caveat, in that lists (and only lists; out of all containers) need to be wrapped in another list, i.e. df.set_index([[0, 8, 3, 0]]) instead of df.set_index([0, 8, 3, 0]). This is the heart of the ambiguity that @jreback mentioned above (because a list is interpreted as a list of column keys).
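
The caveat can be reproduced with a small sketch. The frame below is made up for illustration (its values are not from the thread), and catching KeyError for the flat-list case is an assumption based on recent pandas releases; older versions may report the problem differently.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, 8, 3, 0], "b": [0, 1, 5, 3]})

# An ndarray is taken directly as the new index values.
by_array = df.set_index(np.array([0, 8, 3, 0]))

# A list has to be wrapped in another list to be treated as values.
by_nested_list = df.set_index([[0, 8, 3, 0]])

# A flat list is read as a list of column labels; 0, 8 and 3 are not columns.
try:
    df.set_index([0, 8, 3, 0])
    flat_list_error = None
except KeyError as exc:
    flat_list_error = exc
```

Both the ndarray and the nested list leave the columns untouched and only replace the row labels.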

Summing up:

Since I can't tag @pandas-dev/pandas-core, here are a few individual tags: @jreback @TomAugspurger @jorisvandenbossche @gfyoung @WillAyd @jbrockmendel @jschendel @toobaz.

EDIT: Forgot to add an xref from @jreback:

@h-vetinari we had quite some discussion about this: #14829 and never reached resolution. This is an API question.

In that issue, there's discussion largely around .rename, and how to make that method more consistent. Also discussed was potentially introducing .relabel, as well as .set_columns.

jbrockmendel commented 5 years ago

@h-vetinari should list('ABC') in the first example be list('ABCD')? If not then I am confused in several directions.

h-vetinari commented 5 years ago

@jbrockmendel That was indeed an artefact from merging together several things from the other thread to make this issue...

h-vetinari commented 5 years ago

@jreback Any comments here or in #22225?

h-vetinari commented 5 years ago

@jreback I am honestly stunned by you closing #22225 and then locking it after I objected. So much for my motivation to work on some big PRs today.

@h-vetinari you are not listening. If you want to raise an issue or comment feel free.

I've opened this issue here for exactly this purpose (discussing your objections to existing capabilities of DataFrame.set_index) over a month ago.

toobaz commented 5 years ago

@h-vetinari While I think locking #22225 was an unnecessary move from @jreback, you have to realize that the "overruling approving reviews" thing is not a good argument to raise in such a discussion. True, in pandas we look for devs consensus, but in the end we prefer to do so by arguing rather than by any form of proper vote. So that comment was not very useful, to use a euphemism.

Back to the topic of the discussion: if I understand correctly, @jreback is arguing that df.set_index is already ambiguous enough to make it a bad idea to sponsor its use when passing anything but keys (which would be the only sensible use in Series.set_index); and at the same time, you are suggesting that

[ ] For Series, the set_index and set_axis methods should be exactly the same.

If this summary of the discussion is correct, then I think I am also against introducing Series.set_index.

And actually, I'm probably in favor of deprecating df.set_index to pass actual values, if df.set_axis is able to fulfill exactly the same task (didn't check).

I understand your argument about df.set_index being what one expects to use to set the .index (to some values)... but set_axis is simply too long established to be removed, and I find duplication in the API a worse problem than the maybe sub-optimal naming. Or in other words, I think

The axis-kwarg of set_axis should just switch between the behaviour of set_index (i.e. dealing with keys and array-likes) and set_columns.

is a bad idea. "There should be one - and preferably only one - obvious way to do it."

EDIT: By the way: sorry for not reacting before to the ping - busy period.

jorisvandenbossche commented 5 years ago

@h-vetinari In my opinion, the locking was not the best way to handle the discussion in the PR, so sorry about that. In the meantime, Jeff has unlocked the conversation there, but let's continue the discussion here.


On the topic: given the behaviour of DataFrame.set_index (supporting setting a full array-like in addition to a list of column names), I personally don't have problems with adding a similar behaviour for Series.set_index.

I'm probably in favor of deprecating df.set_index to pass actual values, if df.set_axis is able to fulfill exactly the same task .....
.... but set_axis is simply too long established to be removed, and I find duplication in the API a worse problem than the maybe sub-optimal naming

@toobaz I personally have almost never seen someone use set_axis (and never used it myself), but I do regularly see people use set_index with a full Series/array (but that's subjective of course). So personally, I would rather deprecate set_axis and only keep set_index (but: set_axis also has the ability to set the columns, which set_index cannot do, so it is not fully duplicative and therefore probably cannot easily be deprecated).

TomAugspurger commented 5 years ago

Agreed that locking was not appropriate.


On the issue itself, to me it's pretty clear that Series.set_index(sequence) is a limiting case of DataFrame.set_index(Sequence[sequence])

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"A": [1, 2, 3]})

In [3]: df.set_index([['a', 'b', 'c']])
Out[3]:
   A
a  1
b  2
c  3

Since In[3] works, I would expect that

In [4]: df.A.set_index([['a', 'b', 'c']])

work as well.
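
As a rough illustration of this "limiting case" argument, the sketch below compares the one-column DataFrame route with Series.set_axis, which already accepts a sequence of labels. It assumes a pandas version in which set_axis returns a new object by default; it does not call Series.set_index, since that method is only a proposal in this thread.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})

# DataFrame route: a nested list is treated as index values, then take column A.
via_frame = df.set_index([["a", "b", "c"]])["A"]

# Series route available without the proposed method: set_axis with labels.
via_set_axis = df["A"].set_axis(["a", "b", "c"])
```

Under these assumptions both routes give a Series with the same values and the same index labels, which is the behaviour the proposed Series.set_index would be expected to match.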

WillAyd commented 5 years ago

I don't think continuing to post here or in the associated PR is an effective use of anyone's time. Why don't we just add it as a discussion point to the next dev chat?

TomAugspurger commented 5 years ago

The implementation is fine, and deserves to go in 0.24.0 if we can agree on the desired behavior. No need to delay I think.

WillAyd commented 5 years ago

There isn't agreement on the desired behavior, which is why I suggest moving to a separate forum. I don't think it's something we need to push into 0.24 at the end here either

TomAugspurger commented 5 years ago

I really don't think this is a difficult decision though, is it? Do we want Series.set_index to accept arrays like DataFrame.set_index or not? Joris and I are I think +1, or at least ambivalent. Jeff seems to think that DataFrame.set_index doesn't accept arrays (e.g. https://github.com/pandas-dev/pandas/pull/22225#issuecomment-451758441)

DataFrame.set_index() which accepts keys (which are column names / levels). NOT an array of values.

which isn't correct, as shown in https://github.com/pandas-dev/pandas/issues/24046#issuecomment-451921548.

jorisvandenbossche commented 5 years ago

I agree that if it seems difficult to find an agreement, discussing this on the next dev chat can be more productive / effective. But, until now, there hasn't been much discussion on the PR, apart from between Jeff and h-vetinari (other people commented on the PR, but didn't really get involved in the API discussion, apart from Tom agreeing with h-vetinari). So I would first still be interested in hearing what other people think about it, as I personally don't really see much reason to object to it.

toobaz commented 5 years ago

On the issue itself, to me it's pretty clear that Series.set_index(sequence) is a limiting case of DataFrame.set_index(Sequence[sequence])

@TomAugspurger We all agree (I think) that it makes sense for df.set_index([a_list_of_labels]) to work. I think @jreback makes a good point however that there is no obvious reason (except parameter ambiguity) for df.set_index(a_list_of_labels) not to work (since df.set_index(Series(a_list_of_labels)) does), and that this causes a potential confusion that df.set_axis doesn't. Then maybe we can live with it... but let's admit this is not ideal.

One alternative (which I don't particularly like) is what (I think) .groupby(a_list) does, i.e., trying to find elements of a_list in the axis, and fallback to considering them as values otherwise.

WillAyd commented 5 years ago

I am -1 due to ambiguity. I don't know what the desired behavior of the following is:

df = pd.DataFrame(np.ones((3,3)))
df.set_index([1, 0, 2])

TomAugspurger commented 5 years ago

@toobaz can you show an example of df.set_index(a_list_of_labels) vs. [a_list_of_labels]? I don't think that #22225 is changing that at all.

@WillAyd that ambiguity exists today, and is unchanged by #22225. I don't think anyone has proposed deprecating that behavior.

jorisvandenbossche commented 5 years ago

I am -1 due to ambiguity. I don't know what the desired behavior of the following is:

That is not fully the discussion. As that is about a DataFrame, and that behaviour is already defined (it first prefers column names). The question is rather what pd.Series([0, 0, 0]).set_index([1, 0, 2]) should do, which is much less ambiguous.
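
To make the "it first prefers column names" rule concrete, here is a minimal sketch using the frame from the example above. The described outcome reflects recent pandas releases and is an assumption for other versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((3, 3)))  # integer column labels 0, 1, 2

# The flat list matches existing column labels, so it is read as keys:
# all three columns move into a 3-level MultiIndex and are dropped.
as_keys = df.set_index([1, 0, 2])

# The same numbers as an ndarray are read as values for the new index.
as_values = df.set_index(np.array([1, 0, 2]))
```

The two calls look almost identical but produce very different frames, which is the DataFrame-side ambiguity. A Series has no columns, so the first reading cannot arise there.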

Given the confusion and talking past each other, it might be good if someone attempts to make a good illustrated and complete summary of the actual discussion.

TomAugspurger commented 5 years ago

Is https://github.com/pandas-dev/pandas/issues/24046#issuecomment-451921548 a good summary? Make Series.set_index the limiting case of DataFrame.set_index? Any confusion points there?

toobaz commented 5 years ago

@toobaz can you show an example of df.set_index(a_list_of_labels) vs. [a_list_of_labels]? I don't think that #22225 is changing that at all.

No, it's not. But as already stated, if df.set_index(values_rather_than_keys) is a regrettable legacy causing ambiguity in the API, we'd rather not enhance its usage by paralleling it with Series.set_index, which would do only that (which is already done by Series.set_axis). I actually suggested deprecating it... which might not be our final decision, but is certainly related to #22225 .

TomAugspurger commented 5 years ago

@toobaz my apologies, I missed the paragraph where you suggested deprecating non-label values in DataFrame.set_index. Indeed, if we want to deprecate that then we should not go forward with #22225.

TomAugspurger commented 5 years ago

On deprecating passing values, rather than column labels, to DataFrame.set_index: I don't think we should deprecate that. While there is ambiguity, as noted in @WillAyd's example in https://github.com/pandas-dev/pandas/issues/24046#issuecomment-451991224, I think it's quite useful to pass a mix of labels and keys.

In [11]: df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

In [12]: df.set_index(["A", [1, 2, 3]])
Out[12]:
     B
A
1 1  4
2 2  5
3 3  6

Without that, I think you'd have some ugly

In [18]: df.set_axis(pd.MultiIndex.from_arrays([df.A, [1, 2, 3]]), inplace=False).drop(['A'], axis=1)
Out[18]:
     B
A
1 1  4
2 2  5
3 3  6

jreback commented 5 years ago

you could just raise / warn on ambiguity

jreback commented 5 years ago

this needs coupling with possibly deprecating set_axis as well because passing values is not documented in any way

jorisvandenbossche commented 5 years ago

this needs coupling with possibly deprecating set_axis as well

I would personally be happy to get rid of set_axis, but I think the main problem there is that we don't have an alternative for df.set_axis(['a', 'b'], axis=1) (setting column names) ?

toobaz commented 5 years ago

this needs coupling with possibly deprecating set_axis as well

Deprecating the one method that works as expected?!

I would personally be happy to get rid of set_axis, but I think the main problem there is that we don't have an alternative for df.set_axis(['a', 'b'], axis=1) (setting column names) ?

The problem is also for axis=0... (unless you know about the nested list trick)

toobaz commented 5 years ago

I think it's quite useful to pass a mix of labels and keys.

I would never do this in my code (I would rather add a column, use set_index(append=True) and possibly reorder levels). But sure, in this case set_index is strictly more powerful...

h-vetinari commented 5 years ago

@toobaz: True, in pandas we look for devs consensus, but in the end we prefer to do so by arguing rather than by any form of proper vote. [my emphasis]

That was exactly my point - locking an on-going discussion is not an argument. But enough of that, thanks for reverting the lock @jreback and the other voices who expressed support.


The other comments are quite divergent, so I won't address them one-by-one.

@jorisvandenbossche: Given the confusion and talking past each other, it might be good if someone attempts to make a good illustrated and complete summary of the actual discussion.

I attempted as much in the OP (under the assumption that set_axis would not be deprecated), but I'll try again, with the following list of statements (used as abbreviations in the table below):

  1. Series.set_index should exist, and accept array-likes
  2. DataFrame.set_index should accept array-likes
  3. existing ambiguity of DF.set_index(list_of_scalars) must be solved (orthogonal to #22225)
    i. issue warning in case of ambiguity
  4. DataFrame.set_index should allow a mix of array-likes and column keys
  5. .set_axis could be deprecated in favour of set_index
    i. DataFrame.set_columns should be introduced for axis=1 case (see also some discussion in #14829)
  6. set_axis could simply switch between set_index and set_columns depending on axis-kwarg

The following table is my best attempt at summarizing the existing positions. Apologies for any mistakes, and feel free to edit "your" line in this table or comment (+ / - / ? are clear, and I use ~ as indifference):

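| @... thinks that ... | 1 | 2 | 3 | 3.i | 4 | 5 | 5.i | 6 |
|---|---|---|---|---|---|---|---|---|
| @h-vetinari | + | + | ~ | + | ~/+ | ~ | + | + |
| @jorisvandenbossche | + | + | ~ | ? | ? | + | ? | ? |
| @jreback | ~? | -? | + | + | ? | ~? | ? | ? |
| @TomAugspurger | + | + | ~ | ? | + | ? | ? | ? |
| @toobaz | - | ~ | ~ | ~ | ~ | - | - | - |
| @WillAyd | - | - | + | - | - | - | - | - |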

EDIT: just saw that there was a bunch of discussion on gitter that I didn't see. The content of that is not reflected in the table above.

TomAugspurger commented 5 years ago

Thanks for the summary @h-vetinari. I think you've accurately summarized my views on 1, 2, and 3. I don't have any thoughts on changing set_axis right now.

I think it's quite useful to pass a mix of labels and keys.

I would never do this in my code

I probably wouldn't either :) But I can see it being useful.

toobaz commented 5 years ago

I probably wouldn't either :) But I can see it being useful.

By the way: this is the only reason I see for not deprecating/discouraging passing values to DataFrame.set_index... and it is a use case that Series.set_index would not cover.

h-vetinari commented 5 years ago

@jreback: this needs coupling with possibly deprecating set_axis as well because passing values is not documented in any way

It is documented (albeit scarcely) in v0.23.4 resp. master:

keys : column label or list of column labels / arrays

This parameter bullet has no description either, so that last "/ arrays" has to be counted as documentation for sure, IMO.


@toobaz: By the way: this is the only reason I see for not deprecating/discouraging passing values to DataFrame.set_index... and it is a use case that Series.set_index would not cover.

Why not make DataFrame.set_axis(..., axis=0) equal in capabilities to DataFrame.set_index? The main part is setting the index, and yes, for DataFrames one can reasonably use existing columns instead of arrays. But how would this be contradicting the fundamental purpose of set_axis? I mean, a frame has two axes, right...? Assuming someone thinks "I want to set this axis", why not let them set it with the same capabilities as set_index?

To this end, I don't agree with your point further up the thread that having set_axis(..., axis=0) == set_index [and set_axis(..., axis=1) == set_columns] would violate the zen of python. Each of those methods has a clear and distinct purpose, but set_index is more specialised than set_axis (which is why one could easily dispatch from the latter to the former).

toobaz commented 5 years ago

Why not make DataFrame.set_axis(..., axis=0) equal in capabilities to DataFrame.set_index?

OK, I think we're (I'm, at least) getting lost... so forget for a moment the current API, assume we start from scratch. We want the following:

  1. set DataFrame.index from single column key
  2. set DataFrame.index from sequence of values
  3. set DataFrame.index from sequence of column keys
  4. set DataFrame.index from sequence of sequences of values
  5. mix 3. and 4.

All of these are good. Problem: a list of scalars is both a valid sequence of values (2.) and a potentially valid sequence of column keys (3.). There are two ways around the ambiguity: either (my preferred choice) we split functionalities 1. and 3. vs. 2. and 4. in two different methods, discarding 5., or we make an arbitrary choice and document it.

Now, back to the current API. We have set_index that does all of this, we have the ambiguity between 2. and 3., resolved in favor of 3. I don't like this, but other devs suggest it's not so problematic, that 5. is worth keeping, and certainly we can at least document everything better.

Then we have a method which does cleanly precisely what it is expected to: set_axis (it does 2. and 4.). Can we please at least agree on not making it ambiguous too, or deprecating it? Then we can continue the discussion on set_index ;-)

h-vetinari commented 5 years ago

@toobaz: Now, back to the current API. We have set_index that does all of this, we have the ambiguity between 2. and 3., resolved in favor of 3. I don't like this, but other devs suggest it's not so problematic, that 5. is worth keeping, and certainly we can at least document everything better.

I think it's not so problematic if documented well. As you mentioned further up, groupby does similar gymnastics. I forgot to add that option to the table above.

Then we have a method which does cleanly precisely what it is expected to: set_axis (it does 2. and 4.). Can we please at least agree on not making it ambiguous too, or deprecating it? Then we can continue the discussion on set_index ;-)

It's not central to the discussion here, but ideally, I'd like to solve the ambiguity and make set_axis have the same capabilities. Because setting an index IS setting an axis (and if you want to distinguish 1./3. from 2./4. by method, then it should be with a name that supports this distinction). Actually, my ordered preference for set_axis would be: dispatch to set_index > deprecate set_axis entirely > keep as-is


I do think it's worth considering to keep 5, but I can't see a good way to get rid of the ambiguity while keeping it. If we're trying to get rid of the ambiguity first, then I think a good solution could be to add an arrays-kwarg. The name "keys" is not an accurate description of what's happening anyway:

Docstring could look like:

def set_index(keys=None, arrays=None, drop=True, append=False, inplace=False,
              verify_integrity=False):
    """
    Set the index (row labels) using one or more given labels or arrays.

    keys : column label or list of column labels
    arrays : array-like or list of array-likes
    drop : ...
    [...]
    """
    # Parentheses are required: `^` binds tighter than `is`, so the
    # unparenthesised form would evaluate `None ^ arrays` first.
    if (keys is None) == (arrays is None):
        raise ValueError('must pass exactly one of `keys`/`arrays`')
    # deprecate passing arrays as keys

jreback commented 5 years ago

When I evaluate an API change I try to achieve multiple objectives:

1) simple mental model
2) consistency

The most prevalent use case of DataFrame.set_index is to set keys (from columns). This is a simple model and works well. Array values are barely documented and have a very limited set of test cases, thus they are a rarely used case.

Adding Series.set_index would shift the primary use case to setting with values. This IMHO violates consistency (2), because a DataFrame would then have a different primary use case. It violates the mental model (1), as we would have two methods that do exactly the same thing for Series (the other being .set_axis). It also violates (2) in that you still have to know whether you are working with a Series or a DataFrame, which I gather is the impetus for having Series.set_index in the first place.

I don't see a good way of reconciling this and thus don't see a good case for adding Series.set_index. (I had suggested a possible way forward of deprecating Series.set_axis to make room for Series.set_index, which would fix my objection but still leave us with inconsistencies between DataFrame and Series in the primary use case.) Adding keyword args to .set_index is not a solution, as this makes things more complicated.

As @toobaz notes above, if we were to start fresh we might do it differently, but back-compat here is paramount as it's not very easy to change existing behavior.

Finally I would consider removing support for allowing arrays in DataFrame.set_index. For the small use cases where this is actually useful there are quite a lot of good solutions that did not exist when pandas first existed, e.g.

DataFrame.assign(new_col=.....).set_index(['A', 'new_col']) is quite idiomatic.
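
A small sketch of how this idiom compares with passing a mix of a column label and a list of values. The frame and the name new_col are illustrative, and the behaviour shown is that of recent pandas releases.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Mixed form: a column label plus a list of values in one call.
mixed = df.set_index(["A", [1, 2, 3]])

# Idiom from the comment: add the values as a column first, then use keys only.
via_assign = df.assign(new_col=[1, 2, 3]).set_index(["A", "new_col"])
```

Both results carry the same index entries and the same remaining column. The visible difference is that the second level is unnamed in the mixed form and named new_col in the assign form.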

TomAugspurger commented 5 years ago

How do we know that setting arrays with DataFrame.set_index isn't common?

Would you say that, from a consistency point of view, accepting an array for Series.set_index is more consistent than not?

h-vetinari commented 5 years ago

@jreback Thanks for taking the time for a comprehensive answer.

The most prevalent use case of DataFrame.set_index is to set keys (from columns). This is a simple model and works well. Array values are barely documented [...]

I agree with @TomAugspurger that we can't know the prevalence a-priori. df.set_index(df2.index) is simply too intuitive, and just works (assuming the same length), without having a look in the docs at all. Same for df.set_index([df2.index, df3.some_column]).

[...] and have a very limited set of test cases, thus they are a rarely used case.

It is extensively tested, already before #22236 (but even more extensively since that PR) and #22486. In particular, all list-likes (except sets) are tested, and all combinations between list-likes, also crossed with column labels (i.e. case 5. of @toobaz' list above).

As @toobaz notes above, if we were to start fresh we might do it differently, but back-compat here is paramount as it's not very easy to change existing behavior.

Of course I'm thinking of back-compat as well. How about the suggestion above to turn keys and arrays into kwargs? This would have to deprecate the mixed case, but would remove the ambiguity at least.
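
A minimal sketch of what separating the two inputs could look like from the outside, written as a free function on top of the existing public API. The helper name set_index_explicit is hypothetical and not part of pandas; it handles only a single array-like for arrays, and it assumes a pandas version where set_axis returns a new object by default.

```python
import numpy as np
import pandas as pd


def set_index_explicit(df, keys=None, arrays=None):
    """Hypothetical helper: exactly one of `keys` / `arrays` must be given."""
    if (keys is None) == (arrays is None):
        raise ValueError("must pass exactly one of `keys`/`arrays`")
    if keys is not None:
        # Always interpreted as column labels.
        labels = keys if isinstance(keys, list) else [keys]
        return df.set_index(labels)
    # Always interpreted as values, even when given as a flat list.
    return df.set_axis(pd.Index(arrays), axis=0)


df = pd.DataFrame(np.ones((3, 3)))  # integer column labels 0, 1, 2
by_keys = set_index_explicit(df, keys=[1, 0, 2])
by_arrays = set_index_explicit(df, arrays=[1, 0, 2])
```

With the keyword deciding the interpretation, the same flat list [1, 0, 2] is no longer ambiguous, at the cost of not supporting the mixed case.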

Finally I would consider removing support for allowing arrays in DataFrame.set_index. For the small use cases where this is actually useful there are quite a lot of good solutions that did not exist when pandas first existed, e.g. DataFrame.assign(new_col=.....).set_index(['A', 'new_col']) is quite idiomatic.

Won't that cause extra copying in assign? In any case, deprecating mixes of keys and arrays might be a good idea, but I strongly believe that arrays should not be deprecated (see above).

jreback commented 5 years ago

How do we know that setting arrays with DataFrame.set_index isn't common?

easy enough, it's barely documented, and I have never seen a bug report about this.

Would you say that, from a consistency point of view, accepting an array for Series.set_index is more consistent than not?

no, this is very confusing because it is different than the primary usecase for DataFrame.set_index.

@h-vetinari

I agree with @TomAugspurger that we can't know the prevalence a-priori. df.set_index(df2.index) is simply too intuitive, and just works (assuming the same length), without having a look in the docs at all. Same for df.set_index([df2.index, df3.some_column]).

Show me where this is in the documentation

Of course I'm thinking of back-compat as well. How about the suggestion above to turn keys and arrays into kwargs? This would have to deprecate the mixed case, but would remove the ambiguity at least.

no this is strictly worse and more confusing.

Won't that cause extra copying in assign? In any case, deprecating mixes of keys and arrays might be a good idea, but I strongly believe that arrays should not be deprecated (see above).

I am not concerned about performance; being consistent is paramount. I am not strongly for deprecating, though I think it would be better, because as I say we already have a way to do this, .set_axis

h-vetinari commented 5 years ago

@jreback: Show me where this is in the documentation

0.12 - 0.20.3 (very likely even further back; 0.12 is the oldest one online):

>>> indexed_df = df.set_index(['A', 'B'])
>>> indexed_df2 = df.set_index(['A', [0, 1, 2, 0, 1, 2]])
>>> indexed_df3 = df.set_index([[0, 1, 2, 0, 1, 2]])

0.21.0 - master (last example):

Create a multi-index using a set of values and a column:

>>> df.set_index([[1, 2, 3, 4], 'year'])
         month  sale
   year
1  2012  1      55
2  2014  4      40
3  2013  7      84
4  2014  10     31

This makes me think that even @toobaz' case 5 might be too entrenched to deprecate, but what could also be done is just deprecating list-likes that are not array-like (like for str.cat in #22264). I think this is my favourite solution so far, because it solves the ambiguity while maintaining essentially all functionality (i.e. passing keys / ndarray / Series / Index, or a list-like combination thereof).

Further options would be to just clearly document the current "list-gets-interpreted-as-keys", or add a groupby-like fall-back that a list without any keys will be used as an array. In all three cases, there is no conflict with Series.set_index supporting arrays.

@jreback: [...] I have never seen a bug report about this.

That could just as well be because it always either works or gives a clear error that the length does not match. (also see above: implemented since forever)

>>> pd.DataFrame(np.eye(2)).set_index(pd.Series([1, 2, 3]))
[...]
ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements

h-vetinari commented 5 years ago

Let me extract that last idea a bit, lest it get lost in the larger post above:

How about deprecating list-likes that are not array-like? This would:

@jreback @jorisvandenbossche @TomAugspurger @toobaz @WillAyd

h-vetinari commented 5 years ago

An implementation of that last suggestion is in #24697, to better be able to discuss this.

toobaz commented 5 years ago

How about deprecating list-likes that are not array-like? This would:

I already replied in #24697 ... but I'm afraid more in general your proposal doesn't solve the ambiguity nicely. If in "list-likes" you include lists, then

df.set_index([['a', 'b']])

now doesn't mean anything... which is (I guess) worse than it meaning "take these values". If in "list-likes" you exclude lists (which, if I understand correctly, is what you do in #24697), then the ambiguity is unaffected.

h-vetinari commented 5 years ago

@toobaz: If in "list-likes" you include lists, then df.set_index([['a', 'b']]) now doesn't mean anything

That is indeed the idea (and how it is implemented in #24697). I'd say it's unclear what df.set_index([['a', 'b']]) is even supposed to mean (for people who don't know the nested-list trick), and the double braces are easy to overlook.

Lists, like tuples, are used for too many things IMO (causing ambiguities, "magic" methods, and code complexity). I'd also postulate that outside of toy examples, they're hardly used as replacements for arrays either. In most real-world cases, people already have Series/Index/arrays that they're juggling.

We've had a similar deprecation in #22264 for str.cat, because it will make the code much easier to maintain, and does not sacrifice any essential functionality (it's very easy to wrap a correctly-sized list into an array/Series etc.).
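
As a rough illustration of the wrapping mentioned above (a sketch with made-up data, not taken from #24697), a correctly-sized list can be turned into an Index, ndarray or Series before being passed in, which leaves no doubt that it is meant as values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
labels = ["x", "y"]  # plain list with the same length as df

# Each wrapper marks the list as values rather than as column keys.
via_index = df.set_index(pd.Index(labels))
via_array = df.set_index(np.asarray(labels))
via_series = df.set_index(pd.Series(labels))

print(list(via_index.index))
```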

h-vetinari commented 5 years ago

After not getting a response in #24697, I'd like to ask again how to proceed here. Deprecating lists within lists would have IMO been the most elegant solution, and was the only complaint against adding Series.set_index. Now that the decision has been taken to live with this ambiguity, can we please reopen #22225?

ghost commented 4 years ago

I waded into this before learning how contentious it is. I would like to propose the following compromise:

  1. Add Series.set_index with behavior matching current DataFrame.set_index with respect to arrays. List of lists not allowed.
  2. Immediately deprecate this case in both DataFrame/Series, and issue a deprecation warning when arrays are used, telling users to use set_axis instead.
  3. Keep set_index for column keys forever. Keep set_axis for setting labels forever.
  4. After a deprecation cycle, remove the functionality from set_index.

The benefits are:

  1. Resolves the issue.
  2. Reduces inconsistency between Frame/Series and offers a clear transition path to separating the two cases, while guiding users toward the recommended alternative.
  3. After a deprecation cycle, leads to a state which satisfies everyone: no ambiguity, and both use cases made easy for users.

If set_axis is frowned upon, the deprecation would guide users towards whatever method (relabel, relabel_axis, rename, whatever) should be the final resting place of this functionality. There's not much cost to making Series compatible with DataFrame's current behavior in this, and then taking steps to (finally) remove it for both.

It is vitally important to have a method-chaining friendly way to set a Series index. I'd like to help make that happen. Note that set_axis is not very well known currently, and appears in the documentation only in the API reference section. If acceptable, I can fix that as well.
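
To make the method-chaining point concrete, here is a small sketch using the existing Series.set_axis, assuming pandas 1.0 or later (where set_axis returns a new object by default); the data and the name "values" are made up for illustration:

```python
import pandas as pd

s = pd.Series([3, 1, 2])

# Attribute assignment mutates s in place and cannot be part of a chain:
#     s.index = ["c", "a", "b"]

# set_axis returns a relabelled copy, so further calls can follow directly.
result = (
    s.set_axis(["c", "a", "b"])
     .sort_index()
     .rename("values")
)

print(result)
```

The original s keeps its default integer index.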

h-vetinari commented 4 years ago

@pilkibun: I waded into this before learning how contentious it is.

Haha, me too. ;-)

Actually, a lot happened after #22225; in particular, the discussions in #24046 and #24697 are pertinent. The principal bone of contention (from my POV) was how to deal with the ambiguity of list_of_scalars being interpreted as an array-like vs. as a list_of_column_keys. Personally, I would have deprecated list_of_scalars_as_an_array, but that ultimately got rejected after non-public discussions of the core-devs.

In any case, the issue remains a sorely lacking feature IMO, as users are (and should be) discouraged from directly manipulating attributes (e.g. s.index = <index>), and Series.set_index is doubtlessly the most intuitive method for setting the index (and also missing as an analogue for DF.set_index). I do not think that set_axis is an adequate replacement.

So far, all mentioned proposals boil down to the following:

  1. do nothing
  2. add Series.set_index as in #22225 but live with the ambiguity of list_of_scalars (resp. resolve that ambiguity towards list_of_column_keys consistently)
  3. deprecate list_of_scalars_as_an_array as in #24697 and then add Series.set_index
  4. deprecate using set_index with arrays, and point to set_axis instead [if I understand you correctly]
  5. have different kwargs for keys and arrays in the signature of set_index

To me, only 2, 3 & 5 are reasonable solutions vis-à-vis user-friendliness.

EDIT: added a 5th option, which I had forgotten
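
To give option 5 a concrete shape, here is a purely hypothetical sketch: set_index_explicit, and its keys/arrays parameters, are invented for illustration and are not part of pandas or of any of the linked PRs. It only wraps the existing DataFrame.set_index:

```python
import pandas as pd


def set_index_explicit(df, keys=None, arrays=None):
    """Hypothetical helper: 'keys' are column labels, 'arrays' are values."""
    keys = [] if keys is None else list(keys)
    missing = [k for k in keys if k not in df.columns]
    if missing:
        raise KeyError(f"not column labels: {missing}")
    # Wrapping each array in pd.Index means a plain list is never re-read as keys.
    wrapped = [pd.Index(a) for a in ([] if arrays is None else arrays)]
    return df.set_index(keys + wrapped)


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

mixed = set_index_explicit(df, keys=["a"], arrays=[["x", "y"]])
values_only = set_index_explicit(df, arrays=[["a", "b"]])

print(mixed)
print(values_only)
```

With separate arguments, a list such as ['a', 'b'] is never ambiguous: its meaning follows from which parameter it was passed to.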

toobaz commented 4 years ago

I'm still in favor of 4, but I admit I got totally lost in the discussion(s)

WillAyd commented 4 years ago

Yes I think option 4 as well

ghost commented 4 years ago

What I'm suggesting is, for now, to ensure as much parity between DataFrame and Series set_index. If a decision is later made to move some functionality elsewhere, it's no more difficult to deprecate both simultaneously. And if no agreement can be reached on that, I suggest it's acceptable to have the same functionality in both.

Note that set_index(array, append=True) and set_index([array, array]) both create MultiIndexes, which is useful and not currently possible with set_axis. We could add that later of course as prep for splitting it off. But there will still be a deprecation cycle, and in the meantime it's not much worse having arrays in DataFrame.set_index and in Series.set_index.
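
A short sketch of those two DataFrame.set_index calls, with made-up data (the column name 'v' and the arrays are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"v": [10, 20, 30]})
letters = np.array(["x", "y", "z"])
numbers = np.array([7, 8, 9])

# append=True keeps the existing index and adds the array as a second level.
appended = df.set_index(letters, append=True)

# A list of arrays builds one index level per array.
stacked = df.set_index([letters, numbers])

print(appended.index.nlevels, stacked.index.nlevels)
```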

I have put together #27504 that closely follows DataFrame.set_index and I'd like your support for it. +1/-1 for this suggested plan of action, please?

toobaz commented 4 years ago

If a decision is later made to move some functionality elsewhere, it's no more difficult to deprecate both simultaneously

@pilkibun feel free to tell me if I misunderstood, but we certainly won't allow a feature which we plan to (probably) later deprecate

h-vetinari commented 4 years ago

@toobaz: I'm still in favor of 4, but I admit I got totally lost in the discussion(s)

@WillAyd: Yes I think option 4 as well

I think whichever future pandas version were to remove arrays for set_index would receive massive pushback - it is one of the most intuitive parts of the API, and just works.

Also, index is an axis, so not least from the point of consistency, I'd say that set_axis should actually have the same capabilities as set_index (with the only difference being an axis-switch). It would be confusing (i.e. opposite of intuitive) to remember which of those extremely similar methods does what, if option 4 were pursued.

toobaz commented 4 years ago

I think whichever future pandas version were to remove arrays for set_index would receive massive pushback - it is one of the most intuitive parts of the API, and just works.

It's trivial to replace with set_axis, which also just works. The difference (again, if I remember correctly) is just that you cannot mix labels and arrays, but I really don't expect this to be a frequent use case.

In any case, I've never seen set_axis used with arrays in real code. I won't say it never happens, but I definitely think it is a marginal use case.

Also, index is an axis, so not least from the point of consistency, I'd say that set_axis should actually have the same capabilities as set_index (with the only difference being an axis-switch).

"consistency" is not to have duplicated methods with different names. The day that set_axis was really just a subset of set_index, we would probably just deprecate it entirely ;-)

What is definitely not consistent is that we have set_index but not set_columns. Not a huge problem, but another reason why we maybe don't want to sponsor set_index over set_axis.
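
For reference, a minimal sketch of set_axis handling both axes, assuming pandas 1.0 or later (where it returns a new object by default); the frame and labels are made up:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# axis=0 relabels the rows, axis=1 relabels the columns.
rows = df.set_axis(["r0", "r1"], axis=0)
cols = df.set_axis(["x", "y"], axis=1)

print(rows)
print(cols)
```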

h-vetinari commented 4 years ago

@toobaz, you're still mixing up a few things (or maybe I need to write more clearly...? or you mistyped...?)

In any case, I've never seen set_axis used with arrays in real code. I won't say it never happens, but I definitely think it is a marginal use case.

The point is using arrays with set_INDEX, and keeping that use, which is certainly not marginal. Not least because it is, by a country mile, the first thing people turn to in order to set the index (which is very reasonable with an array of the same length as the Series/DF).

It's trivial to replace with set_axis, which also just works. The difference (again, if I remember correctly) is just that you cannot mix labels and arrays, but I really don't expect this to be a frequent use case.

Trivial, but harmful, because it would be removing the most obvious method for the job (sacrificing this usability should require something that's demonstrably better, and I'd challenge that option 4 satisfies that).

Mixing column labels and arrays might actually be a marginal use case, but it's been documented since before v0.12, so I'll reserve judgement. It would have much less impact to deprecate this aspect only, but that does not solve the ambiguity of list_of_scalars.

"consistency" is not to have duplicated methods with different names. The day that set_axis was really just a subset of set_index, we would probably just deprecate it entirely ;-)

It would be a superset (as in, encompassing both set_index and a putative set_columns, and could just be a switch between those). Although set_axis would be the first of those three that I'd get rid of, if necessary, I think that setting an index/column/axis are all eminently obvious and useful operations, that deserve their own method. And since they perform the exact same task, they should also have the same interface (as far as reasonably possible; i.e. no column keys for Series).

What is definitely not consistent is that we have set_index but not set_columns.

I'd argue (both above and in the OP), that this method should be added.


TLDR: An argument for option 4 would have to demonstrate a much greater user benefit for removing such fundamental functionality as using arrays in set_index. So far, the only reason given was the (IMO much weaker argument) that list_of_scalars is ambiguous, which could be solved with much less impact by deprecating list_of_scalars_as_array (i.e. option 3).