Open jorisvandenbossche opened 9 years ago
@jorisvandenbossche this is a really nice summary.
I think in general we can move []/.ix closer (maybe can get identical), so as not to have any confusion. (of course we may have to eliminate fallback which is not a bad thing anyhow).
I suppose we should prepare any changes for 0.17.0 as these will technically be API changes.
xref #7501 , #8976, #7187
xref #9213, CC @hugadams @dandavison
@jorisvandenbossche Indeed, this is a nice summary of current behavior. Thanks!
I think we should consider radical API changes for __getitem__ if we want pandas to have a lasting influence.
My two cents on indexing is that "fallback indexing" is a really bad idea. It starts with the best of intentions, but leads to special cases such as distinctions between integer and float indexes (e.g., see #9213). In the face of ambiguity, refuse the temptation to guess.
So if I were reinventing indexing rules from scratch, I would consider something like this (for DataFrame):
That's it. Two simple rules that probably cover 90% of existing uses of __getitem__, at least the only ones that I could ever keep straight (string column labels and boolean arrays). Importantly, indexing would never depend on the type of the index and there would be no reindexing/NaN-filling behavior. We could also eliminate the need for .iloc as a separate indexer entirely.
This sort of change would require a serious deprecation cycle or perhaps need to wait until pandas 1.0 (likely both), but something needs to change. The fact that even pandas developers need to run extensive experiments to figure out how __getitem__ works indicates just how wrong things are. Indexing should be simple enough that its behavior can be relied on in production code. The current state of indexing is, frankly, embarrassing.
@jorisvandenbossche Did you ever figure out how __setitem__ works? :)
@shoyer nope :-) I would suspect it is largely the same, but you never know ... Will try to look at it next week
I wanted to add this here since it is somewhat related to "String parsing for a datetime index does not seem to work" mentioned above and I have not seen it come up anywhere else. For a MultiIndex, string parsing for a datetime index with a scalar does not result in dropping the MultiIndex level.
In [2]: dfm = pd.DataFrame([1, 2, 3], index=pd.MultiIndex.from_arrays([pd.date_range("2015-01-01", "2015-01-03"), ['A', 'A', 'B']]))
In [3]: dfm.loc["2015-01-01"]
Out[3]:
              0
2015-01-01 A  1
In [4]: dfm.loc[pd.Timestamp("2015-01-01")]
Out[4]:
   0
A  1
this seems like somewhat unintuitive behaviour (to me at least)
@matthewgilbert this is just how partial string indexing works, see the docs here. The first is treated as a slice, while the second is an exact match.
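The slice-versus-exact-match distinction can be seen in a smaller setting. The following is a minimal sketch that assumes a plain DatetimeIndex on a Series rather than the MultiIndex from the report above; the names `s`, `by_month`, and `exact` are illustrative only.

```python
import pandas as pd

# Three daily timestamps, two of them in January 2015.
idx = pd.to_datetime(["2015-01-01", "2015-01-02", "2015-02-01"])
s = pd.Series([1, 2, 3], index=idx)

# A string that is less precise than the index resolution (month vs. day)
# is treated as a slice, so the result is still a Series.
by_month = s.loc["2015-01"]

# A Timestamp is an exact label, so the result is a scalar.
exact = s.loc[pd.Timestamp("2015-01-01")]

print(type(by_month).__name__, len(by_month))
print(exact)
```

The string lookup keeps the index on the result because it selects a range of labels, while the Timestamp lookup consumes the label.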
I came across this, which seems related but could also be a bug in how the above interacts with CategoricalIndex. Using the same example as #15470:
pandas 0.20.3
s = pd.Series([2, 1, 0], index=pd.CategoricalIndex([2, 1, 0]))
s[2] # works (interpreting as label)
s.loc[2] # fails with TypeError: cannot do label indexing on <class 'pandas.core.indexes.category.CategoricalIndex'> with these indexers [2] of <class 'int'>
# of course the below works!
s = pd.Series([2, 1, 0], index=[2, 1, 0])
s[2] # works (interpreting as label)
s.loc[2] # works (interpreting as label)
@aavanian that looks like a bug. Could you open a separate issue for it?
Sure, done in #17569
If I were to rebuild pandas, I would make indexing as simple as possible and only use .loc and .iloc. I would not implement __getitem__. There would be no ambiguity. I also wouldn't allow attribute access to columns. It would be a pain to select a single column with df.loc[:, 'col'], but pandas really needs to focus on being explicit.
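One way to see the cost of attribute access is a column whose name collides with an existing DataFrame attribute. This is a small sketch with made-up column names ("count" and "value"), not an example taken from the thread.

```python
import pandas as pd

# "count" collides with the DataFrame.count method; "value" does not.
df = pd.DataFrame({"count": [1, 2], "value": [3, 4]})

# Attribute access works for a non-colliding name...
value_col = df.value

# ...but for "count" it returns the bound method, not the column.
count_attr = df.count

# The explicit form always returns the column.
count_col = df.loc[:, "count"]

print(callable(count_attr))
print(count_col.tolist())
```

The explicit `.loc` form is longer, but its result does not depend on which names pandas happens to define on DataFrame.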
I just came here just for @jorisvandenbossche:
Summary for DataFrames
* It uses the 'information' axis (axis 1) for:
  * single labels
  * list of labels
* It uses the rows (axis 0) for:
  * slicing
  * boolean indexing
Thanks for the rest of the analysis! Agree it's a mess. @shoyer:
* All other indexing is position based, NumPy style. (This includes indexing with a boolean array.)
I think I disagree:
In [1]: df = pd.DataFrame(np.arange(9).reshape((3,3)), columns=list('xyz'), index=list('xyz'))
In [2]: df
Out[2]:
   x  y  z
x  0  1  2
y  3  4  5
z  6  7  8
In [3]: df['x'] # By columns
Out[3]:
x    0
y    3
z    6
Name: x, dtype: int64
In [4]: df[['x', 'y']] # By columns
Out[4]:
   x  y
x  0  1
y  3  4
z  6  7
In [5]: df['x':'y'] # By rows now!?
Out[5]:
   x  y  z
x  0  1  2
y  3  4  5
Not intuitive, and even more confusing to a beginner when you cross-reference this against the behavior of df.loc[:,<X>], which works the same as df[<X>] for the first two cases but not the third. IMO df[<X>] should be identical, or as close as possible, to df.loc[:,<X>].
In general a "[] is for cols, .loc[] is for rows" convention would be most intuitive, if [] is not dropped completely.
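The comparison with df.loc[:,<X>] can be checked directly. The sketch below reuses the x/y/z frame from the session above and only compares results; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape((3, 3)),
                  columns=list("xyz"), index=list("xyz"))

# Single label and list of labels: [] agrees with .loc on the column axis.
same_single = df["x"].equals(df.loc[:, "x"])
same_list = df[["x", "y"]].equals(df.loc[:, ["x", "y"]])

# Slice: [] selects rows, so it agrees with .loc on the row axis instead.
row_slice = df["x":"y"]
same_as_row_loc = row_slice.equals(df.loc["x":"y"])
col_slice = df.loc[:, "x":"y"]

print(same_single, same_list, same_as_row_loc)
print(row_slice.shape, col_slice.shape)
```

With a slice, [] returns two rows and all three columns, whereas the column-axis .loc slice returns all three rows and two columns.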
@sam-at-github in my suggested model, indexing like df['x':'y'] would actually trigger an exception (because strings are not valid positional indexes). You'd have to use .loc if you wanted that sort of indexing.
Oh OK, wasn't sure what you meant. I still don't think I like that much. For the second point I would prefer "everything else fails" over switching the behavior of [] from selection on col labels to index-only based selection on rows (I'm presuming you mean on rows). In my mind that doesn't address the main inconsistency: switching from a col primary op to a row primary op depending on the operand, especially in the context of the existence of .loc[], which is already for row primary stuff. Prefer anything consistent with "[] for cols, loc[] for rows".
Update: Aside, to only allow positional slicing and not "label" based is probably even more confusing since your labels can be numerical anyway:
In [8]: df = pd.DataFrame(np.arange(9).reshape((3,3)))
In [9]: df[0] # By columns
Out[9]:
0    0
1    3
2    6
Name: 0, dtype: int64
In [10]: df[[0,1]] # By columns
Out[10]:
   0  1
0  0  1
1  3  4
2  6  7
In [11]: df[0:1] # By rows now?!
Out[11]:
   0  1  2
0  0  1  2
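With the default index, positional and label-based row slicing are hard to tell apart. The following sketch assumes a non-default integer row index ([10, 20, 30], chosen only for illustration) to make the difference visible.

```python
import numpy as np
import pandas as pd

# Integer column labels 0..2, integer row labels 10, 20, 30.
df = pd.DataFrame(np.arange(9).reshape((3, 3)), index=[10, 20, 30])

col = df[0]          # integer scalar: column label 0
cols = df[[0, 1]]    # list of integers: column labels 0 and 1
rows = df[0:1]       # integer slice: rows, selected by position

# The integer slice matches .iloc (positional), not .loc (label-based).
by_position = df.iloc[0:1]
by_label = df.loc[0:1]   # no row labels fall between 0 and 1

print(list(rows.index))
print(len(by_label))
```

So the same integers are read as column labels in the scalar and list cases and as row positions in the slice case.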
Are there things we can change? (that would not be too disruptive .. maybe not?) And want change?
I'd also like to know the answer to this question.
The behavior that surprised me today was the few cases where DataFrame.__getitem__[key] does a row-based lookup rather than a column-based lookup. If deprecating any of this behavior is an option, I advocate starting with making DataFrame.__getitem__ always column-based.
I believe we have an issue for this; would be +1 on deprecation
@jbrockmendel can you first open (or search) an issue for this to have a discussion about it?
@jorisvandenbossche I'm putting together an overview of the state of the indexing code. Is the description of the API here still accurate/complete?
Hey, I'm working on a join-like API ClosestItem(left: Series, right: Series, max_distance: float) which returns [index_of_right(v) for v in left] now. I would love to return Series, please give me some suggestions on:
left and right, if they are available.
max_distance, what should I return? Currently just -1. Should I use None or anything else? Thanks
@FluorineDog that doesn't really seem related to this issue. Can you please open a new issue about it?
some examples (on Series only) in #12890
I started making an overview of the indexing semantics with http://nbviewer.ipython.org/gist/jorisvandenbossche/7889b389a21b41bc1063 (only for series/frame, not for panel)
Conclusion: it is a mess :-)
Summary for slicing
So, you can say that the behaviour is equivalent to .ix, except that the behaviour for integer labels is different for integer indexers (swapped). (For .ix, when having an integer axis, it is always label based, with no fallback to integer-location-based indexing.)
Summary for single label
Summary for indexing with list of labels
This mainly follows .ix, apart from points 2 and 3.
Summary for boolean indexing
Summary for DataFrames
This is as documented (only the boolean case is not explicitly documented, I think).
For the rest (on the chosen axis), it follows the same semantics as [] on a series, but:
Questions are here: