pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.45k stars 17.86k forks source link

BUG/API: DataFrame.iloc[:, foo] = bar inplaceness? #44353

Closed jbrockmendel closed 2 years ago

jbrockmendel commented 2 years ago

In 1.3.0 we made it so that df["A"] = foo never operates in-place (xref whatsnew, #33457) and mostly made it so that df.loc[foo, bar] = baz tries to operate in-place before falling back to casting (xref whatsnew, https://github.com/pandas-dev/pandas/issues/33457#issuecomment-660126628)

There are some remaining cases where .iloc and .loc do not try to set inplace. This issue is to make sure we have consensus about changing/deprecating that behavior. Most of the impacted test cases involve doing an astype, e.g.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4).reshape(2, 2), dtype=object)
df.iloc[:, 0] = df.iloc[:, 0].astype(np.float64)

ATM this changes df[0] to float64, i.e. is not inplace.

In this case, the user could/should do df[0] = df[0].astype(np.float64). But that approach runs into problems with either non-unique columns or setting a slice of columns df.iloc[:, :2] = foo

My thoughts are that 1) we should be consistent about inplace-ness, not special-case the API and 2) the use case here is convenient.

Thoughts?

update Cataloguing issues/PRs where this behavior we implemented apparently-intentionally

29393 (#25495)

6149

6159 BUG/API: df['col'] = value and df.loc[:,'col'] = value are now completely equivalent

update 2 Cataloguing issues that appear to be caused by/related to this inconsistency

20635 BUG: indexing with loc and iloc with list-likes and new dtypes do not change from object dtype

24269 DataFrame.loc/iloc fails to update object with new dtype

jbrockmendel commented 2 years ago

cc @jreback @jorisvandenbossche @TomAugspurger discussed briefly on today's call could use your inputs

jbrockmendel commented 2 years ago

gentle ping @jreback a lot of indexing work hinges on whether we want to change this

jreback commented 2 years ago

API: df['col'] = value and df.loc[:,'col'] = value are now completely equivalent

i think these should be completely equivalent always i always thought we did this

jbrockmendel commented 2 years ago

i always thought we did this

Two problems here:

1) We don't have a way for users to do this setting that is inplace. 2) it isn't actually completely equivalent. If value has the same dtype as the existing column, it'll be inplace.

jbrockmendel commented 2 years ago

xref #15631

jbrockmendel commented 2 years ago

Marking as blocker for 1.4rc; we should make a decision about deprecation before then.

jorisvandenbossche commented 2 years ago

Personally I think I would like the consistency of .iloc[row_indexer, column_indexer] always being inplace. But if we do that, we should provide an alternative to achieve the same: setting a column using a positional index (with new values, not in place, like df[label] = .. but using an integer index instead of a label).
Without the current iloc behaviour, this is not (easily) possible to do I think with our current APIs?

Also, if we change this, I think it should go with some deprecation cycle, since this was explicitly designed this way before (it's not that we are fixing a bug).

jbrockmendel commented 2 years ago

Also, if we change this, I think it should go with some deprecation cycle, since this was explicitly designed this way before (it's not that we are fixing a bug).

Yah, that's why i'm pushing to get this done for 1.4 so we can enforce in 2.0.

should provide an alternative to achieve the same: setting a column using a positional index (with new values, not in place, like df[label] = .. but using an integer index instead of a label).

Agreed. Do you have any preferences on what that API would look like? I think something like df.iloc(axis=1)[foo] = bar might work in theory (though i doubt we have good test coverage for that)

jbrockmendel commented 2 years ago

Turns out the df.iloc(axis=1)[foo]=bar doesn't work ATM bc of #45032

jreback commented 2 years ago

Personally I think I would like the consistency of .iloc[row_indexer, column_indexer] always being inplace. But if we do that, we should provide an alternative to achieve the same: setting a column using a positional index (with new values, not in place, like df[label] = .. but using an integer index instead of a label). Without the current iloc behaviour, this is not (easily) possible to do I think with our current APIs?

Also, if we change this, I think it should go with some deprecation cycle, since this was explicitly designed this way before (it's not that we are fixing a bug).

i don't think this is particular useful for integer indices, so -1 on adding yet another method to do the same thing.

jbrockmendel commented 2 years ago

i don't think this is particular useful for integer indices, so -1 on adding yet another method to do the same thing.

The trouble is that if you have non-unique columns then there is no public way of doing this

jreback commented 2 years ago

i don't think this is particular useful for integer indices, so -1 on adding yet another method to do the same thing.

The trouble is that if you have non-unique columns then there is no public way of doing this

i get it.

jreback commented 2 years ago

I mean another option here is to go back to doing real masking instead of setitem with .loc[:, columns] (and .iloc)

but i think that creates more issues

jbrockmendel commented 2 years ago

I mean another option here is to go back to doing real masking instead of setitem with .loc[:, columns] (and .iloc)

I'm not clear on what you're suggesting. In the status quo df.loc[:, foo] = bar never tries to operate in-place, is identical to df[foo] = bar. The proposal is that we should try to operate in-place, just like we would for df.loc[:-1, foo] = bar[:-1].

Under the status quo, there is no way to do a try-to-operate-in-place setting of an entire column. Under the proposed change, there is no way to do a dont-try-to-operate-in-place positional setting of an entire column.

jreback commented 2 years ago

I'm not clear on what you're suggesting. In the status quo df.loc[:, foo] = bar never tries to operate in-place, is identical to df[foo] = bar. The proposal is that we should try to operate in-place, just like we would for df.loc[:-1, foo] = bar[:-1].

we used to do exactly this and i changed it a long time ago because it was very confusing

jbrockmendel commented 2 years ago

we used to do exactly this and i changed it a long time ago because it was very confusing

how is ".iloc[foo, bar] = baz always tries inplace" more confusing than "that, except for .iloc[:, bar] = baz which is almost never inplace, the exception being when baz has the same dtype as the existing column (but uh, not for ArrayManager or (some) ExtensionDtypes!)"