Closed jbrockmendel closed 2 years ago
cc @jreback @jorisvandenbossche @TomAugspurger discussed briefly on today's call could use your inputs
gentle ping @jreback a lot of indexing work hinges on whether we want to change this
API: df['col'] = value and df.loc[:,'col'] = value are now completely equivalent
i think these should be completely equivalent always i always thought we did this
i always thought we did this
Two problems here:
1) We don't have a way for users to do this setting that is inplace.
2) it isn't actually completely equivalent. If value
has the same dtype as the existing column, it'll be inplace.
xref #15631
Marking as blocker for 1.4rc; we should make a decision about deprecation before then.
Personally I think I would like the consistency of .iloc[row_indexer, column_indexer]
always being inplace. But if we do that, we should provide an alternative to achieve the same: setting a column using a positional index (with new values, not in place, like df[label] = ..
but using an integer index instead of a label).
Without the current iloc
behaviour, this is not (easily) possible to do I think with our current APIs?
Also, if we change this, I think it should go with some deprecation cycle, since this was explicitly designed this way before (it's not that we are fixing a bug).
Also, if we change this, I think it should go with some deprecation cycle, since this was explicitly designed this way before (it's not that we are fixing a bug).
Yah, that's why i'm pushing to get this done for 1.4 so we can enforce in 2.0.
should provide an alternative to achieve the same: setting a column using a positional index (with new values, not in place, like df[label] = .. but using an integer index instead of a label).
Agreed. Do you have any preferences on what that API would look like? I think something like df.iloc(axis=1)[foo] = bar
might work in theory (though i doubt we have good test coverage for that)
Turns out the df.iloc(axis=1)[foo]=bar
doesn't work ATM bc of #45032
Personally I think I would like the consistency of
.iloc[row_indexer, column_indexer]
always being inplace. But if we do that, we should provide an alternative to achieve the same: setting a column using a positional index (with new values, not in place, likedf[label] = ..
but using an integer index instead of a label). Without the currentiloc
behaviour, this is not (easily) possible to do I think with our current APIs?Also, if we change this, I think it should go with some deprecation cycle, since this was explicitly designed this way before (it's not that we are fixing a bug).
i don't think this is particular useful for integer indices, so -1 on adding yet another method to do the same thing.
i don't think this is particular useful for integer indices, so -1 on adding yet another method to do the same thing.
The trouble is that if you have non-unique columns then there is no public way of doing this
i don't think this is particular useful for integer indices, so -1 on adding yet another method to do the same thing.
The trouble is that if you have non-unique columns then there is no public way of doing this
i get it.
I mean another option here is to go back to doing real masking instead of setitem with .loc[:, columns]
(and .iloc
)
but i think that creates more issues
I mean another option here is to go back to doing real masking instead of setitem with .loc[:, columns] (and .iloc)
I'm not clear on what you're suggesting. In the status quo df.loc[:, foo] = bar
never tries to operate in-place, is identical to df[foo] = bar
. The proposal is that we should try to operate in-place, just like we would for df.loc[:-1, foo] = bar[:-1]
.
Under the status quo, there is no way to do a try-to-operate-in-place setting of an entire column. Under the proposed change, there is no way to do a dont-try-to-operate-in-place positional setting of an entire column.
I'm not clear on what you're suggesting. In the status quo df.loc[:, foo] = bar never tries to operate in-place, is identical to df[foo] = bar. The proposal is that we should try to operate in-place, just like we would for df.loc[:-1, foo] = bar[:-1].
we used to do exactly this and i changed it a long time ago because it was very confusing
we used to do exactly this and i changed it a long time ago because it was very confusing
how is ".iloc[foo, bar] = baz always tries inplace" more confusing than "that, except for .iloc[:, bar] = baz which is almost never inplace, the exception being when baz has the same dtype as the existing column (but uh, not for ArrayManager or (some) ExtensionDtypes!)"
In 1.3.0 we made it so that
df["A"] = foo
never operates in-place (xref whatsnew, #33457) and mostly made it so thatdf.loc[foo, bar] = baz
tries to operate in-place before falling back to casting (xref whatsnew, https://github.com/pandas-dev/pandas/issues/33457#issuecomment-660126628)There are some remaining cases where .iloc and .loc do not try to set inplace. This issue is to make sure we have consensus about changing/deprecating that behavior. Most of the impacted test cases involve doing an astype, e.g.
ATM this changes df[0] to float64, i.e. is not inplace.
In this case, the user could/should do
df[0] = df[0].astype(np.float64)
. But that approach runs into problems with either non-unique columns or setting a slice of columnsdf.iloc[:, :2] = foo
My thoughts are that 1) we should be consistent about inplace-ness, not special-case the API and 2) the use case here is convenient.
Thoughts?
update Cataloguing issues/PRs where this behavior we implemented apparently-intentionally
29393 (#25495)
6149
6159 BUG/API: df['col'] = value and df.loc[:,'col'] = value are now completely equivalent
update 2 Cataloguing issues that appear to be caused by/related to this inconsistency
20635 BUG: indexing with loc and iloc with list-likes and new dtypes do not change from object dtype
24269 DataFrame.loc/iloc fails to update object with new dtype