Open fansichao opened 4 years ago
You're using label-based indexing using .loc. Pandas therefore does not infer that you want to replace column b with column d and column c with column e-- that would be positional logic.
Either you can use .iloc for this use case, or else rename the columns before setting:
In [6]: df.loc[df['a']=='a2', ['b','c']] = df[['d','e']].rename(columns={'d':'b', 'e':'c'})
In [7]: df
Out[7]:
a b c d e
0 a1 b1 c1 d1 e1
1 a2 d2 e2 d2 e2
Alternatively you can also use the setting using .to_numpy, see the warning box at https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#basics. In this case you need to do the alignment on the right hand side yourself:
In [9]: df.loc[df['a']=='a2', ['b','c']] = df.loc[df['a'] == 'a2', ['d','e']].to_numpy()
In [10]: df
Out[10]:
a b c d e
0 a1 b1 c1 d1 e1
1 a2 d2 e2 d2 e2
I agree with @Liam3851 the weird thing is that df.loc[df['a']=='a2', ['c']] = df['e']
works. This should also set NaN. But probably will be fixed in the future with the split_path fix @jbrockmendel ? This currently runs through _setitem_single_block
This should also set NaN.
@phofl can you elaborate on what behavior you expect?
I think id expect df.loc[df['a']=='a2', ['c']] = df['e']
to raise ValueError because of a shape mismatch.
I would have expected that _align_series
sets the value to NaN, similarly to
indexer = df['a'] == 'a2'
rhs = df[['e']]
df.loc[indexer, ['c']] = rhs
Why would you expect shape missmatch?
Why would you expect shape missmatch?
well df.loc[indexer, ["c"]]
has shape (1, 1) and rhs
has shape (2, 1) (just (2,) if we use df["e"]
instead of df[["e"]]
). To not get a shape mismatch, I'd expect to see df.loc[indexer, ["c"]] = rhs.loc[indexer]
Ah I see. My understanding of __setitem__
for loc was, that it would take care of the filtered rows (e.g. the _align_series
and _align_dataframe
are doing this?), so that the rhs is filtered for the same rows as the lhs. Or is this an accident and should change in the future?
Its entirely plausible that im wrong on this.
If instead of df.loc[indexer, ["c"]] = df["e"]
we did df.loc[indexer, ["c"]] = df["e"]._values
that does raise like I would expect. Would you expect the filtering to behave differently there?
In case of _values
None of the align functions is called, because we got a Numpy array.
I quite like the feature, that you do not have to specify the same loc condition on the rhs but it gets filtered nevertheless. You could probably construct examples where this is a disadvantage instead of an advantage. But I think we should not break this code without warning, if we decide to do this. Don't know if this was intended with the implementation or only a side effect.
@jbrockmendel Isn't the difference that in df.loc[indexer, ["c"]] = df["e"]
the rhs has an index that is compatible with the lhs (and so an align operation works to make the rhs compatible with the lhs)? In the case of df.loc[indexer, ["c"]] = df["e"]._values
you have stripped the values out, losing the index, and thus making alignment impossible.
@phofl I think we need to differentiate:
df.loc[indexer, ["c"]] = df["e"]
from
df.loc[indexer, ["c"]] = df[["e"]]
The first one, you are assigning a Series to a DataFrame. This realigns to the df's index and then broadcasts across the DataFrame with all columns selected.
For example this is totally fine and assigns columns a and b to the values of column e:
df.loc[indexer, ["a", "b"]] = df["e"]
In the second, you are assigning a DataFrame to a DataFrame. Because the columns are not compatible they get assigned as NaN.
Yep you are right. I would expect
df.loc[indexer, "c"] = df["e"]
to work.
df.loc[indexer, ["c"]] = df["e"]
seems weird.
I agree
df.loc[df['a']=='a2', ['c']] = df['e']
looks a bit weird (you'd probably spell it the first way you gave above), but do you agree that
df.loc[df['a'] == 'a2', ['a', 'b']] = df['e']
makes sense as a broadcast (broadcasting the Series across the DataFrame)? If so then
df.loc[df['a']=='a2', ['c']] = df['e']
more or less has to be supported (or at least it would be weird not to support it).
I still find it counter-intuitive that both df.loc[indexer, ["c"]] = rhs
and df.loc[indexer, ["c"]] = rhs.loc[indexer]
would behave identically. We should figure out whether this is intentional or not, and @phofl is right if it isn't intentional, it needs to be deprecated instead of removed immediately.
Yep I would agree with counter-intuitive. Just thought this is a nice feature, if it is not a bug:)
@jorisvandenbossche can you weigh on in what the intended behavior is here (no hurry)
related to #10440
Python 3.6.8 pandas==0.25.3