DataFrame.loc multiple columns replace

pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

https://pandas.pydata.org

BSD 3-Clause "New" or "Revised" License

43.26k stars 17.8k forks source link

DataFrame.loc multiple columns replace #30439

Open fansichao opened 4 years ago

fansichao commented 4 years ago

Python 3.6.8 pandas==0.25.3

#! -*- coding:utf-8 -*-
import pandas as pd
df = pd.DataFrame([
    {'a':'a1','b':'b1','c':'c1','d':'d1','e':'e1'},
    {'a':'a2','b':'b2','c':'c2','d':'d2','e':'e2'}
    ])
import copy
df2 = copy.deepcopy(df)

print(df)
#     a   b   c   d   e
# 0  a1  b1  c1  d1  e1
# 1  a2  b2  c2  d2  e2

df.loc[df['a']=='a2', ['c']] = df['e']
print(df)
#     a   b   c   d   e
# 0  a1  b1  c1  d1  e1
# 1  a2  b2  e2  d2  e2

# loc muti columns replace has some problem
df.loc[df['a']=='a2', ['b','c']] = df[['d','e']]
print(df)
#     a    b    c   d   e
# 0  a1   b1   c1  d1  e1
# 1  a2  NaN  NaN  d2  e2

Liam3851 commented 4 years ago

You're using label-based indexing using .loc. Pandas therefore does not infer that you want to replace column b with column d and column c with column e-- that would be positional logic.

Either you can use .iloc for this use case, or else rename the columns before setting:

In [6]: df.loc[df['a']=='a2', ['b','c']] = df[['d','e']].rename(columns={'d':'b', 'e':'c'})

In [7]: df
Out[7]:
    a   b   c   d   e
0  a1  b1  c1  d1  e1
1  a2  d2  e2  d2  e2

Alternatively you can also use the setting using .to_numpy, see the warning box at https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#basics. In this case you need to do the alignment on the right hand side yourself:


In [9]: df.loc[df['a']=='a2', ['b','c']] = df.loc[df['a'] == 'a2', ['d','e']].to_numpy()

In [10]: df
Out[10]:
    a   b   c   d   e
0  a1  b1  c1  d1  e1
1  a2  d2  e2  d2  e2

phofl commented 3 years ago

I agree with @Liam3851 the weird thing is that df.loc[df['a']=='a2', ['c']] = df['e'] works. This should also set NaN. But probably will be fixed in the future with the split_path fix @jbrockmendel ? This currently runs through _setitem_single_block

jbrockmendel commented 3 years ago

This should also set NaN.

@phofl can you elaborate on what behavior you expect?

I think id expect df.loc[df['a']=='a2', ['c']] = df['e'] to raise ValueError because of a shape mismatch.

phofl commented 3 years ago

I would have expected that _align_series sets the value to NaN, similarly to

indexer = df['a'] == 'a2'
rhs = df[['e']]
df.loc[indexer, ['c']] = rhs

Why would you expect shape missmatch?

jbrockmendel commented 3 years ago

Why would you expect shape missmatch?

well df.loc[indexer, ["c"]] has shape (1, 1) and rhs has shape (2, 1) (just (2,) if we use df["e"] instead of df[["e"]]). To not get a shape mismatch, I'd expect to see df.loc[indexer, ["c"]] = rhs.loc[indexer]

phofl commented 3 years ago

Ah I see. My understanding of __setitem__ for loc was, that it would take care of the filtered rows (e.g. the _align_series and _align_dataframe are doing this?), so that the rhs is filtered for the same rows as the lhs. Or is this an accident and should change in the future?

jbrockmendel commented 3 years ago

Its entirely plausible that im wrong on this.

If instead of df.loc[indexer, ["c"]] = df["e"] we did df.loc[indexer, ["c"]] = df["e"]._values that does raise like I would expect. Would you expect the filtering to behave differently there?

phofl commented 3 years ago

In case of _values None of the align functions is called, because we got a Numpy array.

I quite like the feature, that you do not have to specify the same loc condition on the rhs but it gets filtered nevertheless. You could probably construct examples where this is a disadvantage instead of an advantage. But I think we should not break this code without warning, if we decide to do this. Don't know if this was intended with the implementation or only a side effect.

Liam3851 commented 3 years ago

@jbrockmendel Isn't the difference that in df.loc[indexer, ["c"]] = df["e"] the rhs has an index that is compatible with the lhs (and so an align operation works to make the rhs compatible with the lhs)? In the case of df.loc[indexer, ["c"]] = df["e"]._values you have stripped the values out, losing the index, and thus making alignment impossible.

@phofl I think we need to differentiate: df.loc[indexer, ["c"]] = df["e"]

from df.loc[indexer, ["c"]] = df[["e"]]

The first one, you are assigning a Series to a DataFrame. This realigns to the df's index and then broadcasts across the DataFrame with all columns selected.

For example this is totally fine and assigns columns a and b to the values of column e: df.loc[indexer, ["a", "b"]] = df["e"]

In the second, you are assigning a DataFrame to a DataFrame. Because the columns are not compatible they get assigned as NaN.

phofl commented 3 years ago

Yep you are right. I would expect

df.loc[indexer, "c"] = df["e"]

to work.

df.loc[indexer, ["c"]] = df["e"]

seems weird.

Liam3851 commented 3 years ago

I agree

df.loc[df['a']=='a2', ['c']] = df['e']

looks a bit weird (you'd probably spell it the first way you gave above), but do you agree that

df.loc[df['a'] == 'a2', ['a', 'b']] = df['e']

makes sense as a broadcast (broadcasting the Series across the DataFrame)? If so then

df.loc[df['a']=='a2', ['c']] = df['e']

more or less has to be supported (or at least it would be weird not to support it).

jbrockmendel commented 3 years ago

I still find it counter-intuitive that both df.loc[indexer, ["c"]] = rhs and df.loc[indexer, ["c"]] = rhs.loc[indexer] would behave identically. We should figure out whether this is intentional or not, and @phofl is right if it isn't intentional, it needs to be deprecated instead of removed immediately.

phofl commented 3 years ago

Yep I would agree with counter-intuitive. Just thought this is a nice feature, if it is not a bug:)

jbrockmendel commented 3 years ago

@jorisvandenbossche can you weigh on in what the intended behavior is here (no hurry)

phofl commented 3 years ago

related to #10440