jekwatt / idiomatic_pandas

Tips and tricks for the most common data handling tasks with pandas.

Dealing with SettingWithCopyWarning #15

Open jekwatt opened 3 years ago

jekwatt commented 3 years ago
1. Try using `.loc[row_indexer, col_indexer] = value` instead.

    The proper way to modify `df` is to use one of the accessors `.loc[]`, `.iloc[]`, `.at[]`, or `.iat[]`:

    `mask = df["A"] > 5`
    `df.loc[mask, "B"] = 4`



2. Make a deep copy:
`df2 = df[["A"]].copy(deep=True)`

3. Change pd.options.mode.chained_assignment:
`pd.set_option("mode.chained_assignment", None)`
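
A minimal sketch of options 1 and 2 above, assuming a small hypothetical frame with integer columns `A` and `B` (the sample data is made up for illustration and is not from the original issue):

```python
import pandas as pd

# Hypothetical sample data, assumed for illustration only
df = pd.DataFrame({"A": [1, 6, 9], "B": [10, 20, 30]})

# Option 1: build the mask first, then assign through a single .loc call
mask = df["A"] > 5
df.loc[mask, "B"] = 4

# Option 2: take an explicit deep copy of the subset before modifying it
df2 = df[["A"]].copy(deep=True)
df2.loc[df2["A"] > 5, "A"] = 0  # changes df2 only; df["A"] stays as it was
```

Since `df2` is an independent copy, later edits to it should neither trigger the warning nor leak back into `df`.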
jekwatt commented 3 years ago

Example:

dfc = df.copy()

# setting multiple items using a mask, built here by prefix match ...
mask = dfc["a"].str.startswith('o')
# ... or by exact match (this assignment replaces the mask above)
mask = (dfc["a"] == "other")

# conditional change of column values
dfc.loc[mask, "a"] = 42

# change a cell value
dfc.iloc[0, 1] = "fake"

# same as above, but using loc (only if "a" is the second column and the row labeled 0 is the first row)
dfc.loc[0, "a"] = "fake"
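
The example above does not define `df`. As a rough sketch, here is a runnable variant with a hypothetical frame (the column `id`, the values, and the shuffled index are assumptions chosen to show the difference). It illustrates that `.iloc` addresses cells by position while `.loc` addresses them by label, so the two calls hit the same cell only when the index labels match the row positions:

```python
import pandas as pd

# Hypothetical data; the non-default index is chosen on purpose
df = pd.DataFrame(
    {"id": [1, 2, 3], "a": ["one", "other", "two"]},
    index=[10, 0, 20],
)
dfc = df.copy()  # work on a copy so df itself is left alone

# Positional: first row, second column (which happens to be "a")
dfc.iloc[0, 1] = "by position"

# Label-based: the row whose index label is 0, column "a"
dfc.loc[0, "a"] = "by label"
```

With a default `RangeIndex` the two assignments would target the same cell; here they target different rows.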
jekwatt commented 3 years ago

Change the Default SettingWithCopyWarning Behavior

The SettingWithCopyWarning is a warning, not an error. Your code will still execute when it’s issued, even though it may not work as intended.

To change this behavior, you can modify the pandas `mode.chained_assignment` option with `pandas.set_option()`. You can use the following settings:

`pd.set_option("mode.chained_assignment", "raise")` raises a `SettingWithCopyError`.
`pd.set_option("mode.chained_assignment", "warn")` issues a `SettingWithCopyWarning`. This is the default behavior.
`pd.set_option("mode.chained_assignment", None)` suppresses both the warning and the error.
jekwatt commented 3 years ago

`.at[]` is for changing or looking up a single value (it is faster than `.loc[]` for scalar access): `df.at[2, 'B'] = 4`
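
A short sketch of scalar access, assuming a hypothetical frame with a default integer index (the data is made up): `.at[]` takes a row label and a column label, while `.iat[]` takes integer positions.

```python
import pandas as pd

# Hypothetical sample data, assumed for illustration only
df = pd.DataFrame({"A": [1, 6, 9, 3], "B": [10, 20, 30, 40]})

# Label-based scalar set and get
df.at[2, "B"] = 4
value = df.at[2, "B"]

# Position-based scalar set: first row, second column
df.iat[0, 1] = 99
```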

jekwatt commented 2 years ago

Answers from James Powell: There are a couple of strategies. One strategy is to never mutate a Series or DataFrame: always make copies or always create new columns. (This is common guidance we've seen our consulting clients give their data scientists and modelers.)

Treating these pandas structures as though they were immutable (and as though you don't have to worry about running out of memory) is generally how I approach things, but there are still cases where you want to mutate something.
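
One possible way to follow the "treat it as immutable" strategy is to derive a new frame instead of assigning into the existing one. The sketch below is an illustration under assumptions (the frame, the threshold, and the choice of `assign` with `Series.mask` are mine, not something stated in the answer):

```python
import pandas as pd

# Hypothetical sample data, assumed for illustration only
df = pd.DataFrame({"A": [1, 6, 9], "B": [10, 20, 30]})

mask = df["A"] > 5

# Series.mask replaces values where the condition is True;
# assign returns a new DataFrame and leaves df untouched
new_df = df.assign(B=df["B"].mask(mask, 4))
```

The trade-off is extra memory for the new object, which is the caveat the answer itself mentions.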

In the cases where you have to mutate something, you want to make sure that you are performing the mutation on the original object. This means ensuring that your mutating operation is a consequence of syntax which ties directly back to the original object. In other words, the syntax is no more indirected (has no more steps) than df.loc[…] = … or df.iloc[…] = …

As Cameron showed, if you need to dig into the object to find the items you want to edit, then you do that as a separate step (in his example, by setting up a mask beforehand) but you keep the setting part of the operation to just df.loc[ … ] = … or df.iloc[ … ] = … with no additional dotted/getattr accesses or square-bracket/getitem accesses.
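
To make the point about extra getitem steps concrete, here is a small sketch with hypothetical data. It relies on the fact that selecting rows with a boolean mask returns a copy, so a second square-bracket access assigns into that temporary copy rather than into `df`. Warnings are silenced only to keep the demonstration quiet:

```python
import warnings

import pandas as pd

# Hypothetical sample data, assumed for illustration only
df = pd.DataFrame({"A": [1, 6, 9], "B": [10, 20, 30]})
mask = df["A"] > 5  # the "digging" happens here, as a separate step

# Two getitem steps: df[mask] is a temporary copy, so df is not modified
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[mask]["B"] = 4
after_chained = df["B"].tolist()

# One step tied directly to df: the assignment reaches the original
df.loc[mask, "B"] = 4
after_loc = df["B"].tolist()
```

Here `after_chained` still holds the original values, while `after_loc` reflects the update.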

By the way, this is the recent blog post from Joris Van den Bossche that Cameron mentioned about changing the default behaviour in pandas: https://jorisvandenbossche.github.io/blog/2022/04/07/pandas-copy-views/