pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

pandas' recommendation on inplace deprecation and categorical column #57104

Open adrinjalali opened 7 months ago

adrinjalali commented 7 months ago

While making scikit-learn's code compatible with pandas 2.2.0, I started from this minimal reproducer:

import pandas as pd

df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df["col"].replace(to_replace="a", value="b", inplace=True)

which results in:

$ python -Werror::FutureWarning /tmp/4.py
/tmp/4.py:1: DeprecationWarning: 
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466

  import pandas as pd
Traceback (most recent call last):
  File "/tmp/4.py", line 4, in <module>
    df["col"].replace(to_replace="a", value="b", inplace=True)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/generic.py", line 7963, in replace
    warnings.warn(
FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

The first pattern doesn't apply here, so from this message, I understand I should do:

import pandas as pd

df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df["col"] = df["col"].replace(to_replace="a", value="b")

But this also fails with:

$ python -Werror::FutureWarning /tmp/4.py
/tmp/4.py:1: DeprecationWarning: 
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466

  import pandas as pd
Traceback (most recent call last):
  File "/tmp/4.py", line 4, in <module>
    df["col"] = df["col"].replace(to_replace="a", value="b")
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/generic.py", line 8135, in replace
    new_data = self._mgr.replace(
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/internals/base.py", line 249, in replace
    return self.apply_with_block(
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 364, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 854, in replace
    values._replace(to_replace=to_replace, value=value, inplace=True)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 2665, in _replace
    warnings.warn(
FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.

With a bit of reading docs, it seems I need to do:

import pandas as pd

df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df["col"] = df["col"].cat.rename_categories({"a": "b"})

which fails with:

$ python -Werror::FutureWarning /tmp/4.py
/tmp/4.py:1: DeprecationWarning: 
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466

  import pandas as pd
Traceback (most recent call last):
  File "/tmp/4.py", line 4, in <module>
    df["col"] = df["col"].cat.rename_categories({"a": "b"})
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/accessor.py", line 112, in f
    return self._delegate_method(name, *args, **kwargs)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 2939, in _delegate_method
    res = method(*args, **kwargs)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 1205, in rename_categories
    cat._set_categories(new_categories)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 924, in _set_categories
    new_dtype = CategoricalDtype(categories, ordered=self.ordered)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 221, in __init__
    self._finalize(categories, ordered, fastpath=False)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 378, in _finalize
    categories = self.validate_categories(categories, fastpath=fastpath)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 579, in validate_categories
    raise ValueError("Categorical categories must be unique")
ValueError: Categorical categories must be unique
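
The ValueError above follows from what rename_categories does: it relabels the categories themselves rather than the values, so mapping "a" to "b" would leave two categories both named "b". A minimal sketch reproducing this on a bare Categorical (the variable names `cat`, `renamed` and `error_message` are illustrative only, not taken from the thread):

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "c"])

# Renaming "a" to a label that is not yet a category keeps the categories unique.
renamed = cat.rename_categories({"a": "z"})
print(list(renamed.categories))

# Renaming "a" to the existing label "b" would create duplicate categories,
# which CategoricalDtype rejects with a ValueError.
try:
    cat.rename_categories({"a": "b"})
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

In other words, rename_categories seems suited to one-to-one relabelling, not to merging one category into another.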

So rename_categories is not the one I want apparently, but reading through the "see also":

reorder_categories: Reorder categories.
add_categories: Add new categories.
remove_categories: Remove the specified categories.
remove_unused_categories: Remove categories which are not used.
set_categories: Set the categories to the specified ones.

None of them seems to do what I need.

So it seems the way to go would be:

import pandas as pd

df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df.loc[df["col"] == "a", "col"] = "b"
df["col"] = df["col"].astype("category").cat.remove_unused_categories()

Which is far from what the warning message suggests.
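
For reuse, the workaround above could be wrapped in a small function. This is only a sketch: `merge_category` and its parameters are hypothetical names, not a pandas API, and it assumes `new` is already one of the column's categories (assigning a value that is not a category raises an error for categorical data).

```python
import pandas as pd


def merge_category(df, column, old, new):
    """Hypothetical helper: fold category `old` into the existing category `new`."""
    out = df.copy()
    # Assigning an existing category through .loc keeps the categorical dtype.
    out.loc[out[column] == old, column] = new
    # Drop the category that no longer has any values.
    out[column] = out[column].cat.remove_unused_categories()
    return out


df = pd.DataFrame({"col": ["a", "b", "c"]}, dtype="category")
result = merge_category(df, "col", old="a", new="b")
print(result["col"].tolist())
print(list(result["col"].cat.categories))
```

The function works on a copy, so the caller's DataFrame is left untouched and no inplace method or chained assignment is involved.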

So at the end:

phofl commented 7 months ago

I think making rename_categories accept this might make the most sense. The solution you arrived at is probably the best option at the moment, but it is obviously not great.

cc @jbrockmendel

lesteve commented 7 months ago

FWIW I ended up with this (not great either), that I find a bit more readable (but this may depend on the reader :wink:):

import pandas as pd

df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df['col'] = df['col'].astype(object).replace(to_replace="a", value="b").astype("category")

jbrockmendel commented 7 months ago

I think eventually we want users to do obj.replace('a', 'b').cat.remove_unused_categories(). That works now, but the .replace issues a warning. I guess we could update the warning message to suggest this pattern for that particular use case.

adrinjalali commented 7 months ago

@jbrockmendel your code gives this warning now:

FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.

I'm not sure if you want to remove the warning in this case, or to suggest a different solution?

stuarteberg commented 5 days ago

Thanks for this thread.

Quoting jbrockmendel: "That works now, but the .replace issues a warning. I guess we could update the warning message to suggest this pattern for that particular use case."

The warning says:

In a future version, replace will only be used for cases that preserve the categories.

I would have expected a warning only if I were introducing NEW categories. If I'm just consolidating existing categories, there is no need for the dtype to change (thus, the categories can be preserved, even if some are now unused). Why is a warning necessary at all?
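
To illustrate the point about consolidation, here is a small sketch using plain .loc assignment instead of replace (the variable names are illustrative). It does not settle whether the warning is needed; it only shows that merging values into an existing category can leave the dtype as it was, with "a" remaining as an unused category.

```python
import pandas as pd

ser = pd.Series(["a", "b", "c"], dtype="category")
original_dtype = ser.dtype

# Consolidate "a" into the existing category "b" on a copy of the Series.
consolidated = ser.copy()
consolidated.loc[consolidated == "a"] = "b"

# The dtype is the same as before; "a" is still a category, just unused.
print(consolidated.dtype == original_dtype)
print(list(consolidated.cat.categories))
print(consolidated.tolist())
```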