pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

API: `astype` method fails to raise errors for `category` data type #59899

Open noahblakesmith opened 3 weeks ago

noahblakesmith commented 3 weeks ago

Reproducible Example

import pandas as pd

col = pd.Series(["a", "b", "c"], dtype=str)
cat = pd.api.types.CategoricalDtype(categories=["a", "b"])

col = col.astype(dtype=cat, errors="raise")
print(col)

0      a
1      b
2    NaN
dtype: category
Categories (2, object): ['a', 'b']

Issue Description

No error is raised when casting to a categorical dtype, even though the value c is not among the defined categories. Instead, c is silently coerced to NaN.

This behavior appears inconsistent with that of other data types, such as int.

Expected Behavior

I believe an error should be raised.
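Until the behavior changes, one way to get the error described above is to check the values before casting. The following is only a sketch: `astype_category_strict` is a hypothetical helper name, not a pandas API, and it assumes that already-missing values should be allowed to stay missing.

```python
import pandas as pd


def astype_category_strict(values: pd.Series, dtype: pd.CategoricalDtype) -> pd.Series:
    """Cast to a categorical dtype, raising if a value would be coerced to NaN.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Values that are already missing stay missing; only non-missing values
    # that are absent from the declared categories are treated as errors.
    unknown = values.notna() & ~values.isin(dtype.categories)
    if unknown.any():
        bad = list(values[unknown].unique())
        raise ValueError(f"values not in categories: {bad}")
    return values.astype(dtype)


cat = pd.CategoricalDtype(categories=["a", "b"])

# All values are valid categories, so the cast goes through.
ok = astype_category_strict(pd.Series(["a", "b", "a"]), cat)

# "c" is not a declared category, so the helper raises instead of coercing.
try:
    astype_category_strict(pd.Series(["a", "b", "c"]), cat)
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)
```

The check runs before the cast, so the original Series is never partially converted.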

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.10.14.final.0
python-bits : 64
OS : Darwin
OS-release : 23.6.0
Version : Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 2.2.2
numpy : 2.0.1
pytz : 2024.2
dateutil : 2.9.0.post0
setuptools : 72.1.0
pip : 24.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.26.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.9.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.4
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
asishm commented 3 weeks ago

Thanks for the report, could you please update the title to have a description?

That said, based on this comment (https://github.com/pandas-dev/pandas/issues/51074#issuecomment-1409344688), this is expected behavior.

rhshadrach commented 3 weeks ago

This behavior appears inconsistent with that of other data types, such as int.

Can you give an example that demonstrates the inconsistency?

noahblakesmith commented 2 weeks ago

Sure thing @rhshadrach. Here is an example using int, which throws an error. I also tested float, "Int64", and "int64[pyarrow]", which produced similar errors.

import pandas as pd

col = pd.Series(["a", "b", "c"])
col = col.astype(dtype=int, errors="raise")

Traceback (most recent call last):
  File "./test.py", line 4, in <module>
    col = col.astype(dtype=int, errors="raise")
  File "/opt/homebrew/Caskroom/miniconda/base/envs/test/lib/python3.10/site-packages/pandas/core/generic.py", line 6643, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/test/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 430, in astype
    return self.apply(
  File "/opt/homebrew/Caskroom/miniconda/base/envs/test/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 363, in apply
    applied = getattr(b, f)(**kwargs)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/test/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 758, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/test/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 237, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/test/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 182, in astype_array
    values = _astype_nansafe(values, dtype, copy=copy)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/test/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 133, in _astype_nansafe
    return arr.astype(dtype, copy=True)
ValueError: invalid literal for int() with base 10: 'a'
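
The contrast reported in this comment can be seen side by side with a short sketch that casts the same Series to several targets and records whether an exception is raised. The `outcomes` dictionary and the choice of targets are illustrative assumptions; only `int`, `float`, and the categorical dtype from the original report are covered.

```python
import pandas as pd

col = pd.Series(["a", "b", "c"])
cat = pd.CategoricalDtype(categories=["a", "b"])

# Record, for each target dtype, whether astype raised and with which exception type.
outcomes = {}
for name, target in [("int", int), ("float", float), ("category", cat)]:
    try:
        col.astype(target, errors="raise")
        outcomes[name] = "no error"
    except (ValueError, TypeError) as exc:
        outcomes[name] = type(exc).__name__

print(outcomes)
```
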
rhshadrach commented 2 weeks ago

Thanks @noahblakesmith. I would not call this inconsistent, since the categorical dtype has its own specialized semantics, as @asishm mentioned. This is well-established and purposeful behavior, so it is also not a bug.

That said, there is agreement that this is undesired behavior. This issue is very closely related to, and may even be fixed by, #40996.
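
As an illustration of the categorical semantics mentioned above, the sketch below compares `astype` with the `pd.Categorical` constructor, which likewise maps values outside the declared categories to NaN (stored internally as code -1). The variable names are illustrative only.

```python
import pandas as pd

values = ["a", "b", "c"]
dtype = pd.CategoricalDtype(categories=["a", "b"])

# Build the categorical directly and via astype; "c" is not a declared category.
via_constructor = pd.Series(pd.Categorical(values, dtype=dtype))
via_astype = pd.Series(values).astype(dtype)

# Both paths store -1 (missing) for the value outside the categories.
constructor_codes = via_constructor.cat.codes.tolist()
astype_codes = via_astype.cat.codes.tolist()
print(constructor_codes, astype_codes)
```

Both paths treat an out-of-category value as missing rather than as an error, which is the sense in which the `astype` result matches the rest of the categorical API.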