pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.9k stars 18.03k forks

BUG: #47920

Closed (paul-leydier closed this issue 2 years ago)

paul-leydier commented 2 years ago

Pandas version checks

Reproducible Example

import pandas as pd

a = pd.DataFrame({"A": ["toto", "tata", "tutu"]}, dtype="category")

print("pd.concat with a single df:")
print(pd.concat([a]).dtypes)  # category

print("pd.concat with two identical df:")
print(pd.concat([a, a.copy()]).dtypes)  # category

print("pd.concat with two df containing the same values of the category:")
b = pd.DataFrame({"A": ["tata", "tutu"]}, dtype="category")
print(pd.concat([a, b]).dtypes)  # object

print("pd.concat with two df containing different values of the category:")
c = pd.DataFrame({"A": ["titi"]}, dtype="category")
print(pd.concat([a, c]).dtypes)  # object

Issue Description

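When concatenating DataFrames that include categorical columns, the dtype of the column in the new DataFrame is inconsistent: it stays category when concatenating a single DataFrame or two identical DataFrames, but becomes object when the second DataFrame holds a subset of the categories or different ones (see the comments in the example above).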

Not sure if this is a bug, or if it is by design / if I am missing something here.

Expected Behavior

Concatenating two DataFrames that include a similar categorical column:

import pandas as pd

a = pd.DataFrame({"A": ["toto", "tata", "tutu"]}, dtype="category")
b = pd.DataFrame({"A": ["tata", "tutu"]}, dtype="category")
print(pd.concat([a, b]).dtypes)

Should output a categorical column:

A    category
dtype: object

Alternative solution: Concatenating a single DataFrame

import pandas as pd

a = pd.DataFrame({"A": ["toto", "tata", "tutu"]}, dtype="category")
print(pd.concat([a]).dtypes)

Should output an object dtype column, to be consistent with the "real" concatenation.

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 8d16504de035280a93fac8cd62040fcfb6e87dea
python           : 3.10.4.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.22000
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_United States.utf8
pandas           : 0+untagged.29862.g8d16504
numpy            : 1.23.1
pytz             : 2022.1
dateutil         : 2.8.2
setuptools       : 61.2.0
pip              : 22.1.2
Cython           : 0.29.32
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None

phofl commented 2 years ago

Hi, thanks for your report. This behaves as expected and is documented here: https://pandas.pydata.org/docs/user_guide/categorical.html#merging-concatenation
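
The linked section says that concatenation keeps the category dtype only when the inputs share the same categories, and suggests .astype or union_categoricals to get a categorical result otherwise. Below is a minimal sketch of that approach, reusing the frames a and c from the report. The names mixed, all_categories, shared and result are illustrative and not from the thread, and this is only one possible way to do it.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype, union_categoricals

a = pd.DataFrame({"A": ["toto", "tata", "tutu"]}, dtype="category")
c = pd.DataFrame({"A": ["titi"]}, dtype="category")

# The two columns carry different categorical dtypes, because the
# categories are part of the dtype itself.
print(a["A"].dtype == c["A"].dtype)

# A plain concat therefore falls back to a non-categorical dtype.
mixed = pd.concat([a, c], ignore_index=True)
print(mixed.dtypes)

# Build one dtype covering every category seen in either frame
# (assumption: the union of both category sets is what is wanted).
all_categories = union_categoricals([a["A"], c["A"]]).categories
shared = CategoricalDtype(categories=all_categories)

# Cast both frames to the shared dtype before concatenating.
result = pd.concat(
    [a.astype({"A": shared}), c.astype({"A": shared})],
    ignore_index=True,
)
print(result.dtypes)
```

Casting to a shared dtype up front keeps the values unchanged and only widens the set of allowed categories, so the concatenated column stays categorical.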