Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
a = pd.DataFrame({"A": ["toto", "tata", "tutu"]}, dtype="category")
print("pd.concat with a single df:")
print(pd.concat([a]).dtypes)  # category
print("pd.concat with two identical df:")
print(pd.concat([a, a.copy()]).dtypes)  # category
print("pd.concat with two df containing the same values of the category:")
b = pd.DataFrame({"A": ["tata", "tutu"]}, dtype="category")
print(pd.concat([a, b]).dtypes)  # object
print("pd.concat with two df containing different values of the category:")
c = pd.DataFrame({"A": ["titi"]}, dtype="category")
print(pd.concat([a, c]).dtypes)  # object
Issue Description
When concatenating DataFrames including categorical columns, the dtype of the column in the new DataFrame is inconsistent:
When concatenating a single DataFrame, the output column is categorical
When concatenating a single DataFrame with a copy of itself, the output column is categorical
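When concatenating two DataFrames whose categorical columns do not have the same set of categories (even when one set is a subset of the other), the output column is object, not categorical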
Not sure if this is a bug, or if it is by design / if I am missing something here.
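One way to see what separates the cases above is to compare the column dtypes directly. This is only a sketch: it assumes that pd.concat keeps a categorical result when all input columns share the same categorical dtype, which matches the outputs reported above but is not confirmed here as the intended design. The names same_as_copy and same_as_b are introduced purely for illustration.

```python
import pandas as pd

a = pd.DataFrame({"A": ["toto", "tata", "tutu"]}, dtype="category")
b = pd.DataFrame({"A": ["tata", "tutu"]}, dtype="category")

# Two categorical dtypes compare equal only if their categories match.
same_as_copy = a["A"].dtype == a.copy()["A"].dtype
same_as_b = a["A"].dtype == b["A"].dtype

print(same_as_copy)  # the copy carries the same categories as a
print(same_as_b)     # b has only a subset of a's categories

# Inspect the categories inferred for each column.
print(list(a["A"].cat.categories))
print(list(b["A"].cat.categories))
```

If the two dtypes compare unequal, that would be consistent with the concatenated column falling back to object.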
Expected Behavior
Concatenating two DataFrames that each include a categorical column with overlapping values:
import pandas as pd
a = pd.DataFrame({"A": ["toto", "tata", "tutu"]}, dtype="category")
b = pd.DataFrame({"A": ["tata", "tutu"]}, dtype="category")
print(pd.concat([a, b]).dtypes)
Should output a categorical column:
A category
dtype: object
Alternative solution:
Concatenating a single DataFrame
import pandas as pd
a = pd.DataFrame({"A": ["toto", "tata", "tutu"]}, dtype="category")
print(pd.concat([a]).dtypes)
Should output an object dtype column, to be consistent with the "real" concatenation.
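In the meantime, a possible workaround (a sketch, not a statement of how pandas should behave) is to cast both columns to one shared categorical dtype before concatenating. It uses union_categoricals from pandas.api.types to collect the categories; shared_dtype and result are illustrative names that do not appear in the report above.

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.DataFrame({"A": ["toto", "tata", "tutu"]}, dtype="category")
b = pd.DataFrame({"A": ["tata", "tutu"]}, dtype="category")

# Collect the categories of both columns into a single dtype.
combined = union_categoricals([a["A"], b["A"]])
shared_dtype = pd.CategoricalDtype(categories=combined.categories)

# Cast both frames to the shared dtype, then concatenate.
result = pd.concat(
    [a.astype({"A": shared_dtype}), b.astype({"A": shared_dtype})]
)
print(result.dtypes)
```

With both inputs carrying the same dtype, the concatenated column is expected to stay categorical, as in the a / a.copy() case.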
Installed Versions