pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.44k stars, 17.86k forks

BUG: Concat does not preserve category if lists of categories do not match. #42840

Open yohplala opened 3 years ago

yohplala commented 3 years ago

Code Sample, a copy-pastable example

import pandas as pd

dfx = pd.DataFrame({'a': [1, 2, 3], 'cat': ['a', 'b', 'a']}).astype({'cat': 'category'})
dfy = pd.DataFrame({'a': [3, 4, 5], 'cat': ['a', 'a', 'a']}).astype({'cat': 'category'})  # no 'b' here
# ignore_index=True gives the 0..5 index shown in the output below
dfz = pd.concat([dfx, dfy], ignore_index=True)

dfz['cat']
Out[55]: 
0    a
1    b
2    a
3    a
4    a
5    a
Name: cat, dtype: object

Problem description

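I expected (perhaps erroneously?) the resulting 'cat' column to keep the category dtype, with its categories being the union of the categories in 'dfx' and 'dfy'. Instead, because the two lists of categories do not match, the column is cast to object.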

Expected Output

dfz['cat']
Out[58]: 
0    a
1    b
2    a
3    a
4    a
5    a
Name: cat, dtype: category
Categories (2, object): ['a', 'b']
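
For reference, a possible workaround that produces the expected output: align both columns on the same set of categories before concatenating. This is only a minimal sketch, not a proposed fix for concat itself. The names `all_cats` and `frames` are illustrative, and it assumes the categoricals are unordered.

```python
import pandas as pd
from pandas.api.types import union_categoricals

dfx = pd.DataFrame({'a': [1, 2, 3], 'cat': ['a', 'b', 'a']}).astype({'cat': 'category'})
dfy = pd.DataFrame({'a': [3, 4, 5], 'cat': ['a', 'a', 'a']}).astype({'cat': 'category'})

# Build the union of the categories found in both columns.
all_cats = union_categoricals([dfx['cat'], dfy['cat']]).categories

# Give both columns the same categories so that their dtypes are identical.
frames = [df.assign(cat=df['cat'].cat.set_categories(all_cats)) for df in (dfx, dfy)]

# With identical categorical dtypes, concat keeps the category dtype.
dfz = pd.concat(frames, ignore_index=True)
print(dfz['cat'].dtype)
```

Once both inputs share exactly the same categories, the 'cat' column of the result stays categorical instead of falling back to object.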

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : c7f7443c1bad8262358114d5e88cd9c8a308e8aa
python : 3.8.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-63-generic
Version : #71~20.04.1-Ubuntu SMP Thu Jul 15 17:46:08 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.1
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.2
pip : 21.1.3
setuptools : 52.0.0.post20210125
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : 2021.07.0
fastparquet : 0.7.0
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : None
tables : None
tabulate : 0.8.9
xarray : 0.18.2
xlrd : None
xlwt : None
numba : 0.53.1

yohplala commented 3 years ago

Is the root cause possibly shared with issue #14016?