pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: index.intersection produces wrong result when using multiindex, categories and sorted index #49974

Closed Hanspagh closed 1 year ago

Hanspagh commented 1 year ago

Might be a duplicate of https://github.com/pandas-dev/pandas/issues/49337, but it is still an issue on 1.5.2.

Reproducible Example

import pandas as pd

a = pd.Categorical(["a","b"], categories=["a", "b"])
b = pd.Categorical(["a","b"], categories=["b", "a"])

one = pd.Categorical(["1","2"], categories=["1", "2"])
two = pd.Categorical(["1","2"], categories=["2", "1"])

Failing case

dfa = pd.DataFrame({"x": a, "y": one}).set_index(["x", "y"]).sort_index()
dfb = pd.DataFrame({"x": b, "y": two}).set_index(["x", "y"]).sort_index()
print(dfa.index)
print(dfb.index)
print(dfa.index.intersection(dfb.index))
print(dfb.index.intersection(dfa.index))

Output

MultiIndex([('a', '1'),
            ('b', '2')],
           names=['x', 'y'])
MultiIndex([('b', '2'),
            ('a', '1')],
           names=['x', 'y'])
MultiIndex([('b', '2')],
           names=['x', 'y'])
MultiIndex([('b', '2')],
           names=['x', 'y'])

('a', '1') is missing from the intersection

Expected case

dfa = pd.DataFrame({"x": a, "y": one}).set_index(["x", "y"])
dfb = pd.DataFrame({"x": b, "y": two}).set_index(["x", "y"])
print(dfa.index)
print(dfb.index)
print(dfa.index.intersection(dfb.index))
print(dfb.index.intersection(dfa.index))

Output

MultiIndex([('a', '1'),
            ('b', '2')],
           names=['x', 'y'])
MultiIndex([('a', '1'),
            ('b', '2')],
           names=['x', 'y'])
MultiIndex([('a', '1'),
            ('b', '2')],
           names=['x', 'y'])
MultiIndex([('a', '1'),
            ('b', '2')],
           names=['x', 'y'])

Issue Description

When creating a MultiIndex from categoricals whose categories are not in the same order, the result of intersection is missing values. Looking at the code, it looks like certain performance optimizations are applied when both indexes are sorted. I assume the sort order is expected to be the same for both indexes, which is not the case when constructing the categories as above. Using .join on the two frames does seem to work.
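To illustrate why the two sorted indexes end up in a different order, here is a minimal sketch using the same categoricals as the reproducer. It only shows that the integer codes and the resulting sort order depend on the category order; it does not reproduce the internal intersection code path, which I have not traced in detail.

```python
import pandas as pd

# Same values, but the categories are declared in a different order.
a = pd.Categorical(["a", "b"], categories=["a", "b"])
b = pd.Categorical(["a", "b"], categories=["b", "a"])

# Each value is stored as a position in the category list,
# so identical values get different integer codes.
codes_a = a.codes.tolist()
codes_b = b.codes.tolist()

# Sorting follows the category order, not the alphabetical order,
# so the "sorted" results disagree between the two objects.
sorted_a = pd.Series(a).sort_values().astype(str).tolist()
sorted_b = pd.Series(b).sort_values().astype(str).tolist()

print(codes_a, codes_b)
print(sorted_a, sorted_b)
```

Both indexes are therefore sorted from their own point of view, but not relative to each other, which is consistent with the assumption above about the optimization.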

This might be an unsupported way to use pandas; if so, feel free to ignore the above. I ended up with differently ordered categories because, when reading parquet data with pyarrow and the strings_to_categorical option, the categories are created in the order the values appear in the column instead of alphabetically.
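A possible workaround, sketched under the assumption that giving both frames the same category order is acceptable for the data at hand: reorder the categories alphabetically before building the index. The helper name normalize_categories is hypothetical and not part of pandas; reorder_categories is the existing accessor method.

```python
import pandas as pd


def normalize_categories(df, columns):
    """Return a copy of df whose categorical columns use sorted categories.

    Hypothetical helper for illustration only; not a pandas API.
    """
    out = df.copy()
    for col in columns:
        cats = sorted(out[col].cat.categories)
        # Values stay the same; only the category order (and codes) change.
        out[col] = out[col].cat.reorder_categories(cats)
    return out


left = pd.DataFrame({
    "x": pd.Categorical(["a", "b"], categories=["a", "b"]),
    "y": pd.Categorical(["1", "2"], categories=["1", "2"]),
})
right = pd.DataFrame({
    "x": pd.Categorical(["a", "b"], categories=["b", "a"]),
    "y": pd.Categorical(["1", "2"], categories=["2", "1"]),
})

keys = ["x", "y"]
left_idx = normalize_categories(left, keys).set_index(keys).sort_index().index
right_idx = normalize_categories(right, keys).set_index(keys).sort_index().index

# With identical category order on both sides, the sorted indexes agree.
common = left_idx.intersection(right_idx)
print(common)
```

This sidesteps the mismatch rather than fixing it, so it is only useful on versions where the bug is still present.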

Expected Behavior

The correct overlap of the indexes as shown above.
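One way to check the expected behavior on a given pandas version is to compare the result of intersection with a plain Python set intersection of the index tuples. This is only a sketch; the function name check_intersection is made up for this example.

```python
import pandas as pd


def check_intersection(left_index, right_index):
    """Compare MultiIndex.intersection with a set-based reference result.

    Hypothetical helper: returns (expected, got) as sets of tuples.
    """
    expected = set(left_index) & set(right_index)
    got = set(left_index.intersection(right_index))
    return expected, got


a = pd.Categorical(["a", "b"], categories=["a", "b"])
b = pd.Categorical(["a", "b"], categories=["b", "a"])
one = pd.Categorical(["1", "2"], categories=["1", "2"])
two = pd.Categorical(["1", "2"], categories=["2", "1"])

dfa = pd.DataFrame({"x": a, "y": one}).set_index(["x", "y"]).sort_index()
dfb = pd.DataFrame({"x": b, "y": two}).set_index(["x", "y"]).sort_index()

expected, got = check_intersection(dfa.index, dfb.index)
# On an affected version the two sets differ; on a fixed version they match.
print("expected:", expected)
print("got:     ", got)
print("matches: ", expected == got)
```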

Installed Versions

INSTALLED VERSIONS
------------------
commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python : 3.9.13.final.0
python-bits : 64
OS : Darwin
OS-release : 21.4.0
Version : Darwin Kernel Version 21.4.0: Fri Mar 18 00:45:05 PDT 2022; root:xnu-8020.101.4~15/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 1.5.2
numpy : 1.23.5
pytz : 2022.6
dateutil : 2.8.2
setuptools : 58.1.0
pip : 22.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.7.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

phofl commented 1 year ago

Works on main and is a duplicate of the other issue