import pandas as pd

# Same values in each pair, but the category order differs.
a = pd.Categorical(["a", "b"], categories=["a", "b"])
b = pd.Categorical(["a", "b"], categories=["b", "a"])
one = pd.Categorical(["1", "2"], categories=["1", "2"])
two = pd.Categorical(["1", "2"], categories=["2", "1"])
When creating a MultiIndex from categoricals whose categories are not in the same order, the result of `intersection` is missing values. Looking at the code, it looks like we do certain performance optimizations when both indexes are sorted. I assume the sort order is expected to be the same on both sides, which is not the case when constructing categories like above. Joining the two frames with `.join` does seem to work.
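A minimal sketch of the scenario described above. It assumes the indexes are built with `MultiIndex.from_arrays`, since the original construction is not shown here; the level names `letter` and `number` and the variables `left`, `right`, `result`, and `expected` are illustrative only. It compares `intersection` against a plain tuple-set overlap, which does not depend on category order.

```python
import pandas as pd

# Same values on both sides, but the category order differs.
a = pd.Categorical(["a", "b"], categories=["a", "b"])
b = pd.Categorical(["a", "b"], categories=["b", "a"])
one = pd.Categorical(["1", "2"], categories=["1", "2"])
two = pd.Categorical(["1", "2"], categories=["2", "1"])

# Assumption: the indexes are built from arrays; level names are made up.
left = pd.MultiIndex.from_arrays([a, one], names=["letter", "number"])
right = pd.MultiIndex.from_arrays([b, two], names=["letter", "number"])

# Result under test. On affected versions this may drop entries.
result = left.intersection(right)

# Reference overlap computed on plain tuples, ignoring category order.
expected = set(left.tolist()) & set(right.tolist())

print("intersection:", sorted(result.tolist()))
print("tuple overlap:", sorted(expected))
```

If the two printed lists differ, the installed pandas version is likely affected by the behavior reported here.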
This might be an unsupported way to use pandas; if that is the case, you can just ignore the above. I ended up with differently ordered categories because I read parquet data with pyarrow using the `strings_to_categorical` option, which creates categories in the order the values are seen in the column instead of alphabetically.
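As a possible workaround, the categories can be put into a common order before building the indexes. The sketch below uses only pandas to mimic the "order of appearance" situation; `normalize_categories` is a hypothetical helper, not a pandas API, and whether this fully sidesteps the problem on a given version has not been verified here.

```python
import pandas as pd

# Categories in order of first appearance, mimicking the parquet read.
seen_order = pd.Categorical(["b", "a", "b"], categories=["b", "a"])
# Same values with alphabetically ordered categories.
alpha_order = pd.Categorical(["b", "a", "b"], categories=["a", "b"])

# The values are equal, but the underlying integer codes differ.
print(list(seen_order) == list(alpha_order))
print(seen_order.codes.tolist(), alpha_order.codes.tolist())


def normalize_categories(cat: pd.Categorical) -> pd.Categorical:
    """Hypothetical helper: return a copy with sorted categories.

    The values are unchanged; only the category order (and therefore
    the integer codes) is rewritten.
    """
    return cat.reorder_categories(sorted(cat.categories))


normalized = normalize_categories(seen_order)
print(list(normalized.categories), normalized.codes.tolist())
```

Applying the same normalization to every categorical column on both frames gives both sides identical category order, at the cost of one extra pass per column.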
Expected Behavior
The correct overlap of the indexes as shown above.
This might be a duplicate of https://github.com/pandas-dev/pandas/issues/49337, but it is still an issue on 1.5.2.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas. This seems to have been fixed on main by this line https://github.com/pandas-dev/pandas/blob/4d0a4365870893e232f639af13fea44c7d3ff9d4/pandas/core/indexes/base.py#L3235, but the addition seems unrelated to this bug.
Reproducible Example
Failing case
Output
('a', '1') is missing from the intersection
Expected case
Output
Installed Versions