Open sinhrks opened 9 years ago
So I would operate on both values and categories:
intersection: you should take the intersection of the categories as well.
union: take the union of the categories.
difference: has to take the categories of the lhs.
sym_diff: take the sym_diff of the categories (so make this a CI as well).
One option to consider is to (for now) only allow these operations between indexes with the same categories.
I think you simply call self._create_categorical on the rhs to coerce nicely.
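The per-operation rules for categories proposed above can be sketched in plain Python. This is only an illustration of the proposal, not pandas code: the function name result_categories is hypothetical, and keeping categories in order of first appearance is an assumption of the sketch.

```python
def result_categories(op, lhs, rhs):
    """Return the categories the result would carry under the proposed rules.

    `lhs` and `rhs` are the category lists of the two indexes.
    Order of first appearance is kept (an assumption of this sketch).
    """
    lhs_set, rhs_set = set(lhs), set(rhs)
    if op == "intersection":
        return [c for c in lhs if c in rhs_set]
    if op == "union":
        return list(lhs) + [c for c in rhs if c not in lhs_set]
    if op == "difference":
        # The result keeps the categories of the left-hand side.
        return list(lhs)
    if op == "sym_diff":
        only_lhs = [c for c in lhs if c not in rhs_set]
        only_rhs = [c for c in rhs if c not in lhs_set]
        return only_lhs + only_rhs
    raise ValueError(f"unknown set operation: {op!r}")


print(result_categories("union", [1, 2, 3], [2, 3, 4]))  # [1, 2, 3, 4]
```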
IMO (and I think not all agree here), a Categorical defines a new type and is therefore similar to an int or a string (e.g. a number of type int can be one value of int_min..int_max, similar to a value in a Categorical, which can only be one of the categories in that Categorical).
Therefore a CategoricalIndex should behave like two indexes of type int if the categories (and ordered) are the same, and like one int index and one string index if they have different categories / ordered.
So:
>>> pd.Index([1, 2, 3, 1, 2, 3]).intersection(pd.Index(["2", "3", "4", "2", "3", "4"]))
Index([], dtype='object')
>>> pd.CategoricalIndex([1, 2, 3, 1, 2, 3]).intersection(pd.CategoricalIndex([2, 3, 4, 2, 3, 4]))
# because the underlying categoricals have different categories [1, 2, 3] and [2, 3, 4]
Index([], dtype='object')
See also this:
>>> pd.Categorical([1,2,3], ordered=True) > pd.Categorical([2,3,4], ordered=True)
TypeError: Categoricals can only be compared if 'categories' are the same
>>> 1 > "2" # on py3, py2 is ...
TypeError: unorderable types: int() > str()
For reference, R doesn't care about the order of categories and removes duplicated categories.
intersect(as.factor(c(1, 2, 3)), as.factor(c(2, 3, 4)))
# [1] "2" "3"
intersect(as.factor(c(1, 2, 3, 1, 2, 3)), as.factor(c(2, 3, 4, 2, 3, 4)))
# [1] "2" "3"
intersect(c(1, 2, 3, 1, 2, 3), c(2, 3, 4, 2, 3, 4))
# [1] 2 3
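For comparison, a rough Python analogue of what the R calls above appear to do with factors: compare by label, drop duplicates, and keep the order of the first argument. The function name r_like_intersect is hypothetical, and using str() to stand in for factor labels is an assumption of this sketch.

```python
def r_like_intersect(x, y):
    """Intersect two sequences by their string labels, without duplicates."""
    y_labels = {str(v) for v in y}
    seen = set()
    result = []
    for value in x:
        label = str(value)
        # Keep each shared label once, in order of first appearance in x.
        if label in y_labels and label not in seen:
            seen.add(label)
            result.append(label)
    return result


print(r_like_intersect([1, 2, 3, 1, 2, 3], [2, 3, 4, 2, 3, 4]))  # ['2', '3']
```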
Let me summarize the current opinions and choices. If I misunderstand, please let me know:
| | Category order is identical | Category order is different |
|---|---|---|
| Categories are identical | Perform set ops against values and categories. The resulting values should be identical to the normal Index's result. | Ignore order and perform set ops (1), return empty (2), or raise an error (3)? |
| Categories are different | - | Ignore order and perform set ops (1), return empty (2), or raise an error (3)? |
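To make the three candidate behaviors in the table concrete, here is a minimal plain-Python sketch for intersection. The function name categorical_intersection and the policy strings are hypothetical, and treating any difference in the category lists (contents or order) as a mismatch is an assumption of the sketch.

```python
def categorical_intersection(values_a, cats_a, values_b, cats_b,
                             policy="ignore_order"):
    """Intersect values under one of the three policies from the table.

    policy="ignore_order": (1) perform the set op on the values anyway
    policy="empty":        (2) return an empty result on a mismatch
    policy="raise":        (3) raise an error on a mismatch
    """
    if policy not in ("ignore_order", "empty", "raise"):
        raise ValueError(f"unknown policy: {policy!r}")
    if list(cats_a) != list(cats_b):
        if policy == "empty":
            return []
        if policy == "raise":
            raise TypeError("categories (including their order) must match")
    in_b = set(values_b)
    # dict.fromkeys drops duplicates while keeping first-appearance order.
    return list(dict.fromkeys(v for v in values_a if v in in_b))
```

With values ["a", "b", "a"] and ["b", "c"] and categories that differ only in order, policy (1) gives ["b"], policy (2) gives [], and policy (3) raises TypeError.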
Resurrecting this as part of my CategoricalDtype refactor. The semantics in union_categoricals are good for union, I think.
Currently CategoricalIndex.union(other) discards .ordered, which isn't great:
In [22]: a = pd.CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c'], ordered=True)
In [23]: b = pd.CategoricalIndex(['b', 'c'], categories=['a', 'b', 'c'], ordered=True)
In [24]: a.union(b).ordered
Out[24]: False
I think we'll follow those rules on the categories for each of the set operations.
Actually, I think we can handle additional cases with union_categoricals when both are ordered. Currently we require that categories match exactly when ordered. We could easily support
x | y when x is a strict subset of y: {a < b} | {a < b < c} -> {a < b < c}
x | y when the elements of x - y are all greater than max(y), or all less than min(y), e.g. {a < b < c < d} | {a < b < c} -> {a < b < c < d}
We could even support union over categoricals with "gaps" like
{a < b < d < e} | {a < b < c < d < e} -> {a < b < c < d < e}
These rules should work for intersect, difference, and symmetric difference too.
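One possible way to cover the ordered cases above (subset, extension at either end, and "gaps") is to merge the two category orders only when they imply a single unambiguous total order. The sketch below does that with a topological sort in plain Python. The function name merge_ordered_categories is hypothetical, and rejecting ambiguous or conflicting orders with ValueError is an assumption, not existing pandas behavior.

```python
def merge_ordered_categories(x, y):
    """Merge two ordered category lists into one total order, if unambiguous."""
    x, y = list(x), list(y)
    nodes = list(dict.fromkeys(x + y))
    # Each pair of neighbours in a chain is an "lo < hi" constraint.
    successors = {n: set() for n in nodes}
    for chain in (x, y):
        for lo, hi in zip(chain, chain[1:]):
            successors[lo].add(hi)
    indegree = {n: 0 for n in nodes}
    for lo in nodes:
        for hi in successors[lo]:
            indegree[hi] += 1

    merged = []
    ready = [n for n in nodes if indegree[n] == 0]
    while ready:
        if len(ready) > 1:
            # Two candidates with no constraint between them.
            raise ValueError(f"ambiguous order between {sorted(map(str, ready))}")
        node = ready.pop()
        merged.append(node)
        for hi in successors[node]:
            indegree[hi] -= 1
            if indegree[hi] == 0:
                ready.append(hi)
    if len(merged) != len(nodes):
        # Leftover nodes mean the two orders contradict each other.
        raise ValueError("conflicting category orders")
    return merged


print(merge_ordered_categories("abde", "abcde"))  # ['a', 'b', 'c', 'd', 'e']
```

Under this sketch {a < c} | {a < b} is rejected, because nothing says whether b or c comes first.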
You guys probably already know this, but in case not, FYI: this is the current (0.24.2) behavior for the union of categorical indexes, which makes it difficult to do anything involving two slightly different categorical indexes:
>>> pd.CategoricalIndex([1, 2, 4]).union(pd.CategoricalIndex([2, 3, 4]))
CategoricalIndex([1, 2, 4, nan], categories=[1, 2, 4], ordered=False, dtype='category')
not
CategoricalIndex([1, 2, 4, 3], categories=[1, 2, 4, 3], ordered=False, dtype='category')
# or something
Derived from #10157. Would like to clarify what these results should be. Basically, I think:
CategoricalIndex
Index
which has the same original values.
category should only include categories which the result actually has.
The following are the current results.
intersection
union
Doc says "Form the union of two Index objects and sorts if possible". I'm not sure whether the last sentence says "raise error if sort is impossible" or "not sort if impossible"?
difference
sym_diff