pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

API: Set ops for CategoricalIndex #10186

Open sinhrks opened 9 years ago

sinhrks commented 9 years ago

Derived from #10157. Would like to clarify what these results should be. Basically, I think:

The following are the current results.

intersection

# for reference
pd.Index([1, 2, 3, 1, 2, 3]).intersection(pd.Index([2, 3, 4, 2, 3, 4]))
# Int64Index([2, 2, 3, 3], dtype='int64')

pd.CategoricalIndex([1, 2, 3, 1, 2, 3]).intersection(pd.CategoricalIndex([2, 3, 4, 2, 3, 4]))
# CategoricalIndex([2, 2, 3, 3], categories=[1, 2, 3], ordered=False, dtype='category')
# -> Is this OK, or should it have categories=[2, 3]?

union

The docs say "Form the union of two Index objects and sorts if possible". I'm not sure whether the last part means "raise an error if sorting is impossible" or "do not sort if sorting is impossible".

pd.Index([1, 2, 4]).union(pd.Index([2, 3, 4]))
# Int64Index([1, 2, 3, 4], dtype='int64')

pd.CategoricalIndex([1, 2, 4]).union(pd.CategoricalIndex([2, 3, 4]))
# CategoricalIndex([1, 2, 4, 3], categories=[1, 2, 3, 4], ordered=False, dtype='category')
# -> Should be sorted?
pd.Index([1, 2, 3, 1, 2, 3]).union(pd.Index([2, 3, 4, 2, 3, 4]))
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
# -> Should this result in Index([1, 2, 3, 1, 2, 3, 4, 4])?

pd.CategoricalIndex([1, 2, 3, 1, 2, 3]).union(pd.CategoricalIndex([2, 3, 4, 2, 3, 4]))
# TypeError: type() takes 1 or 3 arguments
# -> Should raise an understandable error, or Int64Index shouldn't raise (and should return an unsorted result?)

difference

pd.CategoricalIndex([1, 2, 4, 5]).difference(pd.CategoricalIndex([2, 3, 4]))
# Int64Index([1, 5], dtype='int64')
# -> should be CategoricalIndex?

sym_diff

pd.CategoricalIndex([1, 2, 4, 5]).sym_diff(pd.CategoricalIndex([2, 4]))
# Int64Index([1, 5], dtype='int64')
# -> should be CategoricalIndex?

jreback commented 9 years ago

So I would operate on both values and categories:

intersection: take the intersection of the categories as well
union: take the union of the categories
difference: has to take the categories of the lhs
sym_diff: take the sym_diff of the categories (so make this a CI as well)
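
A minimal sketch of the rules above, for illustration only. `categorical_set_op` is a hypothetical helper, not a pandas API, and it assumes that duplicate values are dropped and first-seen order is kept, which the thread has not settled.

```python
import pandas as pd


def categorical_set_op(left, right, op):
    """Hypothetical helper: apply one set op to values and categories."""
    ops = {
        "intersection": lambda a, b: [x for x in a if x in b],
        "union": lambda a, b: a + [x for x in b if x not in a],
        "difference": lambda a, b: [x for x in a if x not in b],
        "sym_diff": lambda a, b: [x for x in a if x not in b]
        + [x for x in b if x not in a],
    }
    pick = ops[op]
    # Assumption: duplicates are dropped and first-seen order is kept.
    values = pick(list(dict.fromkeys(left)), list(dict.fromkeys(right)))
    if op == "difference":
        # difference keeps the categories of the left-hand side
        categories = list(left.categories)
    else:
        categories = pick(list(left.categories), list(right.categories))
    return pd.CategoricalIndex(values, categories=categories)


left = pd.CategoricalIndex([1, 2, 3, 1, 2, 3])
right = pd.CategoricalIndex([2, 3, 4, 2, 3, 4])
both = categorical_set_op(left, right, "intersection")
diff = categorical_set_op(
    pd.CategoricalIndex([1, 2, 4, 5]), pd.CategoricalIndex([2, 3, 4]), "difference"
)
```

With these rules every operation returns a CategoricalIndex, and the categories of the result never contain values that the operation itself would exclude (except for difference, which keeps the lhs categories).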

jorisvandenbossche commented 9 years ago

One option to consider is to (for now) only allow these operations with indexes with the same categories

jreback commented 9 years ago

I think you simply call self._create_categorical on the rhs to coerce nicely

jankatins commented 9 years ago

IMO (and I think not all agree here), a Categorical defines a new type and is therefore similar to an int or a string (e.g. a number of type int can be one value of int_min..int_max, similar to a value in a Categorical, which can only be one of the categories in that Categorical).

Therefore two CategoricalIndex objects should behave like two indexes of type int if the categories (and ordered) are the same, and like one int index and one string index if they have different categories / ordered.

So:

>>> pd.Index([1, 2, 3, 1, 2, 3]).intersection(pd.Index(["2", "3", "4", "2", "3", "4"]))
Index([], dtype='object')
>>> pd.CategoricalIndex([1, 2, 3, 1, 2, 3]).intersection(pd.CategoricalIndex([2, 3, 4, 2, 3, 4]))
# because the underlying categoricals have different categories [1,2,3] and [2,3,4]
Index([], dtype='object') 

See also this:

>>> pd.Categorical([1,2,3], ordered=True) > pd.Categorical([2,3,4], ordered=True)
TypeError: Categoricals can only be compared if 'categories' are the same
>>> 1 > "2" # on py3, py2 is ...
TypeError: unorderable types: int() > str()
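
The "categorical as a type" view above could be sketched roughly as follows. `strict_intersection` is a hypothetical helper, not part of pandas; it assumes that comparing the two `dtype` attributes is an acceptable way to check "same categories and ordered", and that duplicates are dropped from the result.

```python
import pandas as pd


def strict_intersection(left, right):
    """Hypothetical helper: only intersect when the dtypes are equal."""
    if left.dtype != right.dtype:
        # Different categories / ordered: treat like int vs. str indexes.
        return pd.Index([], dtype=object)
    right_values = set(right)
    shared = [v for v in dict.fromkeys(left) if v in right_values]
    return pd.CategoricalIndex(shared, dtype=left.dtype)


# Different categories ([1, 2, 3] vs. [2, 3, 4]): nothing is comparable.
empty = strict_intersection(
    pd.CategoricalIndex([1, 2, 3]), pd.CategoricalIndex([2, 3, 4])
)

# Same categories on both sides: behaves like a normal intersection.
cats = [1, 2, 3, 4]
same = strict_intersection(
    pd.CategoricalIndex([1, 2, 3], categories=cats),
    pd.CategoricalIndex([2, 3, 4], categories=cats),
)
```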

sinhrks commented 9 years ago

For reference, R doesn't care about the order of categories and removes duplicated categories.

intersect(as.factor(c(1, 2, 3)), as.factor(c(2, 3, 4)))
# [1] "2" "3"
intersect(as.factor(c(1, 2, 3, 1, 2, 3)), as.factor(c(2, 3, 4, 2, 3, 4)))
# [1] "2" "3"

intersect(c(1, 2, 3, 1, 2, 3), c(2, 3, 4, 2, 3, 4))
# [1] 2 3

Let me summarize the current opinions and choices. If I have misunderstood anything, please let me know:

| | Category order is identical | Category order is different |
|---|---|---|
| Categories are identical | Perform set ops against values and categories. The resulting values should be identical to the normal index's result. | Ignore order and perform set ops (1), return empty (2) or raise error (3)? |
| Categories are different | - | Ignore order and perform set ops (1), return empty (2) or raise error (3)? |

TomAugspurger commented 7 years ago

Resurrecting this as part of my CategoricalDtype refactor. The semantics in union_categoricals are good for union I think:

Currently CategoricalIndex.union(other) discards the .ordered, which isn't great.

In [22]: a = pd.CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c'], ordered=True)

In [23]: b = pd.CategoricalIndex(['b', 'c'], categories=['a', 'b', 'c'], ordered=True)

In [24]: a.union(b).ordered
Out[24]: False

I think we'll follow those rules on the categories for each of the set operations.
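
For comparison, here is a small sketch of how `union_categoricals` treats the same two inputs. Note that it concatenates values rather than forming a set union, so the deduplication step at the end is my own assumption about how a set-style `union` might reuse it, not something pandas does today.

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.CategoricalIndex(["a", "b"], categories=["a", "b", "c"], ordered=True)
b = pd.CategoricalIndex(["b", "c"], categories=["a", "b", "c"], ordered=True)

# union_categoricals concatenates the values; it does not deduplicate them.
combined = union_categoricals([a.values, b.values])

# Assumption: deduplicate afterwards to get set-like union values.
result = pd.CategoricalIndex(combined).unique()

print(combined.ordered, result.ordered)
```

Both flags should come out as True here, because the two inputs share identical ordered categories.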

TomAugspurger commented 7 years ago

Actually, I think we can handle additional cases with union_categoricals when both are ordered. Currently we require that categories match exactly when ordered. We could easily support

We could even support union over categoricals with "gaps" like

{a < b < d < e} | {a < b < c < d < e} -> { a < b < c < d < e}

These rules should work for intersect, difference, and symmetric difference too.
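
A rough sketch of how ordered categories with "gaps" might be merged. `merge_ordered_categories` is a hypothetical name, and the tie-breaking rule (left-only categories before right-only ones when their relative order is not determined) is an assumption, not something proposed in this thread.

```python
def merge_ordered_categories(left, right):
    """Hypothetical helper: merge two ordered category lists.

    Raises ValueError if the shared categories appear in a different
    relative order on the two sides.
    """
    left_set, right_set = set(left), set(right)
    shared_left = [c for c in left if c in right_set]
    shared_right = [c for c in right if c in left_set]
    if shared_left != shared_right:
        raise ValueError("ordered categories conflict")

    merged, j = [], 0
    for c in left:
        if c in right_set:
            # Flush right-only categories that precede this shared one.
            while right[j] != c:
                merged.append(right[j])
                j += 1
            j += 1  # step past the shared category itself
        merged.append(c)
    # Whatever is left on the right comes after every left category.
    merged.extend(right[j:])
    return merged


merged = merge_ordered_categories(list("abde"), list("abcde"))
print(merged)  # ['a', 'b', 'c', 'd', 'e']
```

The same merged ordering could then serve as the categories for intersect, difference, and symmetric difference results.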

joseortiz3 commented 5 years ago

You guys probably already know this, but in case not, FYI: This is the current (0.24.2) behavior for the union of categorical indices, which makes it difficult to do anything involving two slightly different categorical indices:

>>> pd.CategoricalIndex([1, 2, 4]).union(pd.CategoricalIndex([2, 3, 4]))
CategoricalIndex([1, 2, 4, nan], categories=[1, 2, 4], ordered=False, dtype='category')

not

CategoricalIndex([1, 2, 4, 3], categories=[1, 2, 4, 3], ordered=False, dtype='category') 
# or something
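
One possible workaround, as a sketch only: take the union on plain (non-categorical) indexes and rebuild the CategoricalIndex afterwards. This assumes it is acceptable for the categories to be inferred from the combined values, and it yields a sorted result rather than the `[1, 2, 4, 3]` ordering shown above.

```python
import pandas as pd

left = pd.CategoricalIndex([1, 2, 4])
right = pd.CategoricalIndex([2, 3, 4])

# Union the underlying values as ordinary indexes, avoiding the NaN that
# appears when `right` is coerced to the categories of `left`.
values = pd.Index(left.tolist()).union(pd.Index(right.tolist()))

# Rebuild a CategoricalIndex; categories are inferred from the values.
result = pd.CategoricalIndex(values)
print(result)
```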