pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Index.difference and Index.intersection don't preserve the type of Index for some Index subclasses in corner cases #20040

Closed Dr-Irv closed 6 years ago

Dr-Irv commented 6 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd

pi1 = pd.PeriodIndex(start='2000', end='2010', freq='A')
print(pi1.difference(pi1), pi1.intersection(pi1.drop(pi1)))

ci = pd.CategoricalIndex(['a','b','c'], categories=['a','b','c'])
print(ci.difference(ci), ci.intersection(ci.drop(ci)))

ri = pd.RangeIndex(start=1, stop=5)
print(ri.difference(ri), ri.intersection(ri.drop(ri)))

Problem description

For several Index subclasses, taking the difference of an index with itself returns a plain Index rather than an empty index of the subclass type.

From a set algebra point of view, for a set S, S.difference(S) should equal S.intersection(nullset).

The output from the above is:

Index([], dtype='object') PeriodIndex([], dtype='period[A-DEC]', freq='A-DEC')
Index([], dtype='object') CategoricalIndex([], categories=['a', 'b', 'c'], ordered=False, dtype='category')
Index([], dtype='object') Int64Index([], dtype='int64')

There is some discussion in pull request #19849, where I discovered this bug, but at the request of @jreback I have split it into a separate issue.

Expected Output

PeriodIndex([], dtype='period[A-DEC]', freq='A-DEC') PeriodIndex([], dtype='period[A-DEC]', freq='A-DEC')
CategoricalIndex([], categories=['a', 'b', 'c'], ordered=False, dtype='category') CategoricalIndex([], categories=['a', 'b', 'c'], ordered=False, dtype='category')
RangeIndex(start=0, stop=0, step=1) RangeIndex(start=0, stop=0, step=1)

Note that for RangeIndex, the result of the intersection operation is also incorrect: it returns an Int64Index instead of a RangeIndex.
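The set identity in the problem description can be checked mechanically. The sketch below is only an illustration: `setop_result_types` is a hypothetical helper, not part of pandas, and it is demonstrated with a built-in `frozenset` so that it runs without any dependencies. It relies only on the object having `difference` and `intersection` methods.

```python
def setop_result_types(s, empty):
    """Report the result types of s.difference(s) and s.intersection(empty).

    Hypothetical helper for illustration; works on any object that
    provides difference() and intersection(), and supports len().
    """
    diff = s.difference(s)
    inter = s.intersection(empty)
    both_empty = len(diff) == 0 and len(inter) == 0
    return type(diff).__name__, type(inter).__name__, both_empty


# Demonstrated with a built-in frozenset, where both results share one type.
s = frozenset({1, 2, 3})
result = setop_result_types(s, frozenset())
print(result)  # ('frozenset', 'frozenset', True)
```

Calling the same helper with a pandas index and an empty index of the same kind (for example `ci` and `ci.drop(ci)` from the code sample) would put the two result types side by side, which is where the mismatch reported here shows up.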

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Dr-Irv commented 6 years ago

I'm willing to work on this, but can we first discuss the implementation? The solution suggested in the discussion in #19849 is to use self._shallow_copy([]), but that method doesn't work correctly for empty indexes. I think it would be easier to add a method that creates an empty index while preserving the other properties of the index (e.g., categories for CategoricalIndex, range step for RangeIndex, freq for PeriodIndex).

Alternatively, I can make self._shallow_copy([]) work for the various Index subclasses with an empty list argument.
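To make the first option concrete, here is a minimal sketch of the idea using toy classes. Nothing here is pandas code: `ToyIndex`, `ToyCategoricalIndex`, `attrs`, and `_empty_like` are hypothetical names chosen for illustration, and the real Index subclasses store their metadata differently.

```python
class ToyIndex:
    """Hypothetical stand-in for an Index subclass; not pandas code."""

    def __init__(self, values, **attrs):
        self.values = list(values)
        self.attrs = attrs  # stands in for categories, freq, step, etc.

    def _empty_like(self):
        # Same class and same attributes, but no values.
        return type(self)([], **self.attrs)

    def difference(self, other):
        exclude = set(other.values)
        kept = [v for v in self.values if v not in exclude]
        if not kept:
            # Corner case from this issue: keep the subclass and its attributes.
            return self._empty_like()
        return type(self)(kept, **self.attrs)


class ToyCategoricalIndex(ToyIndex):
    pass


ci = ToyCategoricalIndex(["a", "b", "c"], categories=["a", "b", "c"])
empty = ci.difference(ci)
print(type(empty).__name__, empty.values, empty.attrs)
```

The point of the sketch is that the empty result is built through `type(self)` with the original attributes, so the subclass and its metadata survive even when no values remain.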

gfyoung commented 6 years ago

@Dr-Irv : That seems like a good first attempt to patch this, though other options are welcome of course.

Dr-Irv commented 6 years ago

@gfyoung By "That seems", do you mean having a method to create an empty index, or fixing _shallow_copy([])?

gfyoung commented 6 years ago

Oh, sorry! I was referring to fixing _shallow_copy([]).