Behaviour of Categorical inputs to sparse data structures

jnothman commented 6 years ago

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> c = pd.Categorical(list('abcabc'))
>>> c
[a, b, c, a, b, c]
Categories (3, object): [a, b, c]
>>> pd.Series(c).dtype
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
>>> pd.Series(c).to_sparse().dtype
dtype('O')
>>> pd.SparseArray(c)
[a, b, c, a, b, c]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3, 4, 5], dtype=int32)

>>> pd.SparseArray(c).dtype
dtype('O')
>>> pd.SparseSeries(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/joel/anaconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/sparse/series.py", line 175, in __init__
    length = len(index)
TypeError: object of type 'NoneType' has no len()
>>> pd.DataFrame({'a': c})['a']
0    a
1    b
2    c
3    a
4    b
5    c
Name: a, dtype: category
Categories (3, object): [a, b, c]
>>> pd.SparseDataFrame({'a': c})['a']
0    a
1    b
2    c
3    a
4    b
5    c
Name: a, dtype: object
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([6], dtype=int32)

Problem description

Categoricals are upcast to object dtype when put into SparseArray and SparseDataFrame (or when calling Series.to_sparse()). This is inconsistent with the categorical dtype retained by dense Series and DataFrame.
SparseSeries raises an error when constructed with a categorical argument. This is inconsistent with the SparseArray and SparseDataFrame behaviour.

Expected Output

SparseDataFrame({'a': c})['a'].dtype == SparseSeries(c).dtype == SparseArray(c).dtype == Series(c).dtype

or at a minimum:

SparseSeries(c) raises no error, and produces object dtype.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: en_AU.UTF-8
pandas: 0+unknown
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

jreback commented 6 years ago

Why would you actually want to do this? Using object dtypes in sparse has very little utility and is barely supported.

jnothman commented 6 years ago

Okay, so make it invalid. I'll admit I don't have a use case for it. It just came up when looking into the implementation of unstack and how to make that sparse. I still think the SparseSeries error is inappropriate.

jreback commented 6 years ago

would appreciate a PR

cr458 commented 6 years ago

@jnothman I am happy to have a look at this if it is needed. I am having a little trouble understanding what you mean by 'make it invalid', though.

LEO-E-100 commented 6 years ago

I am currently working on this issue and think the solution is to add a check to the SparseSeries constructor that tests whether the data is a CategoricalDtype:

elif isinstance(data, CategoricalDtype):
    if dtype is not None:
        data = data.astype(dtype)
    if index is None:
        index = data.index.view()
    else:
        data = data.reindex(index, copy=False)

However, isinstance(c, CategoricalDtype) returns False even though c.dtype returns a CategoricalDtype. I suspect I am missing something important here, but I cannot find how to make this check return True for a Categorical.

For reference, this elif block would be added at around line 174 of pandas/core/sparse/series.py.

TomAugspurger commented 6 years ago

However the boolean isinstance(c, CategoricalDtype) returns false even if c.dtype returns CategoricalDtype.

CategoricalDtype is the class of array.dtype for a categorical array. You could use

from pandas.api.types import is_categorical_dtype

if is_categorical_dtype(data):
    ...
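
To make the distinction concrete, here is a minimal sketch (not the eventual fix) showing why the isinstance check on the array fails while a check on its dtype succeeds. It assumes a pandas version that exposes CategoricalDtype in pandas.api.types; the variable names are only illustrative. As far as I know, newer pandas releases deprecate is_categorical_dtype in favour of the isinstance check on the dtype shown here.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

c = pd.Categorical(list("abcabc"))

# The array is a Categorical; CategoricalDtype is the class of its .dtype,
# so testing the array itself against the dtype class is always False.
array_is_dtype = isinstance(c, CategoricalDtype)

# Testing the dtype attribute is the check that actually succeeds.
dtype_is_categorical = isinstance(c.dtype, CategoricalDtype)

# The same check works for a dense Series wrapping the Categorical.
series_is_categorical = isinstance(pd.Series(c).dtype, CategoricalDtype)

print(array_is_dtype, dtype_is_categorical, series_is_categorical)
```

The dtype-based check also has the advantage of working for any container that exposes a .dtype, not just a bare Categorical.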

OmerJog commented 5 years ago

@LEO-E-100 are you still working on this issue?

TomAugspurger commented 5 years ago

We can re-purpose this issue to be for allowing SparseArray[ExtensionDtype, fill_value]. It's not exactly straightforward though.
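
For readers unfamiliar with the [subtype, fill_value] notation, the sketch below shows how a sparse dtype is parameterised. It assumes pandas 0.24 or later, where pd.SparseDtype and pd.arrays.SparseArray are available, and it deliberately uses a plain float64 subtype rather than an extension dtype, since extension subtypes are exactly what this issue says is not yet supported.

```python
import numpy as np
import pandas as pd

# A sparse dtype is described by a subtype and the value that is not stored.
sparse_dtype = pd.SparseDtype(np.float64, fill_value=0.0)

arr = pd.arrays.SparseArray([0.0, 0.0, 1.5, 0.0], dtype=sparse_dtype)

# Both parameters are recoverable from the array's dtype.
subtype = arr.dtype.subtype
fill_value = arr.dtype.fill_value

# Only entries that differ from the fill value are physically stored.
stored = arr.sp_values.tolist()

print(subtype, fill_value, stored)
```

Supporting a categorical subtype would mean the first parameter is an extension dtype rather than a NumPy dtype, which is the part described above as not straightforward.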
