pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Setting with enlargement on categorical data #25383

Open 0phoff opened 5 years ago

0phoff commented 5 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame.from_dict({'reg': [0,1,2], 'cat':pd.Categorical(['a','b','b'], categories=['a','b','c','d'])})
print(df.dtypes)  # reg is int64, cat is categorical

df.loc[3] = (3, 'c')  # add a row whose categorical value exists in the categories
print(df.dtypes)  # reg is int64, cat is **object**

Problem description

The dtype changes silently, without any warning. In this dummy example that means we lose the information that 'd' is also a possible value, so simply doing astype('category') afterwards would not restore the original dtype.
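For reference, a short standalone sketch (same data as the sample above) showing why re-casting is not enough: the inferred categories only cover the values that are actually present.

import pandas as pd

df = pd.DataFrame({'reg': [0, 1, 2],
                   'cat': pd.Categorical(['a', 'b', 'b'], categories=['a', 'b', 'c', 'd'])})
df.loc[3] = (3, 'c')                      # the 'cat' column silently becomes object
recovered = df['cat'].astype('category')  # categories are re-inferred from the data
print(recovered.cat.categories)           # Index(['a', 'b', 'c'], dtype='object'), 'd' is gone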


I couldn't seem to find an existing issue about this. I did find a few related things, like concat and append on categoricals also changing dtypes. I would love for those functions to have a keyword to control that behaviour (e.g. perform a union of the categories), but that is a different issue that has already been discussed... (just letting you know that there are people out there who would love this feature, instead of having to meddle with pandas.api.types.union_categoricals)
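For context, a minimal sketch of the union_categoricals workaround mentioned above (the variable names are illustrative, not taken from any real code):

import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(['a', 'b'], categories=['a', 'b'])
b = pd.Categorical(['c', 'c'], categories=['c'])

# Build the union of the category sets by hand; a plain concat/append of these
# two would fall back to object dtype because the categories differ.
combined = union_categoricals([a, b])
print(combined.categories)  # Index(['a', 'b', 'c'], dtype='object')
print(combined.dtype)       # category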

Expected Output

Keep the categorical dtype if the added value is in the list of categories, and throw an error/warning otherwise.
If people don't care about the categorical dtype, they can always call .astype('object') before adding the row?

I think this solution is also in the spirit of 'explicit is better than implicit'?
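In the meantime, a possible workaround sketch (not the behaviour requested above) is to remember the original CategoricalDtype and re-apply it after the enlargement:

import pandas as pd

df = pd.DataFrame({'reg': [0, 1, 2],
                   'cat': pd.Categorical(['a', 'b', 'b'], categories=['a', 'b', 'c', 'd'])})

cat_dtype = df['cat'].dtype              # CategoricalDtype with categories ['a', 'b', 'c', 'd']
df.loc[3] = (3, 'c')                     # the column silently becomes object
df['cat'] = df['cat'].astype(cat_dtype)  # cast back with the full category set, 'd' included
print(df.dtypes)                         # reg is int64, cat is category again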

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-33-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0
pytest: 4.1.1
pip: 18.1
setuptools: 40.2.0
Cython: 0.29.2
numpy: 1.16.0
scipy: 1.1.0
pyarrow: 0.12.0
xarray: None
IPython: 6.5.0
sphinx: 1.7.9
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: 0.4.0
matplotlib: 2.2.3
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.5
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
WillAyd commented 5 years ago

Not sure I see the issue here - from the code posted it looks like you are trying to mix tuples with categorical data which should be an object.

Do you mean to be using the add_categories method:

http://pandas.pydata.org/pandas-docs/stable//user_guide/categorical.html#appending-new-categories
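For reference, a minimal sketch of what add_categories does: it extends the set of allowed values on an existing categorical column without adding any rows.

import pandas as pd

s = pd.Series(pd.Categorical(['a', 'b', 'b'], categories=['a', 'b']))
s = s.cat.add_categories(['c', 'd'])  # more allowed values, same three rows
print(s.cat.categories)               # Index(['a', 'b', 'c', 'd'], dtype='object')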

jreback commented 5 years ago

this seems likely the same issue as you mentioned above; append and concat are used in indexing expansion

the core issue should be addressed before this

note that indexing expansion is pretty inefficient and might be removed in the future; better to explicitly append (which is also inefficient if doing it many times but it’s more obvious what is happening)
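A minimal sketch of what an explicit append could look like for the example in this issue, assuming pd.concat with matching categories on both sides (concat preserves the categorical dtype only when all inputs share identical categories):

import pandas as pd

df = pd.DataFrame({'reg': [0, 1, 2],
                   'cat': pd.Categorical(['a', 'b', 'b'], categories=['a', 'b', 'c', 'd'])})

# Build the new row with the same categories as the existing column, then concat.
new_row = pd.DataFrame({'reg': [3],
                        'cat': pd.Categorical(['c'], categories=['a', 'b', 'c', 'd'])},
                       index=[3])
df = pd.concat([df, new_row])
print(df.dtypes)  # reg is int64, cat stays category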

0phoff commented 5 years ago

Not sure I see the issue here - from the code posted it looks like you are trying to mix tuples with categorical data which should be an object.

You can use .loc[non-existing index] = ('colval1', 'colval2', ...) to set a new row, which is what I'm doing. I'm not sure whether you can wrap such a value in a Categorical, but even if you can, it still seems like quite a burden.

add_categories is not what I want. I do not want to add an extra possible category; I want to add an extra row of data to a dataframe that uses one or more categorical columns.


this seems likely the same issue as you mentioned above; append and concat are used in indexing expansion

the core issue should be addressed before this

I don't know enough about the pandas internals, but that seems logical. I think overall support for these kinds of merging operations with categoricals is lacking in pandas.

note that indexing expansion is pretty inefficient and might be removed in the future; better to explicitly append (which is also inefficient if doing it many times but it’s more obvious what is happening)

I thought it was just syntactic sugar on top of append()? Is it that much more compute time, beyond checking whether the index is already in the dataframe?