pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.8k stars 17.98k forks source link

API: union_categoricals in concat #14177

Open jorisvandenbossche opened 8 years ago

jorisvandenbossche commented 8 years ago

Follow up of https://github.com/pydata/pandas/pull/13767 (this part was left out of the final merged version). Summary copied from there:

Discussion: addition of new keyword to concat to union categoricals with different categories

PR #13767 proposed to add a union_categoricals keyword to concat, which lets you concatenate categoricals with different categories (like the union_categoricals function does). Default would be False.

There was some discussion about the name to use. @wesm suggested to use a dict to pass type specific options. But that is maybe something we should first explore further to see if there are other use cases for such a construct?

Dr-Irv commented 6 years ago

So I looked at the discussion in #13767 where @sinhrks had implemented the union_categoricals=True option to pd.concat(), and we need this option in our project.

Here's the use case. I have all the data sitting in a MySQL database. There are columns that would be better handled as categorical columns. We have a complex SQL join creating a simple 5 column table of 33 million rows. 3 of the columns should be categorical. Reading the table from CSV, pandas takes 1.5GB, before converting the 3 columns to categories. Reading the table using standard MySQL connectors and SQL alchemy takes 14.5GB. There is an option to SQL Alchemy that causes the server to chunk the data, and that reduces the memory footprint to 6.5GB. If I could then use the chunksize argument on pd.read_sql(), where I convert the 3 columns of each chunk to category dtype, and pass the resulting generator from read_sql() to pd.concat() with union_categoricals=True, then I could let the categories be dynamically defined as the data is read in from the SQL database.

I'm willing to look at what @sinhrks did and make a PR so we can get the memory usage down, or maybe we can just have him put that functionality in a PR to support this use case.

ivirshup commented 4 years ago

I have related use cases for this, or a broader "the return type should still be categorical". First I'll layout the use case, then give the specific pandas functions where this could be used.

The analysis problem: During data analysis (of single cell RNA-seq data), I generate multiple unsupervised clusterings for my dataset. I use these as a starting point for figuring out what categories of samples exist in my dataset. I'll want to manually curate these labelling based on downstream analysis and feedback from domain experts.

The pandas functions: pd.Series.map is useful here. If I want to merge two clusters I can do:

new_clusters = clusters.map(lambda x: {"cluster_a": "cluster_b"}.get(x, x))

But now I have an object series, since that's the behavior if the number of categories changes. I'd like to have a keyword argument that specified I still want categorical outputs.

Another case is pd.Series.where. I may want to replace labels for some observations with clusters from another.

s = pd.Series(list("aabcdd"), dtype="category") 
t = pd.Series(list("effggh"), dtype="category") 
s.where(lambda x: np.isin(x, ("a", "b")), t)

This currently returns an object series and throws a FutureWarning that it will be a value error unless I explicitly construct a new set of categories and cast to those. I'd rather that I could just specify that I want a categorical (either union or using the resulting set of categories) in the call. Basically, this feels verbose for something I do interactively:

# From this:
s.where(lambda x: np.isin(x, ("a", "b")), t)
# To:
new_cats = s.cat.categories.union(t.cat.categories)
s.cat.set_categories(new_cats).where(lambda x: x.isin(x, ("a", "b")), t.cat.set_categories(new_cats))

Edit: Oof. This comment was so bad it fails CI 😳: https://github.com/pandas-dev/pandas/runs/313172123 Raw logs

jbrockmendel commented 4 years ago

@jorisvandenbossche can the idea here be updated in a way consistent with the "dont special-case pandas-internal EAs" goal?

jorisvandenbossche commented 4 years ago

This is categorical-specific, AFAIK

ivirshup commented 4 years ago

Wouldn't the idea of adding a dict valued argument for type specific options make this more general (as mentioned at the top)? E.g.

pd.concat(
    {...},
    dtype_kwargs={pd.Categorical: {"union": True}}
)
jbrockmendel commented 1 year ago

Discussed this on the dev call today, agreed to add a keyword to pd.concat to opt in to this behavior.