Open jorisvandenbossche opened 8 years ago
So I looked at the discussion in #13767 where @sinhrks had implemented the union_categoricals=True
option to pd.concat()
, and we need this option in our project.
Here's the use case. I have all the data sitting in a MySQL database. There are columns that would be better handled as categorical columns. We have a complex SQL join creating a simple 5 column table of 33 million rows. 3 of the columns should be categorical. Reading the table from CSV, pandas takes 1.5GB, before converting the 3 columns to categories. Reading the table using standard MySQL connectors and SQL alchemy takes 14.5GB. There is an option to SQL Alchemy that causes the server to chunk the data, and that reduces the memory footprint to 6.5GB. If I could then use the chunksize
argument on pd.read_sql()
, where I convert the 3 columns of each chunk to category dtype, and pass the resulting generator from read_sql()
to pd.concat()
with union_categoricals=True
, then I could let the categories be dynamically defined as the data is read in from the SQL database.
I'm willing to look at what @sinhrks did and make a PR so we can get the memory usage down, or maybe we can just have him put that functionality in a PR to support this use case.
I have related use cases for this, or a broader "the return type should still be categorical". First I'll layout the use case, then give the specific pandas functions where this could be used.
The analysis problem: During data analysis (of single cell RNA-seq data), I generate multiple unsupervised clusterings for my dataset. I use these as a starting point for figuring out what categories of samples exist in my dataset. I'll want to manually curate these labelling based on downstream analysis and feedback from domain experts.
The pandas functions: pd.Series.map
is useful here. If I want to merge two clusters I can do:
new_clusters = clusters.map(lambda x: {"cluster_a": "cluster_b"}.get(x, x))
But now I have an object series, since that's the behavior if the number of categories changes. I'd like to have a keyword argument that specified I still want categorical outputs.
Another case is pd.Series.where
. I may want to replace labels for some observations with clusters from another.
s = pd.Series(list("aabcdd"), dtype="category")
t = pd.Series(list("effggh"), dtype="category")
s.where(lambda x: np.isin(x, ("a", "b")), t)
This currently returns an object series and throws a FutureWarning
that it will be a value error unless I explicitly construct a new set of categories and cast to those. I'd rather that I could just specify that I want a categorical (either union or using the resulting set of categories) in the call. Basically, this feels verbose for something I do interactively:
# From this:
s.where(lambda x: np.isin(x, ("a", "b")), t)
# To:
new_cats = s.cat.categories.union(t.cat.categories)
s.cat.set_categories(new_cats).where(lambda x: x.isin(x, ("a", "b")), t.cat.set_categories(new_cats))
Edit: Oof. This comment was so bad it fails CI 😳: https://github.com/pandas-dev/pandas/runs/313172123 Raw logs
@jorisvandenbossche can the idea here be updated in a way consistent with the "dont special-case pandas-internal EAs" goal?
This is categorical-specific, AFAIK
Wouldn't the idea of adding a dict valued argument for type specific options make this more general (as mentioned at the top)? E.g.
pd.concat(
{...},
dtype_kwargs={pd.Categorical: {"union": True}}
)
Discussed this on the dev call today, agreed to add a keyword to pd.concat to opt in to this behavior.
Follow up of https://github.com/pydata/pandas/pull/13767 (this part was left out of the final merged version). Summary copied from there:
Discussion: addition of new keyword to
concat
to union categoricals with different categoriesPR #13767 proposed to add a
union_categoricals
keyword toconcat
, which lets you concatenate categoricals with different categories (like theunion_categoricals
function does). Default would be False.There was some discussion about the name to use. @wesm suggested to use a dict to pass type specific options. But that is maybe something we should first explore further to see if there are other use cases for such a construct?