scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Option to retain categories in columns when subsetting anndata object. #890

Closed (gregjohnso closed this 9 months ago)

gregjohnso commented 1 year ago

I'm trying to figure out how to retain categorical values when subsetting an anndata object.

The issue I'm having is as follows. When I create an anndata object, I can assign a categorical variable to a column in obs:

import numpy as np
import anndata as ad

n_data = 100
n_categories = 3

x = np.random.normal(size=(n_data, 2))
categories = np.random.choice(n_categories, size=n_data)

# Randomly assign categories ["c0", "c1", "c2"]
categories = np.array([f"c{c}" for c in categories])

adata = ad.AnnData(x, obs={"labels": categories})
# Make the column categorical
adata.obs["labels"] = adata.obs["labels"].astype("category")

# Observe this is a categorical column
print(adata.obs['labels'].cat.categories)

output:

Index(['c0', 'c1', 'c2'], dtype='object')

If we subset the DATAFRAME to label "c0", we see that it retains all categories:

# Subsetting the DATAFRAME to "c0" retains all categories
c0_inds = adata.obs["labels"] == "c0"

df_obs = adata.obs[c0_inds]
print(df_obs['labels'].cat.categories)

output:

Index(['c0', 'c1', 'c2'], dtype='object')

But if we subset the ANNDATA to "c0", we see that it drops the unused categories and keeps only "c0":

# Subsetting the ANNDATA to "c0" drops the unused categories
adata_c0 = adata[c0_inds]
print(adata_c0.obs['labels'].cat.categories)

output:

Index(['c0'], dtype='object')

Is this expected? My intuition is that we should not expect different behavior between the anndata and dataframe objects. Is there a way to change this behavior easily?
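Until the behavior changes, one possible workaround is to re-apply the full category list after subsetting. The sketch below is pandas-only. `restore_categories` is a hypothetical helper (not part of anndata or pandas), and the AnnData subsetting step is imitated with `remove_unused_categories()`, on the assumption that it has the same effect on the column as the output above.

```python
import pandas as pd


def restore_categories(sub_obs: pd.DataFrame, full_obs: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of sub_obs whose categorical columns use full_obs's categories.

    Hypothetical helper for illustration; not part of anndata or pandas.
    """
    out = sub_obs.copy()
    for col in out.columns:
        if col in full_obs.columns and isinstance(full_obs[col].dtype, pd.CategoricalDtype):
            # Casting to the original dtype re-adds the unused categories.
            out[col] = out[col].astype(full_obs[col].dtype)
    return out


full_obs = pd.DataFrame({"labels": pd.Categorical(["c0", "c1", "c2", "c0"])})

# Imitate what the AnnData subset does to the column: keep the "c0" rows,
# then drop the categories that no longer occur.
sub_obs = full_obs[full_obs["labels"] == "c0"].copy()
sub_obs["labels"] = sub_obs["labels"].cat.remove_unused_categories()
print(list(sub_obs["labels"].cat.categories))   # ['c0']

restored = restore_categories(sub_obs, full_obs)
print(list(restored["labels"].cat.categories))  # ['c0', 'c1', 'c2']
```

With a real AnnData object, this would presumably be applied to a copy rather than a view (for example `adata_c0 = adata[c0_inds].copy()`, then passing `adata_c0.obs` and `adata.obs` to the helper). Any further subsetting would drop the unused categories again.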

gregjohnso commented 1 year ago

I think this is the culprit: https://github.com/scverse/anndata/blob/master/anndata/_core/anndata.py#L345

ivirshup commented 1 year ago

Thanks for opening this issue! I've been meaning to open this myself.

We remove unused categories for largely historical reasons. We were early adopters of pandas.Categoricals, and back in the day having unused categories would cause many downstream operations to error.

But now I would like to change this, since it's often useful to know that you have a subset of possible annotations – especially when you're working with multiple subsets of a larger dataset.

Minimizing downstream impact

Downstream packages would still likely run into errors if this changed suddenly. Maybe this could start as an opt-in behavior to give packages some time to adjust?

rachelnhovde commented 1 year ago

Fixing this issue is really important to my work as well; any way this can be prioritized?

flying-sheep commented 1 year ago

To add the context from the duplicated issue here:

> Downstream packages would still likely run into errors if this changed suddenly. Maybe this could start as an opt-in behavior to give packages some time to adjust?

what kind of problem do you expect?

like looping over the categories and expecting them all to be present in the vector?

> Maybe this could start as an opt-in behavior to give packages some time to adjust?

What would the API look like? The current code lives in _init_as_view, and people run into the behavior when slicing (which doesn’t accept keyword arguments).

I don’t think opt-in is easily possible or would be discovered by many.

ivirshup commented 1 year ago

Coming back to this

@flying-sheep

> like looping over the categories and expecting them all to be present in the vector?

Yes, which I know will cause problems in scanpy because it used to. I believe that's why we even do this now. This is downstream of pandas functions which give unexpected results, like groupby:

import pandas as pd
df = pd.DataFrame({"cat": pd.Categorical.from_codes([0, 2, 3, 2], list("abcd"))})
df.groupby("cat").groups
<ipython-input-1-b4836b14113b>:3: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df.groupby("cat").groups
Out[1]: {'a': [0], 'b': [], 'c': [1, 3], 'd': [2]}
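The `observed` parameter named in the warning controls exactly this. As a small sketch on the same data frame, `observed=False` produces a group for every declared category (including the unused "b"), while `observed=True` only groups categories that actually occur. Group sizes are used here instead of `.groups` purely to keep the output short.

```python
import pandas as pd

df = pd.DataFrame({"cat": pd.Categorical.from_codes([0, 2, 3, 2], list("abcd"))})

# observed=False: every declared category gets a group, including the unused "b".
all_sizes = df.groupby("cat", observed=False).size()
print(all_sizes.to_dict())   # {'a': 1, 'b': 0, 'c': 2, 'd': 1}

# observed=True: only categories that actually occur in the data are grouped.
seen_sizes = df.groupby("cat", observed=True).size()
print(seen_sizes.to_dict())  # {'a': 1, 'c': 2, 'd': 1}
```

Downstream code that loops over `cat.categories` and indexes into a grouped result would hit the empty "b" group in the first case, which is the kind of breakage described above.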

> What would the API look like?

I think it's configuration: anndata.config["keep_unused_categories"], an environment variable, or a context manager.

> I don’t think opt-in is easily possible or would be discovered by many.

I think there would be an opt-in period followed by an opt-out period. This at least lets us test packages with the new setting, so that things aren't broken once it becomes opt-out.

ilan-gold commented 11 months ago

Which of (a new) config object, environment variable, or context manager should we aim for? I would think either config or using uns would make sense. But if it's config, we should probably split the work into this issue and a separate one for creating AnnData.config.

ilan-gold commented 11 months ago

Environment variable that is held within the package (and editable), separately. Look at how other packages handle this (maybe pydantic?).
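One way to read "held within the package (and editable)" is a settings object that takes its default from the environment once and can then be changed at runtime. The following is a stdlib-only sketch under that reading. The variable name `ANNDATA_KEEP_UNUSED_CATEGORIES` and the `Settings` class are hypothetical, and this is not a description of how pydantic or anndata handle settings.

```python
import os

ENV_VAR = "ANNDATA_KEEP_UNUSED_CATEGORIES"  # hypothetical name


def _parse_bool(raw: str) -> bool:
    """Interpret common truthy spellings; anything else counts as False."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


class Settings:
    """Package-level settings: default read from the environment once, then editable."""

    def __init__(self, environ=None):
        environ = os.environ if environ is None else environ
        self.keep_unused_categories = _parse_bool(environ.get(ENV_VAR, "0"))


settings = Settings()  # would live inside the package

# Demo with explicit mappings, so the real environment is not touched.
from_env = Settings({ENV_VAR: "true"})
print(from_env.keep_unused_categories)  # True

unset = Settings({})
print(unset.keep_unused_categories)     # False
unset.keep_unused_categories = True     # edited at runtime
print(unset.keep_unused_categories)     # True
```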