Closed gregjohnso closed 9 months ago
I think this is the culprit: https://github.com/scverse/anndata/blob/master/anndata/_core/anndata.py#L345
Thanks for opening this issue! I've been meaning to open this myself.
We remove unused categories for largely historical reasons. We were early adopters of pandas.Categorical, and back in the day having unused categories would cause many downstream operations to error.
But now I would like to change this, since it's often useful to know that you have a subset of possible annotations – especially when you're working with multiple subsets of a larger dataset.
Downstream packages would still likely run into errors if this changed suddenly. Maybe this could start as an opt-in behavior to give packages some time to adjust?
Fixing this issue is really important to my work as well; any way this can be prioritized?
To add the context from the duplicated issue here:
> Downstream packages would still likely run into errors if this changed suddenly. Maybe this could start as an opt-in behavior to give packages some time to adjust?

What kind of problem do you expect? Like looping over the categories and expecting them all to be present in the vector?
> Maybe this could start as an opt-in behavior to give packages some time to adjust?

What would the API look like? The current code lives in _init_as_view, and people run into the behavior when slicing (which doesn’t accept keyword arguments).
I don’t think opt-in is easily possible or would be discovered by many.
Coming back to this
@flying-sheep
> Like looping over the categories and expecting them all to be present in the vector?

Yes, which I know will cause problems in scanpy because it used to. I believe that's why we even do this now. This is downstream of pandas functions which give unexpected results, like groupby:
```python
import pandas as pd

df = pd.DataFrame({"cat": pd.Categorical.from_codes([0, 2, 3, 2], list("abcd"))})
df.groupby("cat").groups
```

```
<ipython-input-1-b4836b14113b>:3: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df.groupby("cat").groups
Out[1]: {'a': [0], 'b': [], 'c': [1, 3], 'd': [2]}
```
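To make the difference concrete, here is a minimal sketch (assuming a pandas version that accepts the `observed` keyword) of how the two settings treat the unused category "b". Passing `observed` explicitly also avoids the warning shown above.

```python
import pandas as pd

# Same frame as above: category "b" is declared but never used.
df = pd.DataFrame({"cat": pd.Categorical.from_codes([0, 2, 3, 2], list("abcd"))})

# observed=False reports every declared category, including the empty "b".
sizes_all = df.groupby("cat", observed=False).size()
print(list(sizes_all.index), sizes_all.tolist())  # ['a', 'b', 'c', 'd'] [1, 0, 2, 1]

# observed=True reports only the categories that actually occur in the data.
sizes_observed = df.groupby("cat", observed=True).size()
print(list(sizes_observed.index), sizes_observed.tolist())  # ['a', 'c', 'd'] [1, 2, 1]
```

Downstream code that loops over the declared categories and assumes each group is non-empty would be affected by the first form but not the second.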
> What would the API look like?

I think it's configuration: anndata.config["keep_unused_categories"] / environment variable / context manager.
> I don’t think opt-in is easily possible or would be discovered by many.

I think there should be an opt-in period followed by an opt-out period. This at least lets us test packages with the new setting, so once it becomes opt-out things aren't broken.
Which of (a new) config object, environment variable, or context manager should we aim for? I would think either config or using uns would make sense. But if it's config, we should probably split the work into this issue and a separate one creating AnnData.config.
Environment variable that is held within the package (and editable) separately. Look at how other packages handle this (maybe pydantic?).
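The three options discussed here (config object, environment variable, context manager) do not have to be exclusive. Below is a rough, stdlib-only sketch of how they could be combined. Everything in it is hypothetical and only for illustration: the `Settings` class, the `settings` instance, the `override` method, and the `ANNDATA_KEEP_UNUSED_CATEGORIES` variable name are not part of anndata.

```python
import os
from contextlib import contextmanager


class Settings:
    """Hypothetical settings holder; not anndata's actual API."""

    def __init__(self):
        # Hypothetical environment variable, read once when the object is created.
        env = os.environ.get("ANNDATA_KEEP_UNUSED_CATEGORIES", "0")
        # Opt-in phase: the option stays off unless the variable enables it.
        self.keep_unused_categories = env.lower() in {"1", "true", "yes"}

    @contextmanager
    def override(self, **options):
        """Temporarily change options, restoring the old values on exit."""
        # getattr raises AttributeError for unknown option names.
        previous = {name: getattr(self, name) for name in options}
        try:
            for name, value in options.items():
                setattr(self, name, value)
            yield self
        finally:
            for name, value in previous.items():
                setattr(self, name, value)


settings = Settings()

# Usage: opt in for a single block of code, e.g. in a downstream test suite.
with settings.override(keep_unused_categories=True):
    print(settings.keep_unused_categories)  # True
```

With a design like this, moving from opt-in to opt-out later would only mean changing the default in one place, while packages could already test against the new behavior through the environment variable or the context manager.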
I'm trying to figure out how to retain categorical values when subsetting an anndata object.
The issue I'm having is as follows: when I create an anndata object, I can assign a categorical variable to a column in obs:
output:
If we subset the DATAFRAME to label "c0", we see that it retains all categories
output:
But if we subset the ANNDATA to "c0", we see that it loses all the other categories
output:
Is this expected? My intuition is that we should not expect different behavior between the anndata and dataframe objects. Is there a way to change this behavior easily?
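The original reproduction code is not shown here, so the following is only a pandas-level sketch of the behavior described, with illustrative labels ("c0", "c1", "c2"). It assumes that what happens on the AnnData side is equivalent to calling `remove_unused_categories()` on the column, as suggested earlier in this thread, and it shows one possible workaround: putting the full category list back after subsetting.

```python
import pandas as pd

# Illustrative labels; stand-in for a categorical column in obs.
obs = pd.DataFrame({"label": pd.Categorical(["c0", "c1", "c2", "c0"])})
all_categories = obs["label"].cat.categories

# Subsetting a DataFrame keeps every category, used or not.
subset = obs[obs["label"] == "c0"]
print(list(subset["label"].cat.categories))  # ['c0', 'c1', 'c2']

# Dropping unused categories gives the result reported for the AnnData subset.
pruned = subset["label"].cat.remove_unused_categories()
print(list(pruned.cat.categories))  # ['c0']

# Possible workaround: restore the full category list after subsetting.
restored = pruned.cat.set_categories(all_categories)
print(list(restored.cat.categories))  # ['c0', 'c1', 'c2']
```

Since obs is a pandas DataFrame, the same `set_categories` call should presumably work on a column of a subsetted AnnData copy, though that is not verified here.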