scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License
578 stars 154 forks source link

DataFrameView in pandas 2.1.2 changed behavior, breaking AnnData in several ways #1210

Closed LucaMarconato closed 1 year ago

LucaMarconato commented 1 year ago

Please make sure these conditions are met

Report

I am getting an infinite recursion due most likely to this pandas bug: https://github.com/pandas-dev/pandas/issues/52927. The bug, which appeared a few days ago (probably the latest pandas release), is triggered by a code that I use to populate old categories that gets dropped after data subsetting (to have a workaround on this: https://github.com/scverse/anndata/issues/890).

The latest main from pandas is supposed to fix the problem, but it looks like it doesn't. Maybe I should report the bug to pandas, I will try to reproduce it via pandas code only now.

Code:

##
from anndata import AnnData
import pandas as pd

def _inplace_fix_subset_categorical_obs(subset_adata: AnnData, original_adata: AnnData) -> None:
    """
    Fix categorical obs columns of subset_adata to match the categories of original_adata.

    Parameters
    ----------
    subset_adata
        The subset AnnData object
    original_adata
        The original AnnData object

    Notes
    -----
    See discussion here: https://github.com/scverse/anndata/issues/997
    """
    obs = subset_adata.obs
    for column in obs.columns:
        is_categorical = pd.api.types.is_categorical_dtype(obs[column])
        if is_categorical:
            c = obs[column].cat.set_categories(original_adata.obs[column].cat.categories)
            obs[column] = c

anndata = AnnData(obs=pd.DataFrame({"a": pd.Categorical(["a", "b", "c"])}))
subset = anndata[anndata.obs["a"] == "a"]
assert subset.obs['a'].cat.categories.tolist() == ["a"]
_inplace_fix_subset_categorical_obs(subset, anndata)

Traceback:

_inplace_fix_subset_categorical_obs(subset, anndata)
<ipython-input-2-91050d1e96e1>:21: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, pd.CategoricalDtype) instead
  is_categorical = pd.api.types.is_categorical_dtype(obs[column])
<ipython-input-2-91050d1e96e1>:24: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  obs[column] = c
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
...
...
...
...
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
Traceback (most recent call last):
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3526, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-7-191c4157c214>", line 1, in <module>
    _inplace_fix_subset_categorical_obs(subset, anndata)
  File "<ipython-input-2-91050d1e96e1>", line 24, in _inplace_fix_subset_categorical_obs
    obs[column] = c
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py", line 79, in __setitem__
    container[idx] = value
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py", line 79, in __setitem__
    container[idx] = value
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py", line 79, in __setitem__
    container[idx] = value
  [Previous line repeated 2961 more times]
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py", line 78, in __setitem__
    with view_update(*self._view_args) as container:
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py", line 52, in view_update
    new = adata_view.copy()
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/anndata.py", line 1586, in copy
    return self._mutated_copy()
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/anndata.py", line 1527, in _mutated_copy
    return AnnData(**new)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/anndata.py", line 362, in __init__
    self._init_as_actual(
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/anndata.py", line 548, in _init_as_actual
    self._var = _gen_dataframe(
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/functools.py", line 889, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/anndata.py", line 186, in _gen_dataframe_df
    anno.columns = anno.columns.astype(str)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 1073, in astype
    result = Index(new_values, name=self.name, dtype=new_values.dtype, copy=False)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 482, in __new__
    name = maybe_extract_name(name, data, cls)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 7579, in maybe_extract_name
    if name is None and isinstance(obj, (Index, ABCSeries)):
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/pandas/core/dtypes/generic.py", line 44, in _instancecheck
    return _check(inst) and not isinstance(inst, type)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/pandas/core/dtypes/generic.py", line 38, in _check
    return getattr(inst, attr, "_typ") in comp
RecursionError: maximum recursion depth exceeded while calling a Python object

Versions

-----
anndata             0.10.2
session_info        1.0.0
-----
asciitree           NA
cloudpickle         3.0.0
cython_runtime      NA
cytoolz             0.12.2
dask                2023.10.1
dateutil            2.8.2
exceptiongroup      1.1.3
fasteners           0.17.3
gmpy2               2.1.2
h5py                3.9.0
importlib_metadata  NA
jinja2              3.1.2
markupsafe          2.1.3
mpmath              1.3.0
msgpack             1.0.6
natsort             8.4.0
numcodecs           0.12.1
numpy               1.24.4
packaging           23.2
pandas              2.2.0.dev0+471.g984d75543
psutil              5.9.5
pyarrow             13.0.0
pytz                2023.3.post1
scipy               1.11.3
six                 1.16.0
sphinxcontrib       NA
sympy               1.12
tblib               2.0.0
tlz                 0.12.2
toolz               0.12.0
torch               2.1.0
torchgen            NA
tqdm                4.66.1
typing_extensions   NA
yaml                6.0.1
zarr                2.16.1
zipp                NA
zoneinfo            NA
-----
Python 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:09:17) [Clang 16.0.6 ]
macOS-13.4.1-arm64-arm-64bit
-----
Session information updated at 2023-10-28 17:20
LucaMarconato commented 1 year ago

The bug seems to be in anndata, it is triggered by obs[column] = c. The type of obs[column] is a anndata._core.views.DataFrameView, I am trying to get around this, maybe avoiding assigning the column to the view can help fix the problem.

LucaMarconato commented 1 year ago

I managed to fix by changing obs = subset_adata.obs to obs = pd.DataFrame(subset_adata.obs), and adding a subset_adata.obs = obs right before the function end.

LucaMarconato commented 1 year ago

Update: in spatialdata the bug appears also in other places. For instance I have an AnnData object containing a categorical column in obs:

matched_table.obs
Out[14]: 
           region  instance_id  numerical_in_obs
0  values_circles            0          0.321869
1  values_circles            1          0.594300
2  values_circles            2          0.337911
3  values_circles            3          0.391619
4  values_circles            4          0.890274
5  values_circles            5          0.227158
6  values_circles            6          0.623187
7  values_circles            7          0.084015

such that

type(matched_table.obs)
Out[11]: anndata._core.views.DataFrameView

Now, the following operation leads to the same recursion exception:

matched_table[:, value_key_values].obs

but if I delete the categorical column, then the problem disappears.

In my case I'll patch the bug by making an explicit instantiation of matched_table.obs, but I think this bug should be fixed here in AnnData.

LucaMarconato commented 1 year ago

Found another piece of code affected: table = table[table.obs[region_key].isin(coordinate_system)].copy(), I'll patch also this for the time being.

flying-sheep commented 1 year ago

Well, DataFrameView is there for a reason. AnnData’s views are lightweight objects that represent slices of other AnnData objects. They’re copy on write, so setting an attribute on them like you do in your workaround makes them into non-views.

So I’m pretty sure that what you’re actually doing in your workaround isn’t just setting obs to a dataframe, instead it’s the same as doing adata = adata.copy(), only indirectly triggered by setting a field on the AnnData object. I’d recommend replacing your workaround code with that, seeing adata.obs = pd.DataFrame(adata.obs) is a confusing way to say adata = adata.copy() # make view into actual to work around https://github.com/scverse/anndata/issues/1210

A real fix would probably be to change DataFrameView to not trigger recursive behavior. Do you have an idea how that could be done?

flying-sheep commented 1 year ago

I see, the bugfix for https://github.com/pandas-dev/pandas/issues/52927 hasn’t made it into a release yet, so the best workaround is to just set a pandas dependency specifier pandas !=2.1.2.

ivirshup commented 1 year ago

@flying-sheep, using the pandas nightly wheel does not solve this issue, so I don't think a new release of pandas will fix this.

So I think we still either need to do:

A real fix would probably be to change DataFrameView to not trigger recursive behavior. Do you have an idea how that could be done?

Or fix it upstream

ivirshup commented 1 year ago

I also get this behavior for setting any column of a dataframe, not just categorical ones. E.g.:

import anndata as ad, pandas as pd, numpy as np

adata = ad.AnnData(
    obs=pd.DataFrame(
        {"b": [1, 2, 3]},
        index=list("abc")
    )
)
v = adata[[0], :]

v.obs["b"] = 3

Also triggers the recursion error.

As does:

v.obs.drop("b")
ivirshup commented 1 year ago

I think I've found the problem:

import anndata as ad, pandas as pd, numpy as np

adata = ad.AnnData(
    obs=pd.DataFrame(
        {"b": [1, 2, 3]},
        index=list("abc")
    )
)
v = adata[[0], :]

type(v.obs.copy())

in pandas 2.1.2:

anndata._core.views.DataFrameView

in pandas 2.1.1:

pandas.core.frame.DataFrame

It looks like pd.DataFrame._constructor_from_mgr is what changed.

This is a behavior change in a bug fix release of pandas, so possibly is a new pandas bug in and of itself.

It's unclear to me whether https://github.com/pandas-dev/pandas/issues/52927 is relevant to this bug in anndata

ivirshup commented 1 year ago

This looks relevant: https://github.com/pandas-dev/pandas/issues/55120

LucaMarconato commented 1 year ago

Thanks for the info. @flying-sheep I am quite sure I tried also adata = andata.copy() but it didn't work, that's why I am doing the conversion on the obs.

flying-sheep commented 1 year ago

Very weird! If the class is still a view, it should try updating its parent AnnData object.

Well, I hope pandas reverts this and until their next major release we can come up with a good fix.

Adding the following to DataFrameView fixes all AnnData tests, and all scanpy tests except for scanpy/tests/test_pca.py::test_pca_warnings.

And that test failure is unrelated ```pytb def test_pca_warnings(array_type, zero_center, pca_params): svd_solver, expected_warning = pca_params A = array_type(A_list).astype("float32") adata = AnnData(A) if expected_warning is not None: with pytest.warns(UserWarning, match=expected_warning): sc.pp.pca(adata, svd_solver=svd_solver, zero_center=zero_center) else: with warnings.catch_warnings(record=True) as record: sc.pp.pca(adata, svd_solver=svd_solver, zero_center=zero_center) > assert len(record) == 0 E assert 1 == 0 E + where 1 = len([]) scanpy/tests/test_pca.py:118: AssertionError ```
def copy(self, deep: bool = True) -> pd.DataFrame:
    """Create a non-view copy of the DataFrame."""
    return pd.DataFrame(super().copy(deep=deep))

A possible issue is that the tests seemed to run pretty slow, so maybe it breaks some optimization? Maybe I just used too many threads for my poor M1 CPU.

@ivirshup also noted that a .groupby(...) on a view reproduces the error, so we should add a test for that.

flying-sheep commented 1 year ago

1223 shows a different symptom of the same upstream behavior change, so I renamed this issue to catch them all.

dsm-72 commented 1 year ago

I am also noticing something strange with anndata 0.10.3 vs anndata 0.10.1 and pandas 2.0.3

File ~/.conda/envs/organoids/lib/python3.11/site-packages/anndata/compat/__init__.py:400, in _map_cat_to_str(cat)
    397 def _map_cat_to_str(cat: pd.Categorical) -> pd.Categorical:
    398     if _parse_version(pd.__version__) >= _parse_version("2.0"):
    399         # Argument added in pandas 2.0
--> 400         return cat.map(str, na_action="ignore")
    401     else:
    402         return cat.map(str)

TypeError: Categorical.map() got an unexpected keyword argument 'na_action'
flying-sheep commented 1 year ago

@dsm-72 please report that separately, it doesn’t have anything to do with DataFrameView.

ivirshup commented 1 year ago

PR to fix, which is slated for the next pandas bug fix release: https://github.com/pandas-dev/pandas/pull/55764. We can close if if/ when the PR to pandas merges

ivirshup commented 1 year ago

Merged!