Closed LucaMarconato closed 1 year ago
The bug seems to be in anndata; it is triggered by obs[column] = c. The type of obs[column] is anndata._core.views.DataFrameView. I am trying to work around this; avoiding assigning the column to the view might help fix the problem.
I managed to fix it by changing obs = subset_adata.obs to obs = pd.DataFrame(subset_adata.obs) and adding subset_adata.obs = obs right before the end of the function.
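The workaround above can be sketched with pandas alone. This is a minimal sketch under stated assumptions: restore_categories and its arguments are hypothetical names, and a plain DataFrame stands in for subset_adata.obs. The only point is that the column is assigned on an explicit plain copy instead of on the view.

```python
import pandas as pd


def restore_categories(obs, column, categories):
    """Return a plain-DataFrame copy of `obs` whose `column` uses `categories`.

    Hypothetical helper: `obs` stands in for `subset_adata.obs`.
    """
    # copy=True materializes an independent plain pandas.DataFrame, so the
    # assignment below never touches the object that was passed in.
    plain = pd.DataFrame(obs, copy=True)
    plain[column] = plain[column].astype(pd.CategoricalDtype(categories))
    return plain


# Toy data: a categorical column that loses categories after subsetting.
full = pd.DataFrame({"region": pd.Categorical(["a", "b", "c"])}, index=list("xyz"))
subset = full.iloc[[0]].copy()
subset["region"] = subset["region"].cat.remove_unused_categories()

restored = restore_categories(subset, "region", full["region"].cat.categories)
```

In the AnnData setting, the returned frame would then be assigned back with subset_adata.obs = obs, as described above.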
Update: in spatialdata the bug also appears in other places.
For instance, I have an AnnData object containing a categorical column in obs:
matched_table.obs
Out[14]:
           region  instance_id  numerical_in_obs
0  values_circles            0          0.321869
1  values_circles            1          0.594300
2  values_circles            2          0.337911
3  values_circles            3          0.391619
4  values_circles            4          0.890274
5  values_circles            5          0.227158
6  values_circles            6          0.623187
7  values_circles            7          0.084015
such that
type(matched_table.obs)
Out[11]: anndata._core.views.DataFrameView
Now, the following operation leads to the same recursion exception:
matched_table[:, value_key_values].obs
but if I delete the categorical column, then the problem disappears.
In my case I'll patch the bug by explicitly instantiating matched_table.obs, but I think this bug should be fixed here in AnnData.
Found another piece of affected code: table = table[table.obs[region_key].isin(coordinate_system)].copy(). I'll patch this as well for the time being.
Well, DataFrameView is there for a reason. AnnData’s views are lightweight objects that represent slices of other AnnData objects. They’re copy on write, so setting an attribute on them like you do in your workaround makes them into non-views.
So I’m pretty sure that what your workaround actually does isn’t just setting obs to a dataframe; it’s the same as doing adata = adata.copy(), only indirectly triggered by setting a field on the AnnData object. I’d recommend replacing your workaround code with that, since adata.obs = pd.DataFrame(adata.obs) is a confusing way to say
adata = adata.copy()  # make view into actual to work around https://github.com/scverse/anndata/issues/1210
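To illustrate the copy-on-write idea described in this comment, here is a toy model in pure Python. It is only a sketch of the concept: ToyTable and all of its members are hypothetical and do not reflect how anndata actually implements views.

```python
class ToyTable:
    """Toy stand-in for an AnnData-like container (not anndata's real code)."""

    def __init__(self, obs, parent=None, rows=None):
        self._obs = {key: list(values) for key, values in obs.items()}
        self._parent = parent  # set only while this object is a view
        self._rows = rows      # row positions selected from the parent

    @property
    def is_view(self):
        return self._parent is not None

    @property
    def obs(self):
        if self.is_view:
            # A view stores no data of its own; it reads through to the parent.
            return {
                key: [values[i] for i in self._rows]
                for key, values in self._parent.obs.items()
            }
        return self._obs

    @obs.setter
    def obs(self, value):
        # Copy on write: setting an attribute detaches the view from its
        # parent and turns it into an actual, independent object.
        self._obs = {key: list(values) for key, values in value.items()}
        self._parent = None
        self._rows = None

    def __getitem__(self, rows):
        return ToyTable({}, parent=self, rows=list(rows))


table = ToyTable({"b": [1, 2, 3]})
view = table[[0]]
was_view = view.is_view   # the slice starts out as a view
view.obs = view.obs       # any attribute assignment makes it an actual object
view.obs["b"][0] = 99     # no longer affects the parent
```

In this toy model, assigning obs and calling a hypothetical copy() would have the same effect on the view, which mirrors the argument made above.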
A real fix would probably be to change DataFrameView to not trigger recursive behavior. Do you have an idea how that could be done?
I see, the bugfix for https://github.com/pandas-dev/pandas/issues/52927 hasn’t made it into a release yet, so the best workaround is to just set a pandas dependency specifier: pandas !=2.1.2.
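Besides pinning the dependency, the same exclusion could be checked at runtime. The following is a minimal sketch using only the standard library; AFFECTED_PANDAS_RELEASES, release_tuple and is_affected are hypothetical names, and the set of affected releases is an assumption that simply mirrors the specifier above.

```python
import re

# Assumption: mirrors the "pandas !=2.1.2" specifier mentioned above.
AFFECTED_PANDAS_RELEASES = {(2, 1, 2)}


def release_tuple(version):
    """Read the leading digits of the first three dot-separated components."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_affected(version):
    """Return True if `version` is one of the releases excluded above."""
    return release_tuple(version) in AFFECTED_PANDAS_RELEASES
```

A caller could pass pd.__version__ to is_affected and fall back to an explicit copy when it returns True. Pre-release suffixes are ignored by this simple parser, so a proper version parser would be preferable in real code.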
@flying-sheep, using the pandas nightly wheel does not solve this issue, so I don't think a new release of pandas will fix this.
So I think we still need to either do this:
"A real fix would probably be to change DataFrameView to not trigger recursive behavior. Do you have an idea how that could be done?"
or fix it upstream.
I also get this behavior for setting any column of a dataframe, not just categorical ones. E.g.:
import anndata as ad, pandas as pd, numpy as np
adata = ad.AnnData(
    obs=pd.DataFrame(
        {"b": [1, 2, 3]},
        index=list("abc")
    )
)
v = adata[[0], :]
v.obs["b"] = 3
Also triggers the recursion error.
As does:
v.obs.drop("b")
I think I've found the problem:
import anndata as ad, pandas as pd, numpy as np
adata = ad.AnnData(
    obs=pd.DataFrame(
        {"b": [1, 2, 3]},
        index=list("abc")
    )
)
v = adata[[0], :]
type(v.obs.copy())
in pandas 2.1.2:
anndata._core.views.DataFrameView
in pandas 2.1.1:
pandas.core.frame.DataFrame
It looks like pd.DataFrame._constructor_from_mgr is what changed.
This is a behavior change in a bug fix release of pandas, so it is possibly a new pandas bug in and of itself.
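The type change reported above can be probed without anndata. This is a rough sketch under an assumption: ProbeFrame is a hypothetical, do-nothing DataFrame subclass used as a proxy for DataFrameView, which carries extra state and may not behave identically.

```python
import pandas as pd


class ProbeFrame(pd.DataFrame):
    """Hypothetical bare subclass, used only as a proxy for DataFrameView."""


def copy_preserves_subclass():
    """Report whether .copy() on a subclass instance returns the subclass."""
    probe = ProbeFrame({"b": [1, 2, 3]}, index=list("abc"))
    return type(probe.copy()) is ProbeFrame


# The result depends on the installed pandas version.
print(pd.__version__, copy_preserves_subclass())
```

Running this under two pandas versions side by side is one way to confirm whether the difference comes from pandas alone.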
It's unclear to me whether https://github.com/pandas-dev/pandas/issues/52927 is relevant to this bug in anndata
This looks relevant: https://github.com/pandas-dev/pandas/issues/55120
Thanks for the info. @flying-sheep I am quite sure I also tried adata = adata.copy(), but it didn't work; that's why I am doing the conversion on the obs.
Very weird! If the class is still a view, it should try updating its parent AnnData object.
Well, I hope pandas reverts this and until their next major release we can come up with a good fix.
Adding the following to DataFrameView fixes all AnnData tests, and all scanpy tests except for scanpy/tests/test_pca.py::test_pca_warnings.
def copy(self, deep: bool = True) -> pd.DataFrame:
    """Create a non-view copy of the DataFrame."""
    return pd.DataFrame(super().copy(deep=deep))
A possible issue is that the tests seemed to run pretty slow, so maybe it breaks some optimization? Maybe I just used too many threads for my poor M1 CPU.
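The effect of the proposed copy() override can be tried on a toy subclass using pandas alone. In this sketch ToyView is a hypothetical stand-in for DataFrameView; only the override itself is taken from the snippet above.

```python
import pandas as pd


class ToyView(pd.DataFrame):
    """Hypothetical stand-in for DataFrameView; only copy() is overridden."""

    def copy(self, deep: bool = True) -> pd.DataFrame:
        """Create a non-view copy of the DataFrame."""
        # Wrapping the result in pd.DataFrame drops the subclass, whichever
        # type the parent implementation of copy() happens to return.
        return pd.DataFrame(super().copy(deep=deep))


view = ToyView({"b": [1, 2, 3]}, index=list("abc"))
plain = view.copy()
plain["b"] = 3  # assignment happens on a plain DataFrame, not on the subclass
```

With this override the copy is always a plain pandas.DataFrame, so later assignments on it no longer go through the subclass.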
@ivirshup also noted that a .groupby(...) on a view reproduces the error, so we should add a test for that.
I am also noticing something strange with anndata 0.10.3 vs anndata 0.10.1 and pandas 2.0.3:
File ~/.conda/envs/organoids/lib/python3.11/site-packages/anndata/compat/__init__.py:400, in _map_cat_to_str(cat)
    397 def _map_cat_to_str(cat: pd.Categorical) -> pd.Categorical:
    398     if _parse_version(pd.__version__) >= _parse_version("2.0"):
    399         # Argument added in pandas 2.0
--> 400         return cat.map(str, na_action="ignore")
    401     else:
    402         return cat.map(str)
TypeError: Categorical.map() got an unexpected keyword argument 'na_action'
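The traceback suggests that the na_action keyword is not accepted by Categorical.map in pandas 2.0.3, even though the version check lets it through. One possible alternative, sketched here as an assumption and not as anndata's actual fix, is to inspect the method signature instead of comparing version numbers; map_cat_to_str is a hypothetical name.

```python
import inspect

import pandas as pd


def map_cat_to_str(cat):
    """Map categories to strings, passing na_action only where it is accepted."""
    params = inspect.signature(pd.Categorical.map).parameters
    if "na_action" in params:
        return cat.map(str, na_action="ignore")
    return cat.map(str)


labels = map_cat_to_str(pd.Categorical([1, 2, 1]))
```

Feature detection of this kind avoids hard-coding the release in which the argument was introduced.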
@dsm-72 please report that separately, it doesn’t have anything to do with DataFrameView.
PR to fix, which is slated for the next pandas bug fix release: https://github.com/pandas-dev/pandas/pull/55764. We can close this if/when the pandas PR merges.
Merged!
Report
I am getting an infinite recursion, most likely due to this pandas bug: https://github.com/pandas-dev/pandas/issues/52927. The bug, which appeared a few days ago (probably with the latest pandas release), is triggered by code that I use to repopulate old categories that get dropped after data subsetting (as a workaround for https://github.com/scverse/anndata/issues/890).
The latest main from pandas is supposed to fix the problem, but it looks like it doesn't. Maybe I should report the bug to pandas; I will now try to reproduce it using pandas code only.
Code:
Traceback:
Versions