scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Issue with sanitize_anndata() in plotting functions when subsets of anndata objects are passed. #166

Open LuckyMD opened 6 years ago

LuckyMD commented 6 years ago

I get quite a strange scanpy error, which appears a bit stochastic... This happened for the first time in version 1.1.

I am trying to get a scatter plot of a subsetted anndata object like this: p4 = sc.pl.scatter(adata[adata.obs['n_counts']<10000 ,:], 'n_counts', 'n_genes', color='mt_frac')

When I do this the first time round, I get this error message about categorical variables from sanitize_anndata (none of which are actually used in the call).

AttributeError                            Traceback (most recent call last)
<ipython-input-66-fc1479c238f7> in <module>()
      9 plt.show()
     10 
---> 11 p4 = sc.pl.scatter(adata[adata.obs['n_counts']<10000 ,:], 'n_counts', 'n_genes', color='mt_frac')
     12 p5 = sc.pl.scatter(adata, 'n_counts', 'n_genes', color='mt_frac')
     13 

~/scanpy/scanpy/plotting/anndata.py in scatter(adata, x, y, color, use_raw, sort_order, alpha, basis, groups, components, projection, legend_loc, legend_fontsize, legend_fontweight, color_map, palette, right_margin, left_margin, size, title, show, save, ax)
    162                 show=show,
    163                 save=save,
--> 164                 ax=ax)
    165 
    166         elif x in adata.var_keys() and y in adata.var_keys() and color not in adata.obs_keys():

~/scanpy/scanpy/plotting/anndata.py in _scatter_obs(adata, x, y, color, use_raw, sort_order, alpha, basis, groups, components, projection, legend_loc, legend_fontsize, legend_fontweight, color_map, palette, right_margin, left_margin, size, title, show, save, ax)
    281         ax=None):
    282     """See docstring of scatter."""
--> 283     sanitize_anndata(adata)
    284     if legend_loc not in VALID_LEGENDLOCS:
    285         raise ValueError(

~/scanpy/scanpy/utils.py in sanitize_anndata(adata)
    481 # backwards compat... remove this in the future
    482 def sanitize_anndata(adata):
--> 483     adata._sanitize()
    484 
    485 

~/anndata/anndata/base.py in _sanitize(self)
   1284                     if len(c.categories) < len(c):
   1285                         df[key] = c
-> 1286                         df[key].cat.categories = df[key].cat.categories.astype('U')
   1287                         logg.info(
   1288                             '... storing \'{}\' as categorical'

~/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
   3608         if (name in self._internal_names_set or name in self._metadata or
   3609                 name in self._accessors):
-> 3610             return object.__getattribute__(self, name)
   3611         else:
   3612             if name in self._info_axis:

~/anaconda3/lib/python3.6/site-packages/pandas/core/accessor.py in __get__(self, instance, owner)
     52             # this ensures that Series.str.<method> is well defined
     53             return self.accessor_cls
---> 54         return self.construct_accessor(instance)
     55 
     56     def __set__(self, instance, value):

~/anaconda3/lib/python3.6/site-packages/pandas/core/categorical.py in _make_accessor(cls, data)
   2209     def _make_accessor(cls, data):
   2210         if not is_categorical_dtype(data.dtype):
-> 2211             raise AttributeError("Can only use .cat accessor with a "
   2212                                  "'category' dtype")
   2213         return CategoricalAccessor(data.values, data.index,

AttributeError: Can only use .cat accessor with a 'category' dtype

Then, I comment out the respective line of code, run the whole thing again, and it works. And when I uncomment the line it works fine again.

When I comment out the line for the first time, I get a couple of lines displayed in the output saying:

... 'donor' was turned into a categorical variable
... 'gene_symbols' was turned into a categorical variable

or something like that...

My theory is that sanitize_anndata() detects that these variables should be categorical and tries to convert them. As this sc.pl.scatter call is the first time sanitize_anndata() is called after the variables are read in, this is the first time the conversion would take place. However, I am calling sc.pl.scatter() on a subsetted anndata object, so it somehow cannot do the conversion. After I call sc.pl.scatter once on a non-subsetted anndata object, the conversion can take place, and I can subsequently call sc.pl.scatter on a subsetted anndata object as well.

If this is true, I can see why this is happening. However, I feel this behaviour will be quite puzzling to a typical user. Maybe sanitize_anndata() should be called before plotting (probably hard to implement), or the plotting functions should have a parameter to plot only a subset of the data. That way sanitize_anndata() can be called on the whole anndata object every time, as there is no longer a reason to pass a view of the object. You could then test whether a view is being passed to sanitize_anndata(), and say "please don't pass subsetted anndata objects to plotting functions" or something like that.

falexwolf commented 6 years ago

Yes, this is related to the fact that sanitize_anndata cannot be meaningfully applied to a view of AnnData. You're right that one should also account for this case... I'll give it a thought. At least there should be a proper error hinting people to call sc.utils.sanitize_anndata when trying the call you mention.
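A guard of the kind suggested here might look like the sketch below. It is not scanpy's implementation: require_non_view is a hypothetical helper, and the attribute name is_view is an assumption based on recent anndata releases (versions contemporary with this thread may have named it differently). Stand-in objects are used so the snippet runs without anndata installed.

```python
from types import SimpleNamespace


def require_non_view(adata):
    """Hypothetical guard: raise a descriptive error when given a view."""
    # `is_view` is assumed here; objects without it are treated as non-views.
    if getattr(adata, "is_view", False):
        raise ValueError(
            "Received a view of AnnData. Call sc.utils.sanitize_anndata on the "
            "full object first, or pass a copy of the subset."
        )
    return adata


# Stand-ins for a full AnnData object and a view of one.
full = SimpleNamespace(is_view=False)
view = SimpleNamespace(is_view=True)

require_non_view(full)  # passes silently
try:
    require_non_view(view)
except ValueError as err:
    print(err)
```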

Thank you very much for pointing this out. :smile: It should have happened also before version 1.1, though.

gokceneraslan commented 5 years ago

I have something that might be related:

ad = ad[ad.obs['cell type'] != 'nan'].copy()
assert np.all(ad.obs['cell type'] != 'nan')
sc.utils.sanitize_anndata(ad)
assert np.all(ad.obs['cell type'] != 'nan')

This fails in the second assert:

AssertionError                            Traceback (most recent call last)
<ipython-input-103-2f44e51fdcae> in <module>
      8 assert np.all(ad.obs['cell type'] != 'nan')
      9 sc.utils.sanitize_anndata(ad)
---> 10 assert np.all(ad.obs['cell type'] != 'nan')
     11 
     12 

AssertionError: 

It's really black magic, any ideas?

PS: the nans are really strings, not proper NaNs.

ivirshup commented 5 years ago

@gokceneraslan are there actually nans in there? Could be related to https://github.com/theislab/anndata/issues/141.

gokceneraslan commented 5 years ago

Yes there are, and this is how I realized it. I saw them in the plots and wondered why they show up after removing them.

gokceneraslan commented 5 years ago

Oh, you mean real NaNs. No, there are none.
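The two kinds of "nan" discussed above can be told apart with a quick check. The sketch below uses a made-up pandas Series rather than the data from the report: a real missing value is caught by isna(), while the string 'nan' is only caught by a string comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical column holding both the string 'nan' and a real NaN.
cell_type = pd.Series(["T cell", "nan", np.nan, "B cell"], dtype=object)

is_missing = cell_type.isna()        # True only for the real NaN
is_nan_string = cell_type == "nan"   # True only for the string 'nan'

print(is_missing.tolist())
print(is_nan_string.tolist())
```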

dparmaksiz16 commented 1 year ago

I'm having this issue when I read in and merge multiple AnnData objects with concat. I can't run any of the plotting functions because I get this error. I tried to convert all object/string obs columns to categorical (except obs names), but I can't get around it at all.
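For reference, the conversion described above can be sketched in plain pandas. Whether this matches what happens inside the concat call in the report is not established here; the snippet only shows that pandas drops the categorical dtype when concatenating categoricals with different categories, and one way to convert such columns back. The helper to_categorical and the column name batch are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

# Two tables whose categorical columns have different categories.
a = pd.DataFrame({"batch": pd.Categorical(["a", "a"])})
b = pd.DataFrame({"batch": pd.Categorical(["b", "b"])})

merged = pd.concat([a, b], ignore_index=True)
# The result is no longer categorical because the categories differ.
lost_category = not isinstance(merged["batch"].dtype, pd.CategoricalDtype)


def to_categorical(df):
    """Hypothetical helper: return a copy with string-like columns as category."""
    out = df.copy()
    for col in out.columns:
        if is_object_dtype(out[col]) or is_string_dtype(out[col]):
            out[col] = out[col].astype("category")
    return out


fixed = to_categorical(merged)
print(lost_category, fixed["batch"].dtype)
```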