scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License
576 stars 152 forks source link

Fix sc.pp.filter_genes implementation for h5py.SparseDataset #179

Open gokceneraslan opened 5 years ago

gokceneraslan commented 5 years ago

Right now sc.pp.filter_genes(adata.X, min_counts=10) works correctly for dense file-backed anndata since np.sum(adata.X) here is supported by h5py.Dataset (see https://github.com/h5py/h5py/blob/master/h5py/_hl/dataset.py#L762).

However the same call on a SparseDataset fails with the following error:

~/.anaconda3/lib/python3.7/site-packages/scanpy/preprocessing/_simple.py in filter_genes(data, min_counts, min_cells, max_counts, max_cells, inplace, copy)
    230         number_per_gene = number_per_gene.A1
    231     if min_number is not None:
--> 232         gene_subset = number_per_gene >= min_number
    233     if max_number is not None:
    234         gene_subset = number_per_gene <= max_number

TypeError: '>=' not supported between instances of 'SparseDataset' and 'int'

because np.sum(adata.X) weirdly returns a SparseDataset without reducing any dimensions:

image

I can think of two solutions:

  1. Set the __array_ufunc__ method of SparseDatset (https://docs.scipy.org/doc/numpy/reference/arrays.classes.html#numpy.class.__array_ufunc__) to None which would state explicitly that numpy operations are not supported. Next, we can print a proper error in filter_cells/filter_genes which states such operations are not supported.

  2. Implement a sum function (which loads the data into memory) which would enable us to perform gene/cell filtering options. We can print a warning here and inform the user that data is loaded to the memory. Something like following works:

def sum(self, *args, **kwargs):
    return self.value.sum(*args, **kwargs)

image

ivirshup commented 5 years ago

That sum behavior is real weird. I'm pretty surprised numpy doesn't throw an error about that one. I'd go as far as to rename this issue to be about that, since I know filtering sparse backed datasets probably won't work due to how subsetting is defined (it has a call to .copy(), which should throw an error for any backed dataset).

Did calling sum work before we had the big scipy update?

flying-sheep commented 4 years ago

See also theislab/scanpy#650

ivirshup commented 4 years ago

This operation is poorly defined for our current backed mode.

What would the expected result be? Does it return a view in backed mode? Do we need to write a new copy of the object somewhere?

flying-sheep commented 4 years ago

For the time being, we should at least raise a better error message than the technical one people get now. Maybe just raise NotImplementedError('filter_genes does not yet(?) work in backed mode')

rbf22 commented 4 years ago

Maybe with your error you can give the suggestion mentioned here: https://github.com/theislab/scanpy/issues/650#issuecomment-499492569

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!