scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.89k stars 594 forks source link

`score_genes` does not work on a dataset loaded with `backed="r+"` #883

Open mfiers opened 4 years ago

mfiers commented 4 years ago

Hi - excellent software, thanks! -

but I do have a problem. If i load a disk backed dataset, I cannot run sc.tl.score_genes.

Given these two sets:

ad = sc.read_h5ad('scdataset.h5ad', backed='r+')
ad2 = sc.read_h5ad('scdataset.h5ad')

and

random_genes = list(ad.var_names.to_series().sample(100))

this works perfectly:

sc.tl.score_genes(ad2, random, score_name="random100", random_state=42)

but, this:

sc.tl.score_genes(ad, random, score_name="random100", random_state=42)

yields the following error:

-----------------------------------------------------------------
ValueError                      Traceback (most recent call last)
<ipython-input-113-9cb28e089b25> in <module>
----> 1 sc.tl.score_genes(ad, random, score_name="random100", random_state=42)

~/.pyenv/versions/mfpy372/lib/python3.7/site-packages/scanpy/tools/_score_genes.py in score_genes(adata, gene_list, ctrl_size, gene_pool, n_bins, score_name, random_state, copy, use_raw)
     90     else:
     91         obs_avg = pd.Series(
---> 92             np.nanmean(_adata[:, gene_pool].X, axis=0), index=gene_pool)  # average expression of genes
     93 
     94     obs_avg = obs_avg[np.isfinite(obs_avg)] # Sometimes (and I don't know how) missing data may be there, with nansfor

<__array_function__ internals> in nanmean(*args, **kwargs)

~/.pyenv/versions/mfpy372/lib/python3.7/site-packages/numpy/lib/nanfunctions.py in nanmean(a, axis, dtype, out, keepdims)
    949     cnt = np.sum(~mask, axis=axis, dtype=np.intp, keepdims=keepdims)
    950     tot = np.sum(arr, axis=axis, dtype=dtype, out=out, keepdims=keepdims)
--> 951     avg = _divide_by_count(tot, cnt, out=out)
    952 
    953     isbad = (cnt == 0)

~/.pyenv/versions/mfpy372/lib/python3.7/site-packages/numpy/lib/nanfunctions.py in _divide_by_count(a, b, out)
    216         else:
    217             if out is None:
--> 218                 return a.dtype.type(a / b)
    219             else:
    220                 # This is questionable, but currently a numpy scalar can

ValueError: setting an array element with a sequence.

thanks Mark

flying-sheep commented 4 years ago

Weird, this sure looks like a bug! Can you give use code to reproduce it with the builtin datasets?

bkmartinjr commented 4 years ago

previously filed on anndata: theislab/anndata#234

mfiers commented 4 years ago

Not sure if still required (the code in the anndata repo is close to this), but here is a minimal example:

import scanpy as sc
adata =  sc.datasets.pbmc3k()
h5adfile = 'pbmc3k.h5ad'
adata.write(h5adfile)
a1 = sc.read_h5ad(h5adfile)
a2 = sc.read_h5ad(h5adfile, backed='r+')
sc.tl.score_genes(a1, ['KIR3DL2-1', 'AL590523.1', 'CT476828.1'])
# OK
sc.tl.score_genes(a2, ['KIR3DL2-1', 'AL590523.1', 'CT476828.1'])
# ERROR 

thanks..

Liripo commented 1 year ago

Has anyone solved it?