scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.9k stars 597 forks source link

ValueError: b'There are other near singularities as well. 0.090619\n' #1504

Open cartal opened 3 years ago

cartal commented 3 years ago

Hi,

Trying to run scVI to analyse my data using the latest scanpy+scvi-tools workflow, as described here.

However, I'm running into a weird issue with the new seurat_v3 flavour to call HVGs. When I run this:

sc.pp.highly_variable_genes(
    adata,
    flavor = "seurat_v3",
    n_top_genes = 7000,
    layer = "counts",
    batch_key = "combined",
    subset = True
)

I get the following error:

If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-3748de5bacdc> in <module>
      5     layer = "counts",
      6     batch_key = "combined",
----> 7     subset = True
      8 )
      9 adata

/opt/conda/lib/python3.7/site-packages/scanpy/preprocessing/_highly_variable_genes.py in highly_variable_genes(adata, layer, n_top_genes, min_disp, max_disp, min_mean, max_mean, span, n_bins, flavor, subset, inplace, batch_key)
    420             span=span,
    421             subset=subset,
--> 422             inplace=inplace,
    423         )
    424 

/opt/conda/lib/python3.7/site-packages/scanpy/preprocessing/_highly_variable_genes.py in _highly_variable_genes_seurat_v3(adata, layer, n_top_genes, batch_key, span, subset, inplace)
     82         x = np.log10(mean[not_const])
     83         model = loess(x, y, span=span, degree=2)
---> 84         model.fit()
     85         estimat_var[not_const] = model.outputs.fitted_values
     86         reg_std = np.sqrt(10 ** estimat_var)

_loess.pyx in _loess.loess.fit()

ValueError: b'There are other near singularities as well. 0.090619\n'

While looking for a solution, I came across this issue that reports a similar problem.

Any ideas of what this may be?

Thanks

Versions

[WARNING: If you miss a compact list, please try `print_header`! ----- anndata 0.7.4 scanpy 1.6.0 sinfo 0.3.1 ----- PIL 8.0.1 SCCAF NA anndata 0.7.4 backcall 0.1.0 cffi 1.14.3 colorama 0.4.4 cycler 0.10.0 cython_runtime NA dateutil 2.8.1 decorator 4.4.1 get_version 2.1 google NA h5py 2.10.0 igraph 0.8.0 importlib_metadata 1.7.0 ipykernel 5.1.4 ipython_genutils 0.2.0 ipywidgets 7.5.1 jedi 0.16.0 joblib 0.17.0 kiwisolver 1.1.0 legacy_api_wrap 1.2 leidenalg 0.8.2 llvmlite 0.31.0 louvain 0.6.1 matplotlib 3.1.3 mpl_toolkits NA natsort 7.0.1 numba 0.48.0 numexpr 2.7.1 numpy 1.19.4 packaging 20.4 pandas 1.1.4 parso 0.6.1 patsy 0.5.1 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA prompt_toolkit 3.0.3 psutil 5.7.2 ptyprocess 0.6.0 pycparser 2.20 pygments 2.7.2 pyparsing 2.4.7 pytz 2020.1 rich NA ruamel NA scanpy 1.6.0 scipy 1.5.4 scvi 0.7.1 seaborn 0.10.0 setuptools_scm NA sinfo 0.3.1 six 1.15.0 sklearn 0.23.2 statsmodels 0.11.1 storemagic NA tables 3.6.1 texttable 1.6.2 threadpoolctl 2.1.0 torch 1.6.0 tornado 6.0.3 tqdm 4.32.2 traitlets 4.3.3 typing_extensions NA umap 0.3.10 wcwidth NA yaml 5.3.1 zipp NA zmq 19.0.0 ----- IPython 7.12.0 jupyter_client 6.0.0 jupyter_core 4.6.3 jupyterlab 1.2.5 notebook 6.0.3 ----- Python 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:33:48) [GCC 7.3.0] Linux-4.4.0-189-generic-x86_64-with-debian-buster-sid 24 logical CPU cores, x86_64 ----- Session information updated at 2020-11-23 09:17
ivirshup commented 3 years ago

Seems like this is a known issue in the source library: https://github.com/has2k1/scikit-misc/issues/9.

I don't think I could comment much more off the top of my head. @adamgayoso, do you have any suggestions here? Is trying to increase the span fine?

adamgayoso commented 3 years ago

I think the solution is to remove some of the most lowly expressed genes, though increasing span may also work.

ivirshup commented 3 years ago

Could you suggest some error handling behavior here? I think there could definitely be a more helpful error message.

gokceneraslan commented 3 years ago

I also experienced this a few times, and took me some time to understand what is going on. I fully agree with @ivirshup, we should improve the error message.

cartal commented 3 years ago

In the end, increasing the span to 1 fixed it for me. However, I'm still not sure why it happened.

adamgayoso commented 3 years ago

I wish I understood why this was happening too. I believe it's the same underlying C code as the R implementation, and I don't think Seurat's code does anything special to prevent this.

Increasing the span could really affect which genes are selected as HVG I believe, whereas removing some outliers by low expression might not?

gokceneraslan commented 3 years ago

In my experience, this happens if batch key is not None and one or more batches have low number of cells. Does it make sense to catch this error and simply skip the problematic batch or inform the user that batch doesn't have enough cells?

gokceneraslan commented 3 years ago

@cartal Do you mind sharing the output of the following:

adata.obs.combined.value_counts()

xiachenrui commented 2 years ago

I think the problem happen when some category in 'batch_key' only have one sample

jamesnemesh commented 1 year ago

I ran into this recently - the problem can occur when batch key has many cells in each batch (see plot). Increasing the span from the default of 0.3 to 0.5 seems to have "fixed" the error. Increasing the filtering stringency for lowly expressed genes (to min_gene=500, min_cells=10) also gets rid of the error.

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.highly_variable_genes(
            adata,
            layer="counts",
            flavor="seurat_v3",
            n_top_genes=num_hvgs,
            batch_key='sex_cell_subtype',
            span=0.5
        )
image