scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.92k stars 600 forks source link

highly variable genes + batch_key --> reciprocal condition number error #2669

Open iamsalil opened 1 year ago

iamsalil commented 1 year ago

Please make sure these conditions are met

What happened?

Note: This bug seems to have been mentioned on 2/13/2022 in this discourse.scverse.org thread: https://discourse.scverse.org/t/error-in-highly-variable-gene-selection/276. However, I can't seem to access scverse.org so I can't see what the cause/resolution was.

Issue: I am running highly_variable_genes with flavor='seurat_v3'. When I do not include a batch_key, the function runs fine. When I add a batch_key, I get a numerical error.

Minimal code sample

sc.pp.highly_variable_genes(adata, flavor='seurat_v3', n_top_genes=1000, batch_key="Covariate")

Error output

----> 1 sc.pp.highly_variable_genes(adata, flavor='seurat_v3', n_top_genes=1000, batch_key="Covariate")

File ~/miniconda3/envs/singlecell/lib/python3.8/site-packages/scanpy/preprocessing/_highly_variable_genes.py:428, in highly_variable_genes(adata, layer, n_top_genes, min_disp, max_disp, min_mean, max_mean, span, n_bins, flavor, subset, inplace, batch_key, check_values)
    422     raise ValueError(
    423         '`pp.highly_variable_genes` expects an `AnnData` argument, '
    424         'pass `inplace=False` if you want to return a `pd.DataFrame`.'
    425     )
    427 if flavor == 'seurat_v3':
--> 428     return _highly_variable_genes_seurat_v3(
    429         adata,
    430         layer=layer,
    431         n_top_genes=n_top_genes,
    432         batch_key=batch_key,
    433         check_values=check_values,
    434         span=span,
    435         subset=subset,
    436         inplace=inplace,
    437     )
    439 if batch_key is None:
    440     df = _highly_variable_genes_single_batch(
    441         adata,
    442         layer=layer,
   (...)
    449         flavor=flavor,
    450     )

File ~/miniconda3/envs/singlecell/lib/python3.8/site-packages/scanpy/preprocessing/_highly_variable_genes.py:85, in _highly_variable_genes_seurat_v3(adata, layer, n_top_genes, batch_key, check_values, span, subset, inplace)
     83 x = np.log10(mean[not_const])
     84 model = loess(x, y, span=span, degree=2)
---> 85 model.fit()
     86 estimat_var[not_const] = model.outputs.fitted_values
     87 reg_std = np.sqrt(10**estimat_var)

File _loess.pyx:899, in _loess.loess.fit()

ValueError: b'reciprocal condition number  6.4708e-16'

Versions

``` ----- anndata 0.9.1 scanpy 1.9.3 ----- PIL 9.5.0 asttokens NA backcall 0.2.0 cffi 1.15.1 comm 0.1.2 cycler 0.10.0 cython_runtime NA dateutil 2.8.2 debugpy 1.5.1 decorator 5.1.1 defusedxml 0.7.1 executing 0.8.3 h5py 3.8.0 igraph 0.10.4 importlib_resources NA ipykernel 6.19.2 ipython_genutils 0.2.0 jedi 0.18.1 joblib 1.2.0 jupyter_server 1.23.4 kiwisolver 1.4.4 leidenalg 0.9.1 llvmlite 0.40.0 matplotlib 3.7.1 mpl_toolkits NA natsort 8.3.1 numba 0.57.0 numpy 1.24.3 packaging 23.0 pandas 2.0.1 parso 0.8.3 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA platformdirs 2.5.2 prompt_toolkit 3.0.36 psutil 5.9.0 ptyprocess 0.7.0 pure_eval 0.2.2 pydev_ipython NA pydevconsole NA pydevd 2.6.0 pydevd_concurrency_analyser NA pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.15.1 pyparsing 3.0.9 pytz 2022.7 scipy 1.10.1 session_info 1.0.0 setuptools 66.0.0 six 1.16.0 sklearn 1.2.2 stack_data 0.2.0 texttable 1.6.7 threadpoolctl 3.1.0 tornado 6.2 traitlets 5.7.1 typing_extensions NA wcwidth 0.2.5 yaml 6.0.1 zipp NA zmq 25.0.2 ----- IPython 8.12.0 jupyter_client 8.1.0 jupyter_core 5.3.0 jupyterlab 3.5.3 notebook 6.5.4 ----- Python 3.8.16 (default, Mar 2 2023, 03:21:46) [GCC 11.2.0] Linux-4.15.0-212-generic-x86_64-with-glibc2.17 ----- Session information updated at 2023-09-19 02:18 ```
DawnChou commented 1 year ago

I am encountering the same error. Have you fixed it?

Aeget1000 commented 1 year ago

I encountered the same problem

VladimirShitov commented 1 year ago

Hi, everyone, @DawnChou , @Aeget1000 , @iamsalil . I faced the same problem as well. TLDR: choose a higher span value in sc.pp.highly_variable_genes. The default is 0.3, which caused an error for me as well. 0.5 worked fine in my case. The information below might be interesting for developers or anyone who wants to understand this error more deeply.

I got the error when using HLCA data. If scanpy developers are interested, I can point to the dataset to reproduce this problem. It is quite big, but I don't know any other example yet.

The error is caused by this line. I got it when selecting HVGs by "dataset" batch key in HLCA. Batch "Sims_2019" caused the problem. Surprisingly, relationships between mean and var as well as between x and y seemed ok:

mean_var_relationship

x_y_relationship

However, something was still causing the problem. I tried to locate the error in the loess calucation in the original package but did not succeed. Anyway, this is a bit out of the scope of scanpy. Setting span to a higher value (0.5) solved the problem for me. If there is no strong argument against it, I suggest changing the default value from 0.3 to 0.5.

By the way, there is another potential bug in this function. If all the values are constant and not_const only consists of False, kernel dies when trying to run model.fit(). Maybe it is prevented previously, but in case it isn't, you might want to check that.