scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.81k stars 583 forks source link

ValueError: b'Extrapolation not allowed with blending' when using `"sc.pp.highly_variable_genes"` function #2853

Open JuHey opened 4 months ago

JuHey commented 4 months ago

Please make sure these conditions are met

What happened?

During preprocessing of concatenated adata file for scvi-based label transfer, processing fails when applying "sc.pp.highly_variable_genes" function with "ValueError: b'Extrapolation not allowed with blending'"

Minimal code sample

aadata = aadata.concatenate(ref_data_WT)

aadata.X
<15445x13343 sparse matrix of type '<class 'numpy.float64'>'
    with 107849393 stored elements in Compressed Sparse Row format>

# pre-processing:
aadata.layers["counts"] = aadata.X.copy()
sc.pp.normalize_total(aadata, target_sum=1e4)
sc.pp.log1p(aadata)
aadata.raw = aadata

sc.pp.highly_variable_genes(aadata, flavor = 'seurat_v3', n_top_genes=2000,
                            layer = "counts", batch_key="batch", subset = True)#, span =0.5

Error output

ValueError                                Traceback (most recent call last)
Cell In[37], line 7
      4 sc.pp.log1p(aadata)
      5 aadata.raw = aadata
----> 7 sc.pp.highly_variable_genes(aadata, flavor = 'seurat_v3', n_top_genes=2000,
      8                             layer = "counts", batch_key="batch", subset = True)#, span =0.5

File ~/mambaforge/envs/soupxEnv/lib/python3.10/site-packages/scanpy/preprocessing/_highly_variable_genes.py:441, in highly_variable_genes(adata, layer, n_top_genes, min_disp, max_disp, min_mean, max_mean, span, n_bins, flavor, subset, inplace, batch_key, check_values)
    439         sig = signature(_highly_variable_genes_seurat_v3)
    440         n_top_genes = cast(int, sig.parameters["n_top_genes"].default)
--> 441     return _highly_variable_genes_seurat_v3(
    442         adata,
    443         layer=layer,
    444         n_top_genes=n_top_genes,
    445         batch_key=batch_key,
    446         check_values=check_values,
    447         span=span,
    448         subset=subset,
    449         inplace=inplace,
    450     )
    452 if batch_key is None:
    453     df = _highly_variable_genes_single_batch(
    454         adata,
    455         layer=layer,
   (...)
    462         flavor=flavor,
    463     )

File ~/mambaforge/envs/soupxEnv/lib/python3.10/site-packages/scanpy/preprocessing/_highly_variable_genes.py:87, in _highly_variable_genes_seurat_v3(adata, layer, n_top_genes, batch_key, check_values, span, subset, inplace)
     85 x = np.log10(mean[not_const])
     86 model = loess(x, y, span=span, degree=2)
---> 87 model.fit()
     88 estimat_var[not_const] = model.outputs.fitted_values
     89 reg_std = np.sqrt(10**estimat_var)

File _loess.pyx:899, in _loess.loess.fit()

ValueError: b'Extrapolation not allowed with blending'

Versions

``` scanpy==1.9.8 anndata==0.10.2 umap==0.5.4 numpy==1.24.4 scipy==1.11.3 pandas==2.1.1 scikit-learn==1.3.1 statsmodels==0.14.0 igraph==0.10.8 pynndescent==0.5.10 ```
eroell commented 4 months ago

Hey - it would be most helpful to post user questions in the scverse forum - there, other users encountering the same question will be able to find a response easier :)

Here, to take care of bugs in scanpy, it is most helpful for us if you are able to share public data/a small part of it/a synthetic data example so that we can check whats going on. Would it possible to you to supply something like that, such that I can reproduce your example myself?

From a first glance, with seurat_v3 requiring count data, it is important that your .X (becoming the layer you refer to as counts) indeed contains counts, otherwise loess quickly runs into stability issues.

ivirshup commented 3 months ago

From a first glance, with seurat_v3 requiring count data, it is important that your .X (becoming the layer you refer to as counts) indeed contains counts, otherwise loess quickly runs into stability issues.

I would expect there would be a warning here if this were the case, since check_values defaults to True.

But at least this person had the same error caused by passing in normalized values:

The author of scikit-misc says:

pass surface="direct"

to the loess solver based only off the error message. So maybe we can enable that.

I don't know enough about loess to be able to say why that would fix this. It would be interesting to see the data that caused this error. I would definitely want to have a reproducible case before attempting a fix.

FlowerPurr commented 3 months ago

My code also has the same bug problem, may I ask, how to solve it?

adata.raw = adata # keep full dimension safe sc.pp.highly_variable_genes( adata, flavor="seurat_v3",# n_top_genes=2000, layer="counts", batch_key="orig.ident", subset=True, span=1 )

Error output

`ValueError Traceback (most recent call last) Cell In[13], line 3 1 #高变基因选取 2 adata.raw = adata # keep full dimension safe ----> 3 sc.pp.highly_variable_genes( 4 adata, 5 flavor="seurat_v3",# 6 n_top_genes=2000, 7 layer="counts", 8 batch_key="orig.ident", 9 subset=True, 10 span=1 11 ) 13 filename = 'melanoma_sw_high_var.h5ad' 14 adata.write(filename)

File ~/miniconda3/envs/scanpy/lib/python3.12/site-packages/scanpy/preprocessing/_highly_variable_genes.py:441, in highly_variable_genes(adata, layer, n_top_genes, min_disp, max_disp, min_mean, max_mean, span, n_bins, flavor, subset, inplace, batch_key, check_values) 439 sig = signature(_highly_variable_genes_seurat_v3) 440 n_top_genes = cast(int, sig.parameters["n_top_genes"].default) --> 441 return _highly_variable_genes_seurat_v3( 442 adata, 443 layer=layer, 444 n_top_genes=n_top_genes, 445 batch_key=batch_key, 446 check_values=check_values, 447 span=span, 448 subset=subset, 449 inplace=inplace, 450 ) 452 if batch_key is None: 453 df = _highly_variable_genes_single_batch( 454 adata, 455 layer=layer, (...) 462 flavor=flavor, 463 )

File ~/miniconda3/envs/scanpy/lib/python3.12/site-packages/scanpy/preprocessing/_highly_variable_genes.py:87, in _highly_variable_genes_seurat_v3(adata, layer, n_top_genes, batch_key, check_values, span, subset, inplace) 85 x = np.log10(mean[not_const]) 86 model = loess(x, y, span=span, degree=2) ---> 87 model.fit() 88 estimat_var[not_const] = model.outputs.fitted_values 89 reg_std = np.sqrt(10**estimat_var)

File _loess.pyx:927, in _loess.loess.fit()

ValueError: b'Extrapolation not allowed with blending'`