Incorrect number setting ofr highly_variable_genes

scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.

https://scanpy.readthedocs.io

BSD 3-Clause "New" or "Revised" License

1.82k stars 588 forks source link

Incorrect number setting ofr highly_variable_genes #2819

Closed HelloWorldLTY closed 6 months ago

HelloWorldLTY commented 6 months ago

Please make sure these conditions are met

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of scanpy.
[X] (optional) I have confirmed this bug exists on the master branch of scanpy.

What happened?

Hi, I intend to use n_top_gene to determine the number of hvgs I intend to have, but I met such errors.

I used to meet this error previously, but no effective solutions.

Could you please double check it? Thanks.

Minimal code sample

Error output

No response

Versions

``` ```

anndata 0.9.2 scanpy 1.9.6

flying-sheep commented 6 months ago

Hi, I don’t see any errors in your example. Can you please specify what you expected to happen and how what happened differs from that?

Also please provide code as text (in a Markdown code block), not as images.

HelloWorldLTY commented 6 months ago

Hi, I intend to have 13634 hvgs, but it gave me 13652 genes. I am curious about the difference. Is it a bug?

Code:

adata_atac.X = (adata_atac.X > 0)*1
sc.pp.highly_variable_genes(adata_atac, n_top_genes=13634)
adata_atac = adata_atac[:,adata_atac.var['highly_variable']]

flying-sheep commented 6 months ago

Could have all kinds of reasons. Genes sharing the same score, duplicate gene names.

We won’t be able to tell without a reproducible example.

HelloWorldLTY commented 6 months ago

If it is caused by the same score, how to ensure the uniquenss? Thanks. I did not receive the var.unique.name error thus I think it was not caused by duplicate gene names

It is neurips 2021 single cell competition dataset.

flying-sheep commented 6 months ago

I’ll take a look if you update your issue with a code block that I can copy, that will download the dataset and then reproduce the error without me having to go to any website and download anything manually.

HelloWorldLTY commented 6 months ago

Thanks. The dataset is here:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE194122

Sicne it is from GSE, I think it is easy to download and it is formed into h5ad format (the multiome part). GSE194122_openproblems_neurips2021_multiome_BMMC_processed.h5ad.gz

Code:

adata_atac.X = (adata_atac.X > 0)*1
sc.pp.highly_variable_genes(adata_atac, n_top_genes=13634)
adata_atac = adata_atac[:,adata_atac.var['highly_variable']]

flying-sheep commented 6 months ago

So this would be a reproducible example:

import gzip
import shutil
from urllib.request import urlopen
from pathlib import Path

from tqdm.notebook import tqdm
import scanpy as sc

url = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE194122&format=file&file=GSE194122%5Fopenproblems%5Fneurips2021%5Fmultiome%5FBMMC%5Fprocessed%2Eh5ad%2Egz"

path = Path("data/GSE194122_openproblems_neurips2021_multiome_BMMC_processed.h5ad")
if not path.is_file():
    with (
        urlopen(url) as raw,
        tqdm.wrapattr(raw, "read", total=int(raw.headers["Content-Length"])) as wrapped,
        gzip.open(wrapped, 'rb') as f_in,
        path.open('wb') as f_out,
    ):
        shutil.copyfileobj(f_in, f_out)

adata_atac = sc.read(path)
adata_atac.X = (adata_atac.X > 0)*1
sc.pp.highly_variable_genes(adata_atac, n_top_genes=13634)
adata_atac = adata_atac[:,adata_atac.var['highly_variable']]

flying-sheep commented 6 months ago

Problem seems to be fixed in the newest scanpy:

grafik