scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io

`hvg` selects more genes than asked for #3157

Closed · ilan-gold closed 1 month ago

ilan-gold commented 1 month ago


What happened?

HVG can produce more than the number of genes asked for as highly variable.

This occurs on these two datasets:

```
wget https://datasets.cellxgene.cziscience.com/e00ab1f4-28cd-497d-b889-94d45840f423.h5ad
```

Minimal code sample

```python
import scanpy as sc

adata1 = sc.read('e00ab1f4-28cd-497d-b889-94d45840f423.h5ad')

sc.pp.normalize_total(adata1, target_sum=1e4)
sc.pp.log1p(adata1)

n_top_gene = 10000
sc.pp.highly_variable_genes(adata1, n_top_genes=n_top_gene)

hvg_system1 = set(adata1.var[adata1.var['highly_variable']].index)
assert len(hvg_system1) == n_top_gene, f"found {len(hvg_system1)} instead of {n_top_gene}"
```

Error output

```
AssertionError                            Traceback (most recent call last)
Cell In[12], line 1
----> 1 assert len(hvg_system1) == n_top_gene, f"found {len(hvg_system1)} instead of {n_top_gene}"

AssertionError: found 13355 instead of 10000
```

Versions

```
import scanpy; scanpy.logging.print_versions()
-----
anndata     0.10.8
scanpy      1.10.0rc2.dev85+gb918a23e
-----
```
ilan-gold commented 1 month ago

@eroell Can you give some context here? The issue is the following lines:

https://github.com/scverse/scanpy/blob/b918a23eb77462837df90d7b3a30a573989d4d48/src/scanpy/preprocessing/_highly_variable_genes.py#L383-L418

Basically, `disp_cut_off = _nth_highest(dispersion_norm, n_top_genes)` can produce a negative value, and it does not consider NaNs at all since it first drops them via `x = x[~np.isnan(x)]`. The last line, `np.nan_to_num(dispersion_norm) >= disp_cut_off`, then selects more genes than you'd expect: the NaNs become 0, but they are compared against a negative cutoff that was computed without considering those NaN-to-0 conversions, so they all pass.

So what is the fix here? Converting NaNs to minus infinity, or converting NaNs to 0 before getting the nth highest?
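To make the mechanism concrete, here is a minimal sketch with toy numbers (not the actual scanpy code, just the two conflicting steps in isolation):

```python
import numpy as np

# Toy normalized dispersions: two valid genes and two NaNs
# (e.g. from constant genes whose dispersion is undefined).
dispersion_norm = np.array([0.5, -1.0, np.nan, np.nan])

# _nth_highest-style cutoff: NaNs are dropped first, so the
# top-2 cutoff lands on a negative value.
valid = dispersion_norm[~np.isnan(dispersion_norm)]
disp_cut_off = np.sort(valid)[-2]  # -1.0

# Buggy comparison: nan_to_num maps NaN -> 0, and 0 >= -1.0 is True,
# so the NaN genes pass too: 4 genes selected instead of 2.
print((np.nan_to_num(dispersion_norm) >= disp_cut_off).sum())  # 4

# With NaN -> -inf instead, only the 2 real genes pass.
print((np.nan_to_num(dispersion_norm, nan=-np.inf) >= disp_cut_off).sum())  # 2
```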

flying-sheep commented 1 month ago

yeah, a fix could simply be

```diff
-return np.nan_to_num(dispersion_norm) >= disp_cut_off
+return np.nan_to_num(dispersion_norm, nan=-np.inf) >= disp_cut_off
```

but I have a hard time coming up with a test. Something like the following doesn't work, as it crashes earlier with "ValueError: cannot specify integer bins when input data contains infinity":

```python
import pytest
import scanpy as sc
from scanpy.datasets import pbmc3k
from scanpy.preprocessing._utils import _get_mean_var  # scanpy's internal helper


@pytest.mark.parametrize("flavor", ["seurat", "cell_ranger"])
def test_no_filter_genes(flavor):
    """Test that even with 0 columns in the data, n_top_genes is respected."""
    adata = pbmc3k()
    means, _ = _get_mean_var(adata.X)
    assert (means == 0).any()
    sc.pp.highly_variable_genes(adata, flavor=flavor, n_top_genes=10000)
    assert adata.var["highly_variable"].sum() == 10000
```
eroell commented 1 month ago

Sorry for being slow here - not sure how I got the mention, as I have not contributed to this part of hvg in the past, but happy to give my two cents on this.

> This occurs on these two datasets:

I think you only linked one? Which is enough, as I assume it is only the constant-gene case causing this issue, and that case is indeed present in this dataset, just as you showed.

Making a test

> ValueError: cannot specify integer bins when input data contains infinity

I think this test is missing `sc.pp.normalize_total` and `sc.pp.log1p` for `flavor="seurat"`. The following test passes with the fix and fails with the unfixed prior version.

```python
import scanpy as sc
from scanpy.preprocessing._utils import _get_mean_var  # scanpy's internal helper


def test_no_filter_genes(flavor):
    """Test that even with 0 columns in the data, n_top_genes is respected."""
    adata = sc.datasets.pbmc3k()
    means, _ = _get_mean_var(adata.X)
    assert (means == 0).any()
    sc.pp.normalize_total(adata, target_sum=10000)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, flavor=flavor, n_top_genes=10000)
    assert adata.var["highly_variable"].sum() == 10000


test_no_filter_genes("seurat")
```

Happy to make this PR

See the second comment below for a follow-up issue...

eroell commented 1 month ago

The data also allows detecting a further issue: multiple genes having the same value as `disp_cut_off`.

This can be reproduced if we, e.g., don't do `sc.pp.normalize_total`:

```python
import scanpy as sc

adata = sc.datasets.pbmc3k()
# sc.pp.normalize_total(adata, target_sum=10000)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=10000)
adata.var["highly_variable"].sum()
# 10367
```

This is due to many genes having the value selected as `disp_cut_off`, i.e. `… x[n-2] = x[n-1] = x[n] = x[n+1] = x[n+2] …`

https://github.com/scverse/scanpy/blob/b918a23eb77462837df90d7b3a30a573989d4d48/src/scanpy/preprocessing/_highly_variable_genes.py#L408-L418
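A minimal numpy sketch of the effect, with hypothetical dispersions (not the pbmc3k data):

```python
import numpy as np

# Hypothetical normalized dispersions with ties at the cutoff position.
dispersion_norm = np.array([6.0, 5.0, 3.0, 3.0, 3.0, 1.0, 0.0])

# The nth-highest cutoff for n_top_genes=4 lands on a tied value.
n_top_genes = 4
disp_cut_off = np.sort(dispersion_norm)[-n_top_genes]  # 3.0

# The >= comparison then keeps every tied gene: 5 genes instead of 4.
print((dispersion_norm >= disp_cut_off).sum())  # 5
```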

I tried to check how Seurat proceeds in such a case, expecting to see how it breaks the ties (data downloaded from here). Here I'm actually not sure how to turn off the scale.factor argument? It's set to 10,000 by default.

```r
library(dplyr)
library(Seurat)
library(patchwork)

pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")

pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k")

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

pbmc <- FindVariableFeatures(pbmc, selection.method = "mean.var.plot", nfeatures = 10000)

length(VariableFeatures(pbmc))
#> 2292
```

However, it turns out Seurat seems to restrict the selection to genes that are variable in the sense of passing the set mean and normalized-dispersion thresholds. scanpy ignores these thresholds when the number of genes is given.
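For reference, a sketch of the two selection modes in scanpy's `highly_variable_genes` (the cutoff values shown are scanpy's documented defaults for the dispersion-based flavors):

```python
import scanpy as sc

adata = sc.datasets.pbmc3k()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Cutoff mode (what Seurat's mean.var.plot effectively enforces):
# genes must pass the mean and normalized-dispersion thresholds.
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)

# Count mode: an exact number is requested and the cutoffs are ignored.
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
```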

So this gives no real insight into how to break ties in this case. I would suggest opening a new issue, which the potential project on comparing the frameworks could address.

flying-sheep commented 1 month ago

> The following test passes with the fix, and fails with the unfixed prior version.

Ah, with normalize_total, it works with seurat, thank you!

> When multiple genes have the same value of disp_cut_off.

I think when the n-th highest value is tied with others, returning more than the specified number makes total sense.

As done by scipy.stats.rankdata, identical data points get identical ranks. So the "values up to the 4th-highest" of the array [6, 5, 3, 3, 3, 1, 0] are [6, 5, 3, 3, 3], not [6, 5, 3, 3]. It makes no sense to include only some of the 3s here.
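A quick sketch of that tie-aware reading with `scipy.stats.rankdata` (using `method="min"` so tied values share their best rank; the default method is `"average"`):

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([6, 5, 3, 3, 3, 1, 0])

# Rank in descending order; with method="min", ties share the same rank.
ranks = rankdata(-x, method="min")
print(ranks)          # [1 2 3 3 3 6 7]

# "Values up to the 4th-highest": all ties at the cutoff are included.
print(x[ranks <= 4])  # [6 5 3 3 3]
```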