scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.89k stars 595 forks source link

sc.pp.highly_variable_genes produces different hvg when run on same object with same n_genes #2579

Open MishalAshraf opened 1 year ago

MishalAshraf commented 1 year ago

Please make sure these conditions are met

What happened?

I would expect that when you call sc.pp.highly_variable_genes on the same dataset and request the same number of genes, that you would get the same output. The below example suggests that this is not the case.

Minimal code sample

adata_sub = sc.read_h5ad("your_favourite_object.h5ad")
n_genes = 1491
for i in range(10):
    sc.pp.highly_variable_genes(adata_sub, n_top_genes=n_genes)

    unique_genes = list(adata_sub.var['highly_variable'][adata_sub.var['highly_variable'] == True].index)

    if i == 0:
        all_unique = list(set(unique_genes))
        print(f"total {len(all_unique)} unique genes")
    else:
        all_unique = list(set(all_unique+unique_genes))
        print(f"total {len(all_unique)} unique genes")

Error output

total 1491 unique genes
total 1814 unique genes
total 2042 unique genes
total 2163 unique genes
total 2237 unique genes
total 2305 unique genes
total 2356 unique genes
total 2401 unique genes
total 2437 unique genes
total 2453 unique genes

Versions

``` ----- anndata 0.9.1 scanpy 1.9.3 ----- PIL 8.4.0 asciitree NA asttokens NA backcall 0.2.0 beta_ufunc NA binom_ufunc NA bottleneck 1.3.4 cffi 1.15.0 cloudpickle 2.0.0 colorama 0.4.4 cycler 0.10.0 cython_runtime NA cytoolz 0.11.0 dask 2022.02.1 dateutil 2.8.1 debugpy 1.5.1 decorator 5.1.1 defusedxml 0.7.1 django 4.1.3 entrypoints 0.4 executing 0.8.3 fasteners 0.18 fsspec 2023.4.0 google NA h5py 3.6.0 igraph 0.10.6 ipykernel 6.9.1 ipython_genutils 0.2.0 ipywidgets 7.6.5 jedi 0.18.1 jinja2 3.1.2 joblib 1.1.0 jupyter_server 1.13.5 kiwisolver 1.2.0 kneed 0.8.3 leidenalg 0.10.1 llvmlite 0.38.0 markupsafe 2.0.1 matplotlib 3.5.1 matplotlib_inline NA mishalpy NA mpl_toolkits NA mpmath 1.2.1 msgpack 1.0.2 natsort 8.3.1 nbinom_ufunc NA nt NA ntsecuritycon NA numba 0.55.1 numcodecs 0.11.0 numexpr 2.8.1 numpy 1.21.6 opt_einsum v3.3.0 packaging 21.3 pandas 1.4.3 parso 0.8.3 pickleshare 0.7.5 pkg_resources NA plotly 5.6.0 prompt_toolkit 3.0.20 psutil 5.8.0 pure_eval 0.2.2 pycparser 2.21 pydev_ipython NA pydevconsole NA pydevd 2.6.0 pydevd_concurrency_analyser NA pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.15.1 pyparsing 2.4.7 pythoncom NA pytz 2020.1 pywintypes NA ruamel NA scipy 1.7.3 seaborn 0.11.2 session_info 1.0.0 setuptools 61.2.0 six 1.15.0 sklearn 1.0.2 snappy NA sparse 0.14.0 sphinxcontrib NA stack_data 0.2.0 statsmodels 0.13.2 sympy 1.10.1 tblib 1.7.0 texttable 1.6.7 threadpoolctl 2.2.0 tlz 0.11.0 toolz 0.11.2 torch 2.0.1+cpu tornado 6.1 tqdm 4.64.0 traitlets 5.9.0 typing_extensions NA wcwidth 0.2.5 win32api NA win32com NA win32security NA yaml 6.0 zarr 2.14.2 zipp NA zmq 22.3.0 zope NA ----- IPython 8.2.0 jupyter_client 6.1.12 jupyter_core 4.9.2 jupyterlab 3.3.2 notebook 6.4.8 ----- Python 3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)] Windows-10-10.0.22000-SP0 ----- Session information updated at 2023-07-31 10:13 ```
MishalAshraf commented 1 year ago

Interestingly, the same code run with a slight modification produces the intended behaviour. Is this intended? `

adata_sub_master = sc.read_h5ad("your_favourite_object.h5ad")
n_genes = 1491

for i in range(10):

    adata_sub = adata_sub_master.copy()
    sc.pp.highly_variable_genes(adata_sub, n_top_genes=n_genes)

    unique_genes = list(adata_sub.var['highly_variable'][adata_sub.var['highly_variable'] == True].index)

    if i == 0:
        all_unique = list(set(unique_genes))
        print(f"total {len(all_unique)} unique genes")
    else:
        all_unique = list(set(all_unique+unique_genes))
        print(f"total {len(all_unique)} unique genes")`