scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Rank genes groups wilcoxon test fails for more than 10 million cells #3377

Open abs51295 opened 3 days ago

abs51295 commented 3 days ago

What happened?

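Running `rank_genes_groups` with `method='wilcoxon'` on a dataset with more than 10 million cells fails because the maximum chunk size is hardcoded here: https://github.com/scverse/scanpy/blob/751eafac9259edfacf083b0ffff268ca93182cd9/src/scanpy/tools/_rank_genes_groups.py#L51. Once `n_cells` exceeds that constant, `floor(CONST_MAX_SIZE / n_cells)` evaluates to 0, and that 0 is then used as the step of `range()`.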

I tried doing it and here's the error:

     72 # Calculate chunk frames
     73 max_chunk = floor(CONST_MAX_SIZE / n_cells)
---> 75 for left in range(0, n_genes, max_chunk):
     76     right = min(left + max_chunk, n_genes)
     78     df = pd.DataFrame(data=get_chunk(X, left, right))

ValueError: range() arg 3 must not be zero
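
To make the failure mode concrete, here is a minimal standalone sketch of the chunking arithmetic from the traceback. It does not call scanpy. The value `10_000_000` for `CONST_MAX_SIZE` is inferred from the 10 million cell threshold described above, and the helper names `chunk_bounds` and `chunk_bounds_guarded` are hypothetical. The guarded variant shows one possible fix (clamping the chunk width to at least one gene), not necessarily the one the maintainers would choose.

```python
from math import floor

# Assumed value of the hardcoded limit, inferred from the 10M-cell threshold.
CONST_MAX_SIZE = 10_000_000


def chunk_bounds(n_cells, n_genes, max_size=CONST_MAX_SIZE):
    """Mirror the loop in the traceback: gene ranges of at most max_chunk columns."""
    max_chunk = floor(max_size / n_cells)  # becomes 0 once n_cells > max_size
    return [
        (left, min(left + max_chunk, n_genes))
        for left in range(0, n_genes, max_chunk)  # range() rejects a step of 0
    ]


def chunk_bounds_guarded(n_cells, n_genes, max_size=CONST_MAX_SIZE):
    """Hypothetical fix: always process at least one gene per chunk."""
    max_chunk = max(1, floor(max_size / n_cells))
    return [
        (left, min(left + max_chunk, n_genes))
        for left in range(0, n_genes, max_chunk)
    ]


if __name__ == "__main__":
    # Below the limit, both versions agree.
    print(chunk_bounds(5_000_000, 5))
    # Just above the limit, the unguarded version raises ValueError.
    try:
        chunk_bounds(10_000_001, 5)
    except ValueError as err:
        print("unguarded:", err)
    # The guarded version falls back to one gene per chunk.
    print(chunk_bounds_guarded(10_000_001, 5))
```

With the clamp, a chunk holds more than `CONST_MAX_SIZE` values for very large datasets, so peak memory per chunk grows with `n_cells`. That trade-off would need to be weighed in an actual patch.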

Minimal code sample

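import scanpy as sc

# adata: an AnnData object with more than 10 million cells and a 'louvain' column in .obs
sc.tl.rank_genes_groups(adata=adata, groupby='louvain', method='wilcoxon', pts=True, use_raw=False)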

Error output

Versions

```
-----
anndata 0.11.1
scanpy 1.10.4
-----
PIL 11.0.0
anyio NA
arrow 1.3.0
asttokens NA
attr 24.2.0
attrs 24.2.0
babel 2.16.0
backports NA
brotli 1.1.0
cachetools 5.5.0
certifi 2024.08.30
cffi 1.17.1
charset_normalizer 3.4.0
click 8.1.7
cloudpickle 3.1.0
colorama 0.4.6
comm 0.2.2
cuda 0+untagged.302.g4a12ae2.dirty
cudf 24.10.01
cugraph 24.10.00
cuml 24.10.00
cupy 13.3.0
cupy_backends NA
cupyx NA
cycler 0.12.1
cython_runtime NA
cytoolz 1.0.0
dask 2024.9.0
dask_cuda 24.10.00
dask_cudf 24.10.01
dask_expr 1.1.14
dateutil 2.9.0.post0
debugpy 1.8.8
decorator 5.1.1
defusedxml 0.7.1
distributed 2024.9.0
executing 2.1.0
fastjsonschema NA
fastrlock 0.8.2
fqdn NA
fsspec 2024.10.0
google NA
h5py 3.12.1
idna 3.10
igraph 0.11.6
ipykernel 6.29.5
isoduration NA
jaraco NA
jedi 0.19.2
jinja2 3.1.4
joblib 1.4.2
json5 0.9.28
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema_specifications NA
jupyter_events 0.10.0
jupyter_server 2.14.2
jupyterlab_server 2.27.3
kiwisolver 1.4.7
legacy_api_wrap NA
leidenalg 0.10.2
llvmlite 0.43.0
locket NA
louvain 0.8.2
lz4 4.3.3
markupsafe 3.0.2
matplotlib 3.9.2
matplotlib_inline 0.1.7
more_itertools 10.5.0
mpl_toolkits NA
msgpack 1.1.0
natsort 8.4.0
nbformat 5.10.4
networkx 3.4.2
numba 0.60.0
numpy 1.26.4
nvtx NA
overrides NA
packaging 24.2
pandas 2.2.2
parso 0.8.4
patsy 1.0.1
pickleshare 0.7.5
pkg_resources NA
platformdirs 4.3.6
prometheus_client NA
prompt_toolkit 3.0.48
psutil 6.1.0
pure_eval 0.2.3
pyarrow 17.0.0
pycparser 2.22
pydev_ipython NA
pydevconsole NA
pydevd 3.2.2
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.18.0
pylibcudf NA
pylibcugraph 24.10.00
pylibraft 24.10.00
pynndescent 0.5.13
pynvjitlink 0.4.0
pynvml 11.4.1
pyparsing 3.2.0
pythonjsonlogger NA
pytz 2024.2
raft_dask 24.10.00
rapids_dask_dependency NA
rapids_singlecell 0.10.11
referencing NA
requests 2.32.3
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
rmm 24.10.00
rpds NA
scipy 1.14.1
send2trash NA
session_info 1.0.0
six 1.16.0
sklearn 1.5.2
sniffio 1.3.1
socks 1.7.1
sortedcontainers 2.4.0
sparse 0.15.4
stack_data 0.6.2
statsmodels 0.14.4
tblib 3.0.0
texttable 1.7.0
threadpoolctl 3.5.0
tlz 1.0.0
toolz 1.0.0
torch 2.4.1.post300
torchgen NA
tornado 6.4.1
tqdm 4.67.0
traitlets 5.14.3
treelite 4.3.0
typing_extensions NA
umap 0.5.7
uri_template NA
urllib3 2.2.3
wcwidth 0.2.13
webcolors 24.8.0
websocket 1.8.0
yaml 6.0.2
zict 3.0.0
zipp NA
zmq 26.2.0
zoneinfo NA
zstandard 0.23.0
-----
IPython 8.29.0
jupyter_client 8.6.3
jupyter_core 5.7.2
jupyterlab 4.2.6
notebook 7.2.2
-----
Python 3.11.10 | packaged by conda-forge | (main, Oct 16 2024, 01:27:36) [GCC 13.3.0]
Linux-4.18.0-348.el8.x86_64-x86_64-with-glibc2.28
-----
Session information updated at 2024-11-19 14:22
```