[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of scanpy.
[x] (optional) I have confirmed this bug exists on the main branch of scanpy.
What happened?
When running sc.tl.rank_genes_groups, one can provide multiple groups for comparison. However, correction for multiple comparisons are done only within a single pair of group and reference. In other words,
for group2 in perts:
sc.tl.rank_genes_groups(adata,
groupby = "gene",
groups = [group2],
reference = group1,
method = 'wilcoxon')
To give a bit more context, I am processing Perturb-Seq data from Replogle Cell 2022 paper. I am new to bioinformatics, hence might be missing something. Please let me know whether it is an actual bug or me using the package wrong.
Minimal code sample
import numpy as np
import pandas as pd
import scanpy as sc
from anndata import AnnData
p_value_threshold = 0.05
# Create a minimalistic AnnData object
data = np.random.rand(1000, 5) # 1000 cells, 5 genes
obs = pd.DataFrame(index=[f'cell{i}' for i in range(1000)])
var = pd.DataFrame(index=[f'gene{i}' for i in range(5)])
adata = AnnData(X=data, obs=obs, var=var)
# Add a 'gene' column to obs to use as groupby
adata.obs['gene'] = np.random.choice(['sample0', 'sample1', 'sample2', 'sample3', 'sample4'], size=1000)
# Define groups
group1 = 'sample0'
perts = ['sample1', 'sample2', 'sample3', 'sample4']
# Run the loop to get p-values
for group2 in perts:
sc.tl.rank_genes_groups(adata,
groupby='gene',
groups=[group2],
reference=group1,
method='wilcoxon')
result = adata.uns["rank_genes_groups"]
#mask = result['pvals_adj'][group2] < p_value_threshold
filtered_genes = result['names'][group2]#[mask]
filtered_pvals = result['pvals_adj'][group2]#[mask]
filtered_scores = result['scores'][group2]#[mask]
print(filtered_pvals)
print('______________________________________________')
# Run all at once
sc.tl.rank_genes_groups(adata,
groupby='gene',
groups=perts,
reference=group1,
method='wilcoxon')
result = adata.uns["rank_genes_groups"]
for group2 in perts:
#mask = result['pvals_adj'][group2] < p_value_threshold
filtered_genes = result['names'][group2]#[mask]
filtered_pvals = result['pvals_adj'][group2]#[mask]
filtered_scores = result['scores'][group2]#[mask]
print(filtered_pvals)
Error output
I would expect to see different adjusted p-values for the first and the second case. When looping (first case) the method does not see other comparisons coming from the loop, while in the second case the method does see them but still does not correct for them.
Versions
```
-----
anndata 0.10.7
scanpy 1.10.2
-----
PIL 10.4.0
anyio NA
apport_python_hook NA
arrow 1.3.0
asttokens NA
attr 23.2.0
attrs 23.2.0
babel 2.14.0
cairo 1.20.1
certifi 2024.07.04
cffi 1.16.0
chardet 4.0.0
charset_normalizer 3.3.2
cloudpickle 3.0.0
colorama 0.4.4
comm 0.2.2
cycler 0.12.1
cython_runtime NA
cytoolz 0.12.3
dask 2024.8.0
dateutil 2.9.0.post0
debugpy 1.8.1
decorator 5.1.1
defusedxml 0.7.1
dill 0.3.8
exceptiongroup 1.2.1
executing 2.0.1
fastjsonschema NA
fqdn NA
gi 3.42.1
gio NA
glib NA
gobject NA
gtk NA
h5py 3.11.0
idna 3.3
igraph 0.11.5
ipykernel 6.29.4
isoduration NA
jedi 0.19.1
jinja2 3.1.3
joblib 1.4.2
json5 0.9.25
jsonpointer 2.4
jsonschema 4.22.0
jsonschema_specifications NA
jupyter_events 0.10.0
jupyter_server 2.14.0
jupyterlab_server 2.27.1
kiwisolver 1.4.5
legacy_api_wrap NA
leidenalg 0.10.2
llvmlite 0.42.0
louvain 0.8.2
lz4 4.3.3
markupsafe 2.1.5
matplotlib 3.8.4
more_itertools 8.10.0
mpl_toolkits NA
mpmath 1.3.0
natsort 8.4.0
nbformat 5.10.4
netifaces 0.11.0
numba 0.59.1
numexpr 2.10.1
numpy 1.26.4
overrides NA
packaging 24.0
pandas 2.2.2
parso 0.8.4
pkg_resources NA
platformdirs 4.2.1
plotly 5.22.0
prometheus_client NA
prompt_toolkit 3.0.43
psutil 5.9.8
pure_eval 0.2.2
pyarrow 16.0.0
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pydot 2.0.0
pygments 2.17.2
pyparsing 3.1.2
pythonjsonlogger NA
pytz 2022.1
referencing NA
requests 2.31.0
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
rpds NA
scipy 1.13.0
send2trash NA
session_info 1.0.0
sitecustomize NA
six 1.16.0
sklearn 1.4.2
sniffio 1.3.1
socks 1.7.1
stack_data 0.6.3
sympy 1.12
tblib 3.0.0
texttable 1.7.0
threadpoolctl 3.5.0
tlz 0.12.3
toolz 0.12.1
torch 2.1.2+cu121
torchgen NA
tornado 6.4
tqdm 4.66.4
traitlets 5.14.3
typing_extensions NA
tzdata 2024.1
uri_template NA
urllib3 2.2.1
wcwidth 0.2.13
webcolors 1.13
websocket 1.8.0
xxhash NA
yaml 5.4.1
zipp NA
zmq 26.0.3
zoneinfo NA
-----
IPython 8.24.0
jupyter_client 8.6.1
jupyter_core 5.7.2
jupyterlab 4.2.3
notebook 7.2.0
-----
Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
Linux-6.5.0-1027-oem-x86_64-with-glibc2.35
-----
Session information updated at 2024-09-02 14:19
```
Please make sure these conditions are met
What happened?
When running sc.tl.rank_genes_groups, one can provide multiple groups for comparison. However, correction for multiple comparisons are done only within a single pair of group and reference. In other words,
Gives exactly the same p-values as the following:
To give a bit more context, I am processing Perturb-Seq data from Replogle Cell 2022 paper. I am new to bioinformatics, hence might be missing something. Please let me know whether it is an actual bug or me using the package wrong.
Minimal code sample
Error output
Versions