rank_genes_groups fails to account for multiple comparisons when multiple groups are provided. #3221

Open a93sokol opened 4 weeks ago

a93sokol commented 4 weeks ago

What happened?

When running rank_genes_groups, one can provide multiple groups for comparison. However, correction for multiple comparisons is done only within a single pair of group and reference. In other words,

for group2 in perts:
    sc.tl.rank_genes_groups(adata,
                            groupby = "gene",
                            groups = [group2],
                            reference = group1,
                            method = 'wilcoxon')

gives exactly the same p-values as the following:

sc.tl.rank_genes_groups(adata,
                        groupby = "gene",
                        groups = perts,
                        reference = group1,
                        method = 'wilcoxon')

To give a bit more context, I am processing Perturb-Seq data from the Replogle et al. (Cell, 2022) paper. I am new to bioinformatics, so I might be missing something. Please let me know whether this is an actual bug or whether I am using the package incorrectly.

Minimal code sample

import numpy as np
import pandas as pd
import scanpy as sc
from anndata import AnnData

p_value_threshold = 0.05
# Create a minimalistic AnnData object
data = np.random.rand(1000, 5)  # 1000 cells, 5 genes
obs = pd.DataFrame(index=[f'cell{i}' for i in range(1000)])
var = pd.DataFrame(index=[f'gene{i}' for i in range(5)])
adata = AnnData(X=data, obs=obs, var=var)

# Add a 'gene' column to obs to use as groupby
adata.obs['gene'] = np.random.choice(['sample0', 'sample1', 'sample2', 'sample3', 'sample4'], size=1000)

# Define groups
group1 = 'sample0'
perts = ['sample1', 'sample2', 'sample3', 'sample4']

# Run the loop to get p-values (one call per perturbation group)
for group2 in perts:
    sc.tl.rank_genes_groups(adata,
                            groupby="gene",
                            groups=[group2],
                            reference=group1,
                            method='wilcoxon')
    result = adata.uns["rank_genes_groups"]
    #mask = result['pvals_adj'][group2] < p_value_threshold
    filtered_genes = result['names'][group2]#[mask]
    filtered_pvals = result['pvals_adj'][group2]#[mask]
    filtered_scores = result['scores'][group2]#[mask]

# Run all at once (a single call with every perturbation group)
sc.tl.rank_genes_groups(adata,
                        groupby="gene",
                        groups=perts,
                        reference=group1,
                        method='wilcoxon')
result = adata.uns["rank_genes_groups"]
for group2 in perts:
    #mask = result['pvals_adj'][group2] < p_value_threshold
    filtered_genes = result['names'][group2]#[mask]
    filtered_pvals = result['pvals_adj'][group2]#[mask]
    filtered_scores = result['scores'][group2]#[mask]

Error output

I would expect to see different adjusted p-values in the first and second cases. When looping (first case), each call cannot see the other comparisons made in the loop. In the second case, the call does see all the comparisons but still does not correct across them.
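
To make the expectation concrete, here is a minimal NumPy-only sketch of what correcting across all groups could look like. It is not scanpy's implementation: `bh_adjust` and `adjust_across_groups` are hypothetical helper names, the p-values are made-up illustration values, and Benjamini-Hochberg is assumed as the correction method. In practice the per-group input would presumably come from the raw `pvals` field of `adata.uns["rank_genes_groups"]`.

```python
import numpy as np


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values for a 1-D array (hypothetical helper)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


def adjust_across_groups(pvals_by_group):
    """Pool raw p-values from every group, adjust once, then split back per group."""
    names = list(pvals_by_group)
    sizes = [len(pvals_by_group[g]) for g in names]
    pooled = np.concatenate(
        [np.asarray(pvals_by_group[g], dtype=float) for g in names]
    )
    adjusted = bh_adjust(pooled)
    pieces = np.split(adjusted, np.cumsum(sizes)[:-1])
    return dict(zip(names, pieces))


# Made-up raw p-values for two groups compared against the same reference
raw = {
    "sample1": [0.01, 0.04, 0.03],
    "sample2": [0.02, 0.5, 0.2],
}

# Correction within each group separately (the behaviour described above)
per_group = {g: bh_adjust(p) for g, p in raw.items()}
# Correction over all group-vs-reference comparisons at once
pooled = adjust_across_groups(raw)

for g in raw:
    print(g, "per-group:", per_group[g], "pooled:", pooled[g])
```

With pooling, the number of tests in the correction is the number of genes times the number of groups, so the adjusted values generally differ from the per-group ones.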


