failure to reproduce significant perturbations in Norman et al.

I was unable to recover any perturbations with significant effects using the e-distance and corresponding test in the Norman et al. dataset following paper methods and suggested baselines. Following from the preprint manuscript:

As a baseline for significant perturbations, as defined by the E-test, we suggest at least 300 cells per perturbation (Fig 5B) and 1000 average UMI counts per cell (Fig 5D) as an experimental guideline for distinguishable perturbations.

And additional details from methods section:

Cells were kept if they had a minimum of 1000 UMI counts, and genes with a minimum of 50 cells. 2000 highly variable genes were selected using scanpy.pp.find_variable_genes with flavor ‘seurat_v3’. We normalized the count matrix using scanpy.pp.normalize_total and log-transformed the data using scanpy.pp.log1p; We did not z-scale the data. Next, we computed PCA based on the highly variable genes. The E-distances were computed in that PCA space using 50 components and Euclidean distance. To avoid problems due to different numbers of cells per perturbation, we subsampled each dataset such that all perturbations had the same number of cells. We removed all perturbations with fewer than 50 cells and then subsampled to the number of cells in the smallest perturbation left after filtering.

With min_cells=300 (cell per pert), subsampling/filtering the perturbation conditions resulted in 153 conditions (including control) with 300 total cells. The control cells condition is also subsampled. Upon usage of the e-distance test, 0 passed adjusted significance. I also tried this with min_cells=50 (cell per pert), with the same result of 0 passing significance. This is in contrast to Supp Table 4.

Can you please clarify this discrepancy? Pointing to the code would help. Thanks.

Here is the code I used -

import scanpy as sc
from scperturb import *

def filter_and_subsample(adata, perturbation_key, min_cells=300):
    # Remove perturbations with fewer than n cells
    perturbation_counts = adata.obs[perturbation_key].value_counts()
    valid_perturbations = perturbation_counts[perturbation_counts >= min_cells].index
    adata = adata[adata.obs[perturbation_key].isin(valid_perturbations)]

    # Subsample each perturbation to the number of cells in the smallest perturbation left
    min_cells_in_perturbation = min(adata.obs[perturbation_key].value_counts())
    subsampled_indices = []

    for pert in valid_perturbations:
        pert_indices = adata.obs[adata.obs[perturbation_key] == pert].sample(min_cells_in_perturbation, random_state=42).index
        subsampled_indices.extend(pert_indices)

    adata_subsampled = adata[subsampled_indices]

    return adata_subsampled

adata = sc.read_h5ad("data/NormanWeissman2019_filtered.h5ad")

adata = adata[adata.X.sum(axis=1) >= 1000]

sc.pp.filter_genes(adata, min_cells=50)

# or use min_cells=50
adata = filter_and_subsample(adata, 'perturbation', min_cells=300)

sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor='seurat_v3')
adata = adata[:, adata.var['highly_variable']]

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

sc.tl.pca(adata, n_comps=50)

estats = edist_to_control(adata, obs_key='perturbation', control='control', 
                          obsm_key='X_pca', dist='sqeuclidean', )
df = etest(adata, obs_key='perturbation', obsm_key='X_pca', dist='sqeuclidean', control='control', alpha=0.05, runs=1000, n_jobs=-1)
df['significant_adj'].sum()

sanderlab / scPerturb

failure to reproduce significant perturbations in Norman et al. #25