Normalizing subsetted data

Hi @LuckyMD, I've been reprocessing some old data using your single cell tutorial workflow and have a best practices question (I am not sure if this is the correct place for this, or if this should be moved to the new scverse discourse group?). I have an adata object that is scran normalized. I want to take a subset of clusters from the adata object and create a new adata_sub object with its own dimensional reduction to investigate a subpopulation of interest. My understanding is that, had I opted for a basic log normalization, I would not need to re-normalize adata_sub as log normalization is done on a per-cell basis. However, because scran normalization uses a coarse clustering of cells present in the object, I would want to re-normalize adata_sub if adata had been normalized via scran, correct? What would be the best way to subset and re-normalize adata_sub? I am not sure what steps of the original scran normalization process need to be repeated and what steps can be omitted. For instance, I would want to perform a new clustering for the subsetted data for scran normalization and get new size factors, but I wouldn't need to set adata_sub.layers['counts'] = adata_sub.X.copy(), as adata_sub contains a subsetted counts layer from adata (correct?). Would I need to restore adata_sub.raw = adata_sub in this scenario?:

# subset adata to clusters of interest
adata_sub = adata[adata.obs['leiden_r1.0'].isin(['1, '3', '5'])].copy()

# perform clustering for scran normalization
adata_sub_pp = adata_sub.copy()
#sc.pp.normalize_per_cell(adata_sub_pp, counts_per_cell_after = 1e6) - can we omit this since we did it for adata?
#sc.pp.log1p(adata_sub_pp) - can we omit this since we did it for adata?
sc.pp.pca(adata_sub_pp, n_comps = 15)
sc.pp.neighbors(adata_sub_pp)
sc.tl.leiden(adata_sub_pp, key_added = 'groups', resolution = 0.5)

# preprocess variables for scran normalization
input_groups = adata_sub_pp.obs['groups']
data_mat = adata_sub.X.T

%%R -i data_mat -i input_groups -o size_factors

size_factors = sizeFactors(computeSumFactors(SingleCellExperiment(list(counts = data_mat)), 
                                             clusters = input_groups, 
                                             min.mean = 0.1))

del adata_sub_pp

adata_sub.obs['size_factors'] = size_factors # overwrites existing ['size_factors'] from adata

# adata_sub.layers['counts'] = adata_sub.X.copy() - this can be omitted?

# Normalize adata_sub
adata_sub.X /= adata_sub.obs['size_factors'].values[:,None]
sc.pp.log1p(adata_sub) # should this be omitted?
adata_sub.X = sp.sparse.csr_matrix(adata_sub.X)
adata_sub.raw = adata_sub

Thank you for any help and advice!

Hi @oligomyeggo,

I have not previously re-normalized only because I am subsetting the data. I think there's a difference between reproducibility and optimal processing here. In general, normalization aims to remove effects of sequencing depth differences. I would assume this works fairly well in most cases with scran, also across the full dataset. The main question here is whether the assumption that fewer than 50% of genes are differentially expressed (either between coarse cluster centroids or between cells in a coarse cluster) is violated or not. Even if this is violated, if they are equally up- or down-regulated in these comparisons, scran could still be okay. If this assumption is not violated, then I don't think you need to re-normalize. If it is, then maybe normalizing a subset of the data would be useful. In that case you will not be able to compare cells across different subsets though as your normalization baseline is different.

If you really want to re-normalize, you don't need to set the counts again... but everything else I would do. Also note, that scran will genereate its own coarse clustering if you don't do it for the method. I did it at the time as i thought it might be better via louvain and logCPM than the original cosine-normalized implementation.

theislab / single-cell-tutorial

Normalizing subsetted data #90