igrabski / sc-SHC

Significance analysis for clustering single-cell RNA-sequencing data

Underclustering with batch correction #10

Closed: scharch closed this issue 1 year ago

scharch commented 1 year ago

When I use the batch parameter, I get only 1-2 clusters, even with alpha turned way up. One sample is longitudinal PBMC from a single donor, which should have (at least) monocytes, dendritic cells, NK cells, and B cells plus probably some platelets and some T cells. The other is pooled B cells from multiple donors and time points, for which there should be separate clusters for naive B cells, memory B cells, and plasmablasts. With batch correction, I can only get one cluster from sc-SHC. Without, I get 8 clusters (at alpha=0.05) including a clear batch effect (one cluster is entirely from a single sort).

FWIW, I am reanalyzing a Seurat object like so: `clusters <- scSHC( seurat_object@assays$RNA@counts, [ batch=seurat_object$my.sample.labels ] )`. I am doing it this way because (a) I couldn't think of another easy way to combine different 10x lanes and (b) we 'hash' multiple samples into each 10x lane and this Seurat object already has the hashes demultiplexed (and doublets removed). But maybe there's a problem doing it this way that I'm missing?

Thanks!

igrabski commented 1 year ago

Hi Chaim, thanks for trying our tool! I have two questions. First, could you let me know the approximate size of your data (roughly how many cells / sample)? Second, how much biological variability do you expect across donors/time points, and in particular, is there any possibility the cluster with a clear batch effect represents a biological difference?

scharch commented 1 year ago

The single-donor sample is 21,348 cells. The B cell sample is 65,406 cells, 2,000-3,000 per individual. The number per time point varies pretty widely (in particular, we only collected naive B cells at the first time point, and those account for nearly 75% of the total data). In theory, the time point that clusters separately could represent a biological difference, but in this case I'm pretty confident it's technical, based on a variety of other analyses. It's actually the last of the series, so nothing to do with the naive B cells, either.

Thanks!!

igrabski commented 1 year ago

Thanks, that's helpful, and I have a couple more questions. First, when you run scSHC without batch effects, do the other clusters appear well-mixed (within biological reason, given the different timepoints etc.) across batches besides the one clear batch effect, or do you suspect batch effects present throughout the clustering result? Second, when you are running scSHC with batch effect correction, are you specifying the batch label as the two samples, or as all the individual donors/timepoints?

The reason I'm asking is that the behavior you are seeing can happen if the batch labels overlap with biological signal, in which case trying to correct for the batch labels will "overcorrect" the data and give overly conservative results. Although the one suspicious cluster might truly be a batch effect, if there are other ways in which batch labels are consistent with real biology, then batch correction will combine too many clusters. For example, if naive B cells truly only exist in one batch and not the other, and we tell scSHC to correct for batches, it will assume that finding a batch-specific cluster (the naive B cells) is incorrect, and will not allow that split to occur.
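One rough way to check for this kind of overlap before choosing a batch label is to cross-tabulate the populations you expect against the candidate label. The sketch below uses base R only and made-up toy metadata: the `meta` data frame, its `sample` and `cell_type` columns, the cell counts, and the 0.95 cutoff are all assumptions for illustration, not values from this dataset or part of scSHC.

```r
# Toy metadata; every name and number here is hypothetical.
set.seed(1)
meta <- data.frame(
  sample = rep(c("PBMC", "Bcell"), times = c(300, 700)),
  cell_type = c(
    sample(c("monocyte", "NK", "memory_B"), 300, replace = TRUE),
    sample(c("naive_B", "memory_B", "plasmablast"), 700, replace = TRUE,
           prob = c(0.75, 0.20, 0.05))
  )
)

# Cross-tabulate putative populations against the candidate batch label.
tab <- table(meta$cell_type, meta$sample)
print(tab)

# Fraction of each population's cells that come from its dominant batch.
dominant_fraction <- apply(tab, 1, function(x) max(x) / sum(x))

# Populations found (almost) entirely in one batch overlap with the
# batch label. The 0.95 cutoff is an arbitrary illustrative choice.
confounded <- names(dominant_fraction)[dominant_fraction > 0.95]
print(confounded)
```

Populations flagged this way are the ones a batch-corrected run would treat as batch-specific and refuse to split off, as described above. In practice the `cell_type` column could be replaced by the cluster labels from an uncorrected run or by marker-based annotations.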