hbctraining / scRNA-seq_online

https://hbctraining.github.io/scRNA-seq_online/.

Confused about using "sum" for pseudobulk DESeq2 #74

Closed by Famingzhao 2 years ago

Famingzhao commented 2 years ago

Hi, thanks for your online lessons. In the pseudobulk_DESeq2_scrnaseq.md lesson, I noticed that the "sum" function is used to aggregate counts across cluster-sample groups.

I am confused about this step. Why not use the "mean" function? The number of cells per sample can be very unbalanced for sampling or experimental reasons. For example, sample A may have 2000/5000 CD8 T cells, while sample B may have 1000/2500 CD8 T cells. Would the "sum" function then inflate the differences between pseudobulk groups?

# Aggregate across cluster-sample groups
pb <- aggregate.Matrix(t(counts(sce)), 
                       groupings = groups, fun = "sum") 

mistrm82 commented 2 years ago

Hi @Famingzhao,

We followed this tutorial (http://biocworkshops2019.bioconductor.org.s3-website-us-east-1.amazonaws.com/page/muscWorkshop__vignette/; see the code under the "Aggregation of single-cell to pseudo-bulk data" section) and used the sum function as the tutorial did. However, you could also use other summary statistics, such as the mean or median.

There are studies that explore the difference in pseudobulk performance when using mean versus sum, and generally results have been comparable. We encourage you to peruse the literature and identify what would work best for you.
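
To make the sum-versus-mean question concrete, here is a minimal toy sketch in base R. It is not code from the lesson: the object names (`counts`, `sample_id`, `pb_sum`, `pb_mean`) and the data are made up for illustration, `rowsum()` stands in for `aggregate.Matrix()`, and the hand-written median-of-ratios step is only a simplified stand-in for the size-factor normalization that DESeq2 performs. It assumes every cell has the same expression profile, so the only difference between samples is the number of cells.

```r
# Toy data (hypothetical): 3 genes, 6 cells with identical profiles.
# Sample A contributes 4 cells, sample B contributes 2 cells.
counts <- matrix(
  rep(c(10, 20, 5), times = 6),
  nrow = 3,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:6))
)
sample_id <- c("A", "A", "A", "A", "B", "B")

# Sum aggregation: rowsum() sums rows by group, so cells must be in rows.
pb_sum <- t(rowsum(t(counts), group = sample_id))  # genes x samples

# Mean aggregation: divide each sample's sums by its number of cells.
n_cells <- as.vector(table(sample_id)[colnames(pb_sum)])
pb_mean <- sweep(pb_sum, 2, n_cells, "/")

# Simplified median-of-ratios size factors computed on the summed counts.
log_geo_mean <- rowMeans(log(pb_sum))
size_factors <- apply(pb_sum, 2, function(x) exp(median(log(x) - log_geo_mean)))

# After dividing by the size factors, the two samples agree again.
norm_sum <- sweep(pb_sum, 2, size_factors, "/")
all.equal(norm_sum[, "A"], norm_sum[, "B"])  # TRUE for this toy data
```

In this toy case the summed counts for sample A are exactly twice those for sample B, but that difference is absorbed by the size factors, so the normalized values match. Real data will not be this clean, and this sketch does not show how the two choices affect dispersion estimates. Note also that DESeq2 expects non-negative integer counts, so averaged values would generally need rounding or a different modelling approach before being passed to it.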