hoelzer-lab / rnaflow

A simple RNA-Seq differential gene expression pipeline using Nextflow
GNU General Public License v3.0

DESeq2 best practice #58

Closed hoelzer closed 4 years ago

hoelzer commented 4 years ago

We should check that DESeq2 is run with the best possible default settings.

For example:

In version 1.16 (November 2016), the log2 fold change shrinkage is no longer default for the DESeq and nbinomWaldTest functions, by setting the defaults of these to betaPrior=FALSE, and by introducing a separate function lfcShrink, which performs log2 fold change shrinkage for visualization and ranking of genes. While for the majority of bulk RNA-seq experiments, the LFC shrinkage did not affect statistical testing, DESeq2 has become used as an inference engine by a wider community, and certain sequencing datasets show better performance with the testing separated from the use of the LFC prior. Also, the separation of LFC shrinkage to a separate function lfcShrink allows for easier methods development of alternative effect size estimators.

http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#standard-workflow

We might also want to use the lfcShrink function.

MarieLataretu commented 4 years ago

I found two things: transformation and log2 fold change shrinkage.

Currently we do this:

rld <- rlog(dds)
vsd <- varianceStabilizingTransformation(dds)

(https://github.com/hoelzer-lab/rnaseq/blob/master/bin/deseq2.R#L465-L466) For varianceStabilizingTransformation we can use the wrapper vst. We might also want to use blind=FALSE.

However, blind dispersion estimation is not the appropriate choice if one expects that many or the majority of genes (rows) will have large differences in counts which are explainable by the experimental design, and one wishes to transform the data for downstream analysis. In this case, using blind dispersion estimation will lead to large estimates of dispersion, as it attributes differences due to experimental design as unwanted noise, and will result in overly shrinking the transformed values towards each other. By setting blind to FALSE, the dispersions already estimated will be used to perform transformations, or if not present, they will be estimated using the current design formula. Note that only the fitted dispersion estimates from mean-dispersion trend line are used in the transformation (the global dependence of dispersion on mean for the entire experiment). So setting blind to FALSE is still for the most part not using the information about which samples were in which experimental group in applying the transformation.

(see http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#blind-dispersion-estimation)


Michael Love recommends lfcShrink also for bulk experiments. (https://support.bioconductor.org/p/95695/) At some point we need this:

dds <- DESeq(dds)
resultsNames(dds)             # lists the coefficient names; coef=2 refers to the second entry
res <- lfcShrink(dds, coef=2) # shrunken log2 fold changes for visualization and ranking

There are three options for type: apeglm, ashr, and normal. Here is a comparison: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#moreshrink

It sounds to me like apeglm is the fastest. However, apeglm cannot handle contrast directly. The vignette describes an equivalent. There is also a wrapper: https://rdrr.io/github/acidgenomics/DESeqAnalysis/man/apeglmResults.html
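
A rough sketch of the coef-based equivalent described in the vignette (the factor level names "A" and "B" are hypothetical placeholders): relevel so the comparison of interest shows up as a named coefficient, re-run the Wald test, and then shrink by coef.

# set the reference level of the comparison of interest, then re-fit the Wald test
dds$condition <- relevel(dds$condition, ref="A")
dds <- nbinomWaldTest(dds)
resultsNames(dds)                                       # e.g. "condition_B_vs_A"
res <- lfcShrink(dds, coef="condition_B_vs_A", type="apeglm")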

I'm not sure what the right point for the shrinkage is: at the pairwise comparisons or before?

MarieLataretu commented 4 years ago

Ah, and one could create the DESeqDataSet with DESeqDataSetFromMatrix (instead of DESeqDataSetFromHTSeqCount) more directly from the featureCounts output, if all samples are merged into one count table beforehand. Having one big table as input would also be nice for the TPM filter, but I see that as low priority right now.
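
A minimal sketch of what that could look like (count.table and col.data are hypothetical names for the merged featureCounts matrix and the sample sheet):

# count.table: genes x samples matrix of raw counts from featureCounts
# col.data: data.frame with one row per sample and a 'condition' column
dds <- DESeqDataSetFromMatrix(countData=count.table,
                              colData=col.data,
                              design=~condition)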

hoelzer commented 4 years ago

Okay

transformation

For me, this sounds like

vsd <- vst(dds, blind=FALSE)
rld <- rlog(dds, blind=FALSE)

should be a reasonable default? What is still not entirely clear to me: do we then want to continue with the vsd or the rld table of transformed values for all the other steps? I think in many cases this is more of a visualization question, because we use the transformed tables in cases such as

plotPCA(rld, intgroup=c("design"))

plot.ma(out.sub, deseq2.res, ma.size, vsd.sub)

plot.heat.countmatrix(out.sub, dds.sub, vsd.sub, col.labels.sub, 50)
plot.heat.fc(out.sub, deseq2.res, resFold, dds.sub, vsd.sub, col.labels.sub, 50)
plot.sample2sample(out.sub, dds.sub, vsd.sub, col.labels.sub)

Maybe using vsd consistently as the default is a good idea? I checked some threads, but it seems to be difficult to define a default (https://support.bioconductor.org/p/98544/). It's more a matter of looking at the PCA and then deciding, because the better choice seems to depend on the number of samples and their overall differences.
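
If we want to sanity-check that choice on a given dataset, a small sketch (keeping the intgroup from the existing plotPCA call) would be to plot both transformations side by side and compare:

vsd <- vst(dds, blind=FALSE)
rld <- rlog(dds, blind=FALSE)
plotPCA(vsd, intgroup=c("design"))   # PCA on variance-stabilized counts
plotPCA(rld, intgroup=c("design"))   # PCA on rlog-transformed counts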

lfcShrink

Ok, we should definitely do this. I also read

You can get the shrunken LFC either with lfcShrink like above or with betaPrior=TRUE

I think this betaPrior=TRUE in the

dds <- DESeq(ddsHTSeq, betaPrior = TRUE)

call was also something we needed to decide. But to me it seems like we should follow the default of the DESeq() function, betaPrior = FALSE, and then use the new lfcShrink() function.

I think in older versions of DESeq2 the shrinkage was done directly during the DESeq() step, so I suggest we use the lfcShrink() function directly after the DESeq() call:

dds <- DESeq(dds)
res <- lfcShrink(dds, coef=2)

and then use the wrapper function again when extracting the pairwise comparisons (see below).

And I agree, apeglm seems to be the most comprehensive and up-to-date type for the lfc shrinkage.

However, as you said, we then cannot extract the pairwise comparisons from the object using contrast, like here: https://github.com/hoelzer-lab/rnaseq/blob/master/bin/deseq2.R#L595

deseq2.res <- results(dds, contrast=c("condition",l2,l1))

But then we can simply use the wrapper, right?

deseq2.res <- apeglmResults(dds, contrast = c("condition",l2,l1))

DESeqDataSetFromMatrix

Yes, I totally agree that we can simplify the whole workflow by writing all counts into one table and then using this function instead of DESeqDataSetFromHTSeqCount. But one problem could be to always ensure that we assign the correct condition, level, col.label, ... I always felt that I have more control over this when using a single featureCounts file per sample. But I am totally happy with changing this, because I also see the advantages (such as calling featureCounts once on all BAMs, ...)
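
One way to guard against mixed-up assignments (a small sketch; count.table and col.data are the hypothetical merged count matrix and sample sheet from above) would be an explicit check that the sample order matches before building the DESeqDataSet:

# stop early if the columns of the merged count table and the rows of the
# sample sheet are not in the same order
stopifnot(all(colnames(count.table) == rownames(col.data)))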

MarieLataretu commented 4 years ago

transformation

I'm good with vsd as the default! I'd change that in the plotting functions as well.

MarieLataretu commented 4 years ago

done with #87