Importing DGE data into EnrichmentBrowser

grabearummc commented 4 years ago

Hi guys. I'm starting to work with EnrichmentBrowser, and I'm running into some issues. I'm looking at porting in DGE data from different sources (DESeq2, limma, and edgeR). My biggest problem is getting these objects into the proper format (SummarizedExperiment).

After exploring this repo some more, I noticed your changelog has a bullet for new import functions for deseq/limma/edgeR. This would be super helpful, so I don't have to write my own functions. So, how stable are they right now? Is the master branch "ok" to use in its current form?

lgeistlinger commented 4 years ago

Thanks for contacting us. I'd say you are good to go with the master branch, and you'll find documentation and working examples when using ?import after installation. See also the documentation of the import function here: https://bioconductor.org/packages/devel/bioc/manuals/EnrichmentBrowser/man/EnrichmentBrowser.pdf

I conducted a number of tests that were looking good, but you would be def among the first users, so I'd be interested in how this works for you and whether we need to extend the functionality further.

grabearummc commented 4 years ago

Alright, thank you that's great! I'll use this issue as a forum to discuss.

grabearummc commented 4 years ago

Consider adding in functionality for EnrichmentBrowser::idMap so that it automatically validates/converts ENSEMBL ids from id.version to id (e.g. ENSG00000002919.14 to ENSG00000002919). Try to conserve id.version by adding another column to rowData. This is really more of an issue with AnnotationDBI, but it couldn't hurt.

gsub("\\..*", "", row.names(ens_table))

lgeistlinger commented 4 years ago

let's make this another issue ("ENSEMBL ID version conversion"), and let's use the current issue for focusing on functionality of EnrichmentBrowser::import -- renaming the issue for clarity

vivek-verma202 commented 4 years ago

@lgeistlinger , thank-you for the update! (esp. import).

# I used apeglm for the res:
res <- lfcShrink(dds, coef="FM_1_vs_0", type="apeglm")
se <- import(dds, res, from = "DESeq2")
Error in .importFromDESeq2(obj, res) : 
  all(rnames %in% colnames(res)) is not TRUE
colnames(res)
[1] "baseMean"       "log2FoldChange" "lfcSE"         
[4] "pvalue"         "padj"  

# repeated without "apeglm"
res <- results(dds)
se <- import(dds, res, from = "DESeq2")
Error in .importFromDESeq2(obj, res) : 
Supported experimental designs include binary group comparisons 
with an optional blocking variable for paired samples / sample batches

design(dds)
~ age + batch + condition

I have 4 questions for you:

Is LFC shrinkage needed / recommended with EnrichmentBrowser?
I have to use age (scaled, continuous) and batch (binary) as covariates to analyze for my condition of interest (binary)? How can I use this information with EnrichmentBrowser to avoid any false positives?
To circumvent covariate problem, I was thinking of using ranked gene list with pi scores res$pi <- res$log2FoldChange*(-log(res$pvalue)), can I do this for downstream topology-based methods in EnrichmentBrowser, if yes, how?
Could you suggest a better way to rank / score genes? I am not sure how should the tie broken in case of non-unique scores.

lgeistlinger commented 4 years ago

Thanks for testing + raising these issues @vivek-verma202! I am currently sprinting towards a deadline for a grant proposal on Sunday, but I am looking into this + will get back to you on Monday.

vivek-verma202 commented 4 years ago

Thanks, good-luck with your grant application!

grabearummc commented 4 years ago

Alright here is an update for you @lgeistlinger.

I've been using import and I have a few suggestions. For importing limma data I think there needs to be a way for using data from the limma-trend approach. Check page 72 on the limma user guide.

Instead of using voom: https://github.com/lgeistlinger/EnrichmentBrowser/blob/4357b8004cdcd70093ec69b5b1448bd019fad77c/R/import.R#L120-L132

You would use the limma-trend method:

#'   # (5) import from trend/limma (RNA-seq count data)
#'   # (5a) create the expression data object
#'   library(limma)
#'   keep <- filterByExpr(counts, rdesign)
#'   
> logCPM <- edgeR::cpm(counts[keep,], log=TRUE, prior.count=3)
#'   # (5b) obtain differential expression results 
> fit <- lmFit(logCPM, design)
> fit <- eBayes(fit, trend=TRUE)
> res <- topTable(fit, coef=ncol(design))

#'
#'   # (5c) import???
#'   se <- import(el, res)

vivek-verma202 commented 4 years ago

Hi @lgeistlinger, hope your grant application went well.

I was wondering if you had a chance to look into the "limitations" of current import()? As far as my question 1 is concerned, some reading made me realize that LFC shrinkage (apeglm) is not necessary esp. if I have already filtered out poorly expressed genes. Hence, was planning to ditch it (unless you've something to add to it).

Hope to hear from you, soon.

lgeistlinger commented 4 years ago

Hi @vivek-verma202,

Thanks for your patience.

The application has been a sprint involving daily 12 h shifts for the last 7 days straight - but I am pretty happy with the product. Thanks for asking.

Concerning your questions:

lfcShrinkage: According to the DESeq2 vignette only relevant for visualization and ranking of genes when having not filtered on not sufficiently expressed genes first. In the EnrichmentBrowser, I typically use edgeR::filterByExpr (but you can easily apply customized thresholds as well) for that as the very first step of processing of the count matrix, ie before normalization, DE analysis, and GS analysis. Thus: if you've filtered before - yes, shrinkage is not needed. That said, I think the error that you encounter in your first post is mainly a result of your result object (res) not having the same number of rows (genes) as your dds. I suspect rows/genes been removed by lfcShrink?
covariates: design ~ batch + condition should work right out of the box (can you confirm?). Supported experimental designs currently only include binary group comparisons with an optional blocking variable for paired samples / sample batches. Adding additional covariates is currently not supported and I would take this as a feature request and will invest some time to implement that. Note, however that only regression GS methods such as camera and roast support extended experimental designs and are able to make use of this information. Something that you could do right away:

design ~ age + condition if you would form here a number of discrete age groups.

(inspecting resulting DE measures against design ~ condition only would btw give you an indication whether there are indeed age-specific effects).

Downstream topology-based methods:

If you are interested in those, or in general methods that only work on the DE
results, such as eg also ora, then you can actually ignore my remarks on experimental design above and

i) make use of the sig.stat argument that can be supplied to EnrichmentBrowser::sbea and EnrichmentBrowser::nbea.

• beta: Log2 fold change significance level. Defaults to 1 (2-fold).

• sig.stat: decides which statistic is used for determining significant DE genes. Options are:

        • 'p' (Default): genes with adjusted p-value below
              alpha.

        • 'fc': genes with abs(log2(fold change)) above beta

        • '&': p & fc (logical AND)

        • '|': p | fc (logical OR)

        • 'xxp': top xx % of genes sorted by adjusted p-value

        • 'xxfc' top xx % of genes sorted by absolute log2 fold
                 change.

Example: if you wanted to conduct a SPIA analysis with genes rendered differentially expressed (DE genes) that have an absolute log2 fold change > 1 and an adjusted p-value < 0.1, you would call:

spia.res <- nbea("spia", se, gs, grn, alpha = 0.1, beta = 1, sig.stat = '&')

ii) If not otherwise specified (eg via the sig.stat argument), the default point of reference for the EnrichmentBrowser functions when it comes to gene DE measures is the ADJ.PVAL column in the rowData slot of your SummarizedExperiment (ie. your dds after using import).

   se <- import(dds, res)

You can always override / abuse this column by eg setting it

   rowData(se)[my.DE.genes.of.interest, "ADJ.PVAL"] <- 0.04
   rowData(se)[notDE.notInteresting.genes, "ADJ.PVAL"] <- 0.06

as per default, sbea and nbea methods that work only on a list of DE genes will take genes with ADJ.PVAL < alpha (default alpha=0.05).

Note the: xxp or xxfc options of the sig.stat argument above are an alternative to that, allowing to select a certain percentage of genes that you find interesting based on log2 fold change or p-value (and can also be accordingly tweaked as above).

When it comes to ranking, the raw p-value typically has better granularity than the adjusted p-value. Accordingly:
```
res <- DESeq2::results(dds)
sorting.df <- res[,c("pvalue","log2FoldChange")]
sorting.df[,2] <- -abs(sorting.df[,2])
ind <- do.call(order, as.data.frame(sorting.df)))
```
would give you a very fine-grained order by nominal p-value, and for genes with equal nominal p-value it would sort by absolute fold change in decreasing order. You could eg use that ordering after importing via:
```
se <- se[ind,] 
```
(if that still doesn't result in a satisfying ranking, you could incorporate additional measures such as the testing statistic by modifying the above line to:
```
sorting.df <- res[,c("pvalue","log2FoldChange", "stat")]
```

Hope that helps and don't hesitate to inquire further if some things are unclear.

vivek-verma202 commented 4 years ago

Thank you so much, @lgeistlinger ! Here are the updates:

design = ~ batch + condition worked! (Results were not very different with or without age, thanks to you, age is no more covariate of interest in my analysis.)
I could not proceed without normalization:
```
res <- sbea(method = "gsea", se = se, gs = go.bp.gs, browse = T)
Error in .reorderAssays(se, assay) : 
Expression dataset (se) does not contain an assay named "norm"
```
Normalization worked but excluded ~91 % of genes:
```
se <- normalize(se, norm.method = "vst")
Excluding 8118 genes not satisfying min.cpm threshold
```
The results were from DESeq2 and not from raw read counts. Considering the vignette, I am not sure if normalization is needed. Also, DESeq2 output already has normalization factors:

How can I make use of DESeq2's normalization factors to normalize the count slot and create "norm" slot in an SE? (I think, it's an important consideration for the import() function.)

vivek-verma202 commented 4 years ago

I came with a work around, not sure if it's correct, please let me know opinion:

se@assays@data@listData[["norm"]] <- se@assays@data@listData[["counts"]]/se@assays@data@listData[["normalizationFactors"]]

Looks that counts were normalized: hist(log(se@assays@data@listData[["counts"]])) hist(log(se@assays@data@listData[["counts"]]))

lgeistlinger commented 4 years ago

Hi @vivek-verma202 -

se <- normalize(se, norm.method = "vst")

is the right thing todo here. I'll add an argument to normalize to switch off filtering.

There are some helpful accessor functions such as assay, colData, and rowData for a SummarizedExperiment. See also the graphical overview of a SummarizedExperiment and how to access its main parts here.

A fast approximation of what EnrichmentBrowser::normalize() with method = "vst" does:

assay(se, "norm") <- edgeR::cpm(assay(se, "counts"))

But I agree, that's something that import should support. I'll make a dedicted issue.

lgeistlinger commented 4 years ago

Closing this and will continue working on the specific issues derived from the discussion here (#28 and #29). Feel free to reopen + thanks a lot for testing and feedback.

lgeistlinger / EnrichmentBrowser

Importing DGE data into EnrichmentBrowser #23