mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

DESeq Normalization Questions #40

Closed jimcdonald closed 5 years ago

jimcdonald commented 5 years ago

I'm attempting to apply a minimum normalized read count filter to my TEtranscripts data. After looking at the R code output by TEtranscripts and at the code on github, I have three questions:

  1. The output for both DESeq and TCnorm is labelled baseMean. The values are different between the two, and from reading the code, the normalization is RPM for TCnorm. To confirm, the values labelled baseMean are the RPM normalized read counts and not some further modified value from DESeq, correct?

  2. In the code here on github, the argument parser code (line 63) notes: "Use DESeq (instead of DESeq2) for differential analysis." Why is it recommended to use the older version when doing a differential analysis? And why is DESeq2 used later on line 620 to do the differential analysis? I'm guessing the help text is just out of date, but I wanted to check.

  3. The code was updated on June 28, 2018. Given the issue in 2 above, I am guessing that one major feature that was added was to move to DESeq2. Does using DESeq2 drastically affect the performance such that I need to upgrade (we installed around January 2018)?

Thank you! James

olivertam commented 5 years ago

Hi James,

Thank you for your interest in TEtranscripts, and I will be happy to try and answer your questions.

  1. The "baseMean" output from DESeq (and DESeq2) is not RPM normalized read counts, but are generated "using the "median ratio method" described by Equation 5 in Anders and Huber (2010)." (see here for more details). Thus, although the headers are the same, the values generated from the two methods might not be the same.
  2. The addition of a "use DESeq" option was for backward compatibility. As you correctly guessed, we added DESeq2 in the June 2018 update due to requests from other users, but we wanted to keep the possibility to use DESeq for those who might need it.
  3. There are studies (such as this paper) that have compared DESeq2 to DESeq, and found it to be an improvement. A discussion thread involving the authors of DESeq tools have also pointed out that DESeq2 has an improved approach to handle type-I error. Other changes can be found here. The general observation is that you get more differentially expressed genes/TE with DESeq2, which the authors attribute to DESeq being overly conservative. You can find more information in the DESeq2 paper Unfortunately, we have not included the alternative normalization techniques (e.g. TC) that was implemented for DESeq, as we have found that the normalization performed by DESeq (and DESeq2) have produced most consistent results. If you are interested in using RPM values for normalization with DESeq2, I would be happy to help you with it.

Thanks.

jimcdonald commented 5 years ago

Thank you for offering to help with RPM normalization with DESeq2. Right now I've run everything with DESeq from the older version, so I may stick with that. Also, I think I prefer the DESeq median ratio method for normalization.

I'm interested in the TCnorm RPM values mostly just for filtering out DE genes with low read counts and could just be noise. I've read that filtering the low expressed genes or weighting the DE calls can improve DEG calling. If I may make an attempt to draw on your experience, have you experimented with this using TEtranscripts? Or does putting a threshold remove too many TEs? Do you have any ideas about a good threshold? I have a bit more of a wet-lab/molecular biology background so the idea of a TPM of 5 or so being about 1 read/cell makes sense to me. RPM seemed the closest I could get to that using the TEtranscripts output.

Thank you again for your assistance! James

olivertam commented 5 years ago

Hi James,

Unfortunately, it's currently not trivial to work with TPM for TEtranscripts, as we would need to treat each insertion as an independent transcript. Although this is possible for genic cases where there are less than 10 transcripts, it would become quite difficult for TE with 100s (or more) insertions (each of which could have variable lengths due to imprecise excision/insertion). However, it is something that we are thinking about, and would love to be able to resolve.

To answer your question, we have often found that most TE expression is typically above the thresholds that people use for genic analyses, and so we don't tend to lose too many of them. The DESeq2 vignette suggests at least 10 reads total (which is pretty lenient), and the author argues that using the independent filtering in the results() function is more useful than pre-filtering in terms of maximizing DEG calls (see that discussion here). Unfortunately, that function is only available in DESeq2.

One thing that I would like to highlight is that re-analysis with DESeq2 would not require you run TEtranscripts again. All you would need to do is to modify the R script file (XXXX.R) that was generated by the previous runs to utilize DESeq2, and then run the R script file using Rscript and the previously computed count tables (the generation of which is the time-consuming step of TEtranscripts). If you are interested, I'll be happy to help you modify the R script files to make use of DESeq2.

Thanks.

jimcdonald commented 5 years ago

Thanks for pointing out those sections of the DESeq2 vignette. I hadn't found those yet. Others advising me have pointed out the same problem you do concerning TPM. Considering that DESeq2 does some filtering anyway on the average counts, and it is easy to run DESeq2 on the count tables, I would like to do that.

Would the new script look something like the following (changes in bold)?

data <- read.table("RNAseq.A2780.library001.treat2.ITF/RNAseq.A2780.library001.treat2.ITF.DESeq..cntTable",header=T,row.names=1) groups <- factor(c(rep("TGroup",1),rep("CGroup",1))) min_read <- 1 data <- data[apply(data,1,function(x){max(x)}) > min_read,] library(DESeq2, quietly=T) cds <- newCountDataSet(data,groups) keep <- rowSums(counts(cds)) >= 10 cds <- cds[keep,] cds <- estimateSizeFactors(cds) cds <- estimateDispersions(cds,method="blind",sharingMode="fit-only",fitType="local") res <- nbinomTest(cds,"CGroup","TGroup") res_fc <- res write.table(res_fc, file="RNAseq.A2780.library001.treat2.ITF/RNAseq.A2780.library001.treat2.ITF.DESeq._gene_TE_analysis.txt", sep="\t",quote=F,row.names=F) resSig <- res_fc[(!is.na(res_fc$padj) & (res_fc$padj < 0.050000) & (abs(res_fc$log2FoldChange)> 0.000000)), ] write.table(resSig, file="RNAseq.A2780.library001.treat2.ITF/RNAseq.A2780.library001.treat2.ITF.DESeq._sigdiff_gene_TE.txt",sep="\t", quote=F, row.names=F)

Thanks! James

olivertam commented 5 years ago

Hi James,

One thing that I like about DESeq2 is that the commands are much shorter. Just a sidenote (based on your code example): DESeq2 does not like experiments with only one replicate, and will actually stop running in the latest versions (Version 1.20.0 or earlier will still handle no-replicate experiments).

I've added comments around lines where code is changed/added. Please let me know if anything is unclear

data <- read.table("RNAseq.A2780.library001.treat2.ITF/RNAseq.A2780.library001.treat2.ITF.DESeq..cntTable",header=T,row.names=1)
groups <- factor(c(rep("TGroup",1),rep("CGroup",1)))
min_read <- 1
data <- data[apply(data,1,function(x){max(x)}) > min_read,]

### code changes ###
## Pre-filtering
data <- data[rowSums(data) >=10,]

## DESeq2 section
sampleInfo <- data.frame(groups,row.names=colnames(data))
library(DESeq2, quietly=T)
dds <- DESeqDataSetFromMatrix(countData = data, colData = sampleInfo, design = ~ groups)
dds$groups <- relevel(dds$groups, ref="CGroup")
dds <- DESeq(dds,betaPrior=TRUE)
res <- results(dds,independentFiltering=TRUE)
write.table(res, file="RNAseq.A2780.library001.treat2.ITF/RNAseq.A2780.library001.treat2.ITF.DESeq._gene_TE_analysis.txt", sep="\t",quote=F,row.names=F)
### End of code changes ###

resSig <- res_fc[(!is.na(res_fc$padj) & (res_fc$padj < 0.050000) & (abs(res_fc$log2FoldChange)> 0.000000)), ]
write.table(resSig, file="RNAseq.A2780.library001.treat2.ITF/RNAseq.A2780.library001.treat2.ITF.DESeq._sigdiff_gene_TE.txt",sep="\t", quote=F, row.names=F)

If you want to implement the "independent filtering" that the authors of DESeq2 prefer over pre-filtering, I would replace the end of the above code (part starting with resSig) with the following:

resSig <- results(dds,alpha=0.05) # This is telling DESeq2 to first do independent filtering, and then obtain genes/TE that now meets the FDR value of 0.05
resSig <- resSig[abs(resSig$log2FoldChange) > 0,] # This is only important if you want to filter the results by log2 fold change, which your example does not require)
write.table(resSig, file="RNAseq.A2780.library001.treat2.ITF/RNAseq.A2780.library001.treat2.ITF.DESeq._sigdiff_gene_TE.txt",sep="\t", quote=F, row.names=F)

Please let me know if anything is confusing, or if you encounter issues when running the code. Thanks.

jimcdonald commented 5 years ago

Thanks! I got it to run, and I'm exploring the data. I think that filtering strategy both accomplishes my original goals and is probably the best I can get for TEtranscripts. Thanks again for your help!