mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

Treatment vs control and DeSeq2 alternative #24

Closed GiuliaPasquesi closed 5 years ago

GiuliaPasquesi commented 6 years ago

I am trying to perform a comparative analysis very similar to the one performed for Fig.5 in the TETranscript associated paper. I would imagine that the Testis sample was set as treatment for DeSeq, and then for the integration of the somatic tissues each was assigned as control (e.g, -c Liver.bam Kidney.bam ...). Am I correct?

Additionally, since DeSeq2 is now recommended over DeSeq for DE analyses, could you provide an equivalent normalization script (for germline vs somatic tissues with no replicates) in DeSeq2? Or does it have to be DeSeq necessarily?

Thank you so much,

Giulia

olivertam commented 6 years ago

Hi Giulia,

Yes, if you wish to compare testis vs all other tissues, then you would assign the other somatic tissues as control: -t testis.bam -c liver.bam kidney.bam ...

With regards to DESeq2, we are currently adding that functionality into TEtranscripts. As a temporary work-around, you can run TEtranscripts to generate the count table (XXXX.cntTable), and then load the resulting count table into DESeq2 for analysis (and use their internal normalization methodology).

Please note DESeq2's warning about experiments without replicates:

Experiments without replicates do not allow for estimation of the dispersion of counts around the expected value for each group, which is critical for differential expression analysis. If an experimental design is supplied which does not contain the necessary degrees of freedom for differential analysis, DESeq will provide a warning to the user and follow the strategy outlined in Anders and Huber (2010) under the section ’Working without replicates’, wherein all the samples are considered as replicates of a single group for the estimation of dispersion. As noted in the reference above: "Some overestimation of the variance may be expected, which will make that approach conservative." Furthermore, "while one may not want to draw strong conclusions from such an analysis, it may still be useful for exploration and hypothesis generation." We provide this approach for data exploration only, but for accurately identifying differential expression, biological replicates are required.

Thanks. Cheers, Oliver

GiuliaPasquesi commented 6 years ago

Hello Oliver, Thank you so much for the clarification about the controls. That's how I set up the analyses in the first place, but before going into the normalization itself I wanted to make sure I was on the right path. As for the normalization in DeSeq2 I read a lot of posts and that section. A friend of mine run TEtoolkit and then used the .cntTable for DeSeq2, but he had replicates. This is the backbone of his script:

GeneCountData <- read.delim("xxx.cntTable", header = TRUE, sep = "\t", row.names = 1) GeneCountDataInt <- as.matrix(GeneCountData) storage.mode(GeneCountDataInt) = "integer" #Although they already are, being the raw counts from TEtoolkit Design <- read.delim("coldata.txt", header = TRUE, sep = "\t", row.names = 1) library("DESeq2") gdds <- DESeqDataSetFromMatrix(countData = GeneCountDataInt, colData = Design, design = ~ xxx)

gdds_de <- DESeq(gdds) rld <- rlog(gdds) res <- results(gdds_de) summary(res)

I was also considering the lfcShrink function to visualize/rank the TEs.

Do you think this would be appropriate? (Sorry for all the questions, I'm not exactly a RNAseq person yet, still learning, and TEs are not an easy set to start with).

Thank you again, Giulia

olivertam commented 6 years ago

Hi Giulia,

One approach is to do variance-stabilizing transformation to generate the normalized/variance-stabilized count matrix. You can then generate heatmaps using the transformed count data, and order them based on l2fc or FDR as desired.

Thanks. Cheers, Oliver

GiuliaPasquesi commented 6 years ago

Hello Oliver, I have another quick question. The .cntTable contains information for both genes and TEs. In principle, one should give the whole table for normalization purposes, however all variance plots (vst, rlog, lfcShrink) are going to be affected by both TE and gene counts, and in a heatmap it is going to be even worst. What would you suggest? Filtering a priori from the cntTable all the genes and pass to DeSeq only the TEs doesn't seem ideal either.

Thank you so much,

Giulia


From: Oliver Tam notifications@github.com Sent: Thursday, May 24, 2018 1:49:26 PM To: mhammell-laboratory/tetoolkit Cc: Pasquesi, Giulia Irene Maria; Author Subject: Re: [mhammell-laboratory/tetoolkit] Treatment vs control and DeSeq2 alternative (#24)

Hi Giulia,

One approach is to do variance-stabilizing transformationhttps://rdrr.io/bioc/DESeq2/man/varianceStabilizingTransformation.html to generate the normalized/variance-stabilized count matrix. You can then generate heatmaps using the transformed count data, and order them based on l2fc or FDR as desired.

Thanks. Cheers, Oliver

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/mhammell-laboratory/tetoolkit/issues/24#issuecomment-391820666, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ALdn2BChSPb6Bxd_LwCbZvpCeMcOI6Jbks5t1wC2gaJpZM4UMlQD.

olivertam commented 6 years ago

Hi Giulia,

We have been using the whole table for normalization and differential expression, and accept that TE and genes should both factor into the variance calculation. For the purposes of plotting, we have sometimes separated TE and genes after normalization, variant stabilization and differential analysis, and plot them separately if we suspect that there are huge differences in their behavior. In many cases, we are interested in differential expression, so we tend to scale our heatmaps by Z-score across samples (and thus scale each gene/TE independently for the purposes of visualization). Please let me know if I failed to answer your question. Thanks.

Cheers, Oliver

GiuliaPasquesi commented 6 years ago

Hello Oliver,

Thank you again for the clarifications. I indeed always normalize the full set, but when plotting the variance-stabilized SD ( "meanSdPlot(assay(vsd))" ) or the log one ( "meanSdPlot(assay(rld))" ) the SD are huge, and in the PCA plot are some somatic tissues that cluster differentially. However, when I generate a heat map of TE only (after calling DeSeq and exporting the counts) the somatic tissues all look pretty similar. In the end, the plots are visually appealing but not necessary, so it's not a huge deal.

Thank you so much again,

Bests

Giulia


From: Oliver Tam notifications@github.com Sent: Wednesday, June 6, 2018 10:32:59 PM To: mhammell-laboratory/tetoolkit Cc: Pasquesi, Giulia Irene Maria; Author Subject: Re: [mhammell-laboratory/tetoolkit] Treatment vs control and DeSeq2 alternative (#24)

Hi Giulia,

We have been using the whole table for normalization and differential expression, and accept that TE and genes should both factor into the variance calculation. For the purposes of plotting, we have sometimes separated TE and genes after normalization, variant stabilization and differential analysis, and plot them separately if we suspect that there are huge differences in their behavior. In many cases, we are interested in differential expression, so we tend to scale our heatmaps by Z-score across samples (and thus scale each gene/TE independently for the purposes of visualization). Please let me know if I failed to answer your question. Thanks.

Cheers, Oliver

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/mhammell-laboratory/tetoolkit/issues/24#issuecomment-395281712, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ALdn2BADtNaJrAuZ2k262erHy2jOrkTUks5t6J7rgaJpZM4UMlQD.