mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
218 stars 29 forks source link

TEtranscripts and covariates values #34

Closed vasilislenis closed 5 years ago

vasilislenis commented 5 years ago

Hi Oliver,

My apologies for bothering you again but I was wondering if I could have your lights for one more time. I would like to run TEtranscripts on a dataset but also I would like to use a couple of covariates in the DESeq normalization phase that plays a significant role in my data (e.g. pH of the sample selection, age, percentage of uniquely aligned reads). Is it possible to do that with TEtranscripts? Is there any specific parameter that I can use as input that can parse these details?

Thank you very much in advance, Vasilis.

olivertam commented 5 years ago

HI Vasilis,

The simplest way to do this is to modify the DESeq2 file that is generated by TEtranscripts (after the initial run) to add your covariates. Example: If this is the DESeq2 file generated (testcase_DESeq2.R)

data <- read.table("testcase.cntTable",header=T,row.names=1)
groups <- factor(c(rep("treatment",4),rep("control",2)))
...
dds <- DESeqDataSetFromMatrix(countData = data, colData = sampleInfo, design = ~ groups)
...

You can add your covariate (e.g. age) to the DESeq2 dataset

data <- read.table("testcase.cntTable",header=T,row.names=1)
groups <- factor(c(rep("treatment",4),rep("control",2)))
age <- factor(c("above80","above80","below80","below80","above80","below80")
...
dds <- DESeqDataSetFromMatrix(countData = data, colData = sampleInfo, design = ~ age + groups)
...

I would also change the file names of the output, so you don't override the original results:

...
write.table(res, file="withCovariate_gene_TE_analysis.txt", sep="\t",quote=F)
...
write.table(resSig, file="withCovariate_sigdiff_gene_TE.txt",sep="\t", quote=F)

You can then run Rscript testcase_DESeq2.R, and assuming you didn't change the name of your input file, it should now run DESeq2 with your covariates and output two new files: withCovariate_gene_TE_analysis.txt and withCovariate_sigdiff_gene_TE.txt.

Please also check out the guide to using DESeq2 and the DESeq2 reference manual for more information on how to modify the DESeq2 commands.

Thanks

vasilislenis commented 5 years ago

Hi Oliver,

Thank you very much for your response.

It is a kind of relief that I don’t have to interfere in the intermediate steps :)

I will try to do it and let you know.

Thank you very much for your help, Vasilis.

On 24 Oct 2018, at 17:01, Oliver Tam notifications@github.com wrote:

HI Vasilis,

The simplest way to do this is to modify the DESeq2 file that is generated by TEtranscripts (after the initial run) to add your covariates. Example: If this is the DESeq2 file generated (testcase_DESeq2.R)

data <- read.table("testcase.cntTable",header=T,row.names=1) groups <- factor(c(rep("treatment",4),rep("control",2))) ... dds <- DESeqDataSetFromMatrix(countData = data, colData = sampleInfo, design = ~ groups) ... You can add your covariate (e.g. age) to the DESeq2 dataset

data <- read.table("testcase.cntTable",header=T,row.names=1) groups <- factor(c(rep("treatment",4),rep("control",2))) age <- factor(c("above80","above80","below80","below80","above80","below80") ... dds <- DESeqDataSetFromMatrix(countData = data, colData = sampleInfo, design = ~ age + groups) ... I would also change the file names of the output, so you don't override the original results:

... write.table(res, file="withCovariate_gene_TE_analysis.txt", sep="\t",quote=F) ... write.table(resSig, file="withCovariate_sigdiff_gene_TE.txt",sep="\t", quote=F) You can then run Rscript testcase_DESeq2.R, and assuming you didn't change the name of your input file, it should now run DESeq2 with your covariates and output two new files: withCovariate_gene_TE_analysis.txt and withCovariate_sigdiff_gene_TE.txt.

Please also check out the guide to using DESeq2 https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html and the DESeq2 reference manual https://bioconductor.org/packages/release/bioc/manuals/DESeq2/man/DESeq2.pdf for more information on how to modify the DESeq2 commands.

Thanks

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mhammell-laboratory/tetoolkit/issues/34#issuecomment-432718343, or mute the thread https://github.com/notifications/unsubscribe-auth/AGhlfxYjjCqWHweSGnvfiFa-E3BE7WRIks5uoI7cgaJpZM4X4Khb.

vasilislenis commented 5 years ago

Hi Oliver,

My sincere apologies for bothering you again, but I’m trying to adjust the script based on what you have proposed and I fount that it doesn’t call the “DESeqDataSetFromMatrix” function anywhere. Instead of it, it calls the “newCountDataSet”, “estimateSizeFactors”, “estimateDispersions”, and “nbinomTest”.

I have attached the script in case that could take a look at it, if you have the time.

Thank you very much again for your support and your time, Vasilis.

On 25 Oct 2018, at 13:04, Vasilis Lenis vasilislenis@gmail.com wrote:

Hi Oliver,

Thank you very much for your response.

It is a kind of relief that I don’t have to interfere in the intermediate steps :)

I will try to do it and let you know.

Thank you very much for your help, Vasilis.

On 24 Oct 2018, at 17:01, Oliver Tam <notifications@github.com mailto:notifications@github.com> wrote:

HI Vasilis,

The simplest way to do this is to modify the DESeq2 file that is generated by TEtranscripts (after the initial run) to add your covariates. Example: If this is the DESeq2 file generated (testcase_DESeq2.R)

data <- read.table("testcase.cntTable",header=T,row.names=1) groups <- factor(c(rep("treatment",4),rep("control",2))) ... dds <- DESeqDataSetFromMatrix(countData = data, colData = sampleInfo, design = ~ groups) ... You can add your covariate (e.g. age) to the DESeq2 dataset

data <- read.table("testcase.cntTable",header=T,row.names=1) groups <- factor(c(rep("treatment",4),rep("control",2))) age <- factor(c("above80","above80","below80","below80","above80","below80") ... dds <- DESeqDataSetFromMatrix(countData = data, colData = sampleInfo, design = ~ age + groups) ... I would also change the file names of the output, so you don't override the original results:

... write.table(res, file="withCovariate_gene_TE_analysis.txt", sep="\t",quote=F) ... write.table(resSig, file="withCovariate_sigdiff_gene_TE.txt",sep="\t", quote=F) You can then run Rscript testcase_DESeq2.R, and assuming you didn't change the name of your input file, it should now run DESeq2 with your covariates and output two new files: withCovariate_gene_TE_analysis.txt and withCovariate_sigdiff_gene_TE.txt.

Please also check out the guide to using DESeq2 https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html and the DESeq2 reference manual https://bioconductor.org/packages/release/bioc/manuals/DESeq2/man/DESeq2.pdf for more information on how to modify the DESeq2 commands.

Thanks

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mhammell-laboratory/tetoolkit/issues/34#issuecomment-432718343, or mute the thread https://github.com/notifications/unsubscribe-auth/AGhlfxYjjCqWHweSGnvfiFa-E3BE7WRIks5uoI7cgaJpZM4X4Khb.

olivertam commented 5 years ago

Hi Vasilis,

It appears that you're probably running an older version of TEtranscripts, which was using DESeq (not DESeq2). The current version now uses DESeq2 by default (but still retains compatibility to DESeq).

However, you should not need to redo the intermediate steps regardless. Here is a DESeq2 script that should work directly on the count table file that was generated (XXXX.cntTable).

data <- read.table("testFile.cntTable",header=T,row.names=1)
groups <- factor(c(rep("TGroup",2),rep("CGroup",2)))
age <= factor(c("old","young","old","young"))
min_read <- 1
data <- data[apply(data,1,function(x){max(x)}) > min_read,]
sampleInfo <- data.frame(groups,row.names=colnames(data))
library(DESeq2, quietly=T)
dds <- DESeqDataSetFromMatrix(countData = data, colData = sampleInfo, design = ~ age + groups)
dds$condition = relevel(dds$groups,"CGroup")
dds <- DESeq(dds)
res <- results(dds,independentFiltering=F)
write.table(res, file="testCase_gene_TE_analysis.txt", sep="\t",quote=F)
resSig <- res[(!is.na(res$padj) & (res$padj < 0.050000) & (abs(res$log2FoldChange)> 0.000000)), ]
write.table(resSig, file="testCase_sigdiff_gene_TE.txt",sep="\t", quote=F)

Please make sure that you have DESeq2 installed (see here for installation instructions) Let me know if you still have issues. Thanks

vasilislenis commented 5 years ago

I have tried it but unfortunately doesn’t work.

It looks like that there is an error in the “design” statement.

"Error in DESeqDataSet(se, design = design, ignoreRank) : all variables in design formula must be columns in colData Calls: DESeqDataSetFromMatrix -> DESeqDataSet"

Btw. I have to install the latest version of TEtrascripts. :)

Kind regards, Vasilis.

On 2 Nov 2018, at 15:36, Oliver Tam notifications@github.com wrote:

Hi Vasilis,

It appears that you're probably running an older version of TEtranscripts, which is using DESeq (not DESeq2). Here is a DESeq2 script that should work directly on the count table file that was generated (XXXX.cntTable).

data <- read.table("testFile.cntTable",header=T,row.names=1) groups <- factor(c(rep("TGroup",2),rep("CGroup",2))) age <= factor(c("old","young","old","young")) min_read <- 1 data <- data[apply(data,1,function(x){max(x)}) > min_read,] sampleInfo <- data.frame(groups,row.names=colnames(data)) library(DESeq2, quietly=T) dds <- DESeqDataSetFromMatrix(countData = data, colData = sampleInfo, design = ~ age + groups) dds$condition = relevel(dds$groups,"CGroup") dds <- DESeq(dds) res <- results(dds,independentFiltering=F) write.table(res, file="testCase_gene_TE_analysis.txt", sep="\t",quote=F) resSig <- res[(!is.na(res$padj) & (res$padj < 0.050000) & (abs(res$log2FoldChange)> 0.000000)), ] write.table(resSig, file="testCase_sigdiff_gene_TE.txt",sep="\t", quote=F) Please make sure that you have DESeq2 installed (see here https://bioconductor.org/packages/release/bioc/html/DESeq2.html for installation instructions) Let me know if you still have issues. Thanks

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mhammell-laboratory/tetoolkit/issues/34#issuecomment-435420307, or mute the thread https://github.com/notifications/unsubscribe-auth/AGhlfxdSzcrNEqXyQXSJR5ulo7wC5hM4ks5urGaGgaJpZM4X4Khb.

olivertam commented 5 years ago

Hi,

Sorry about that. Please change the following line from:

sampleInfo <- data.frame(groups,row.names=colnames(data))

to

sampleInfo <- data.frame(groups,age,row.names=colnames(data))

This adds the age into the sample information required for the DESeq2 object Thanks

vasilislenis commented 5 years ago

Hi Oliver,

Yes, it runs, thank you very much!

But my problem is that I don’t think that it calculates what I am asking for…

For example, when I am applying DESeq2 on my data without taking account the covariates I am getting 1287 significant genes/TEs (422 of them are TEs). When I am using just the age as a covariate the results are very frustrating (only 1). I am sure that I am doing something wrong. First of all, I am using exact ages (e.g. 45, 58, etc.) and not scales like "old", "young," etc. If I convert the ages into a binary variable (if age < 50 = "young”, if age > 50 = “old”), I am taking as results 3246 significant genes/TEs (604 of them are TEs), but is this the right approach? And what will happen in the case that I want to use the other features that I would like to use as covariates? How can be expressed as a binary the unique aligned reads percentage or the pH?

My sincere apologies for this kind of questions, since they are not so relevant with the way that the TEtranscripts works, but I definitely would like your lights and your experience on this.

Thank you very much, Vasilis.

On 2 Nov 2018, at 18:43, Oliver Tam notifications@github.com wrote:

Hi,

Sorry about that. Please change the following line from:

sampleInfo <- data.frame(groups,row.names=colnames(data)) to

sampleInfo <- data.frame(groups,age,row.names=colnames(data)) This adds the age into the sample information required for the DESeq2 object Thanks

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mhammell-laboratory/tetoolkit/issues/34#issuecomment-435453661, or mute the thread https://github.com/notifications/unsubscribe-auth/AGhlf-YqJW7BRlD8ulBluYvi5-Fzq_GAks5urJJKgaJpZM4X4Khb.

olivertam commented 5 years ago

Hi Vasilis,

I would recommend reading the DESeq2 vignette for more detailed information on how DESeq2 handles continuous covariates. In particular, I think this section is important to note:

Continuous covariates can be included in the design formula in exactly the same manner as factorial covariates, and then results for the continuous covariate can be extracted by specifying name. Continuous covariates might make sense in certain experiments, where a constant fold change might be expected for each unit of the covariate. However, in many cases, more meaningful results can be obtained by cutting continuous covariates into a factor defined over a small number of bins (e.g. 3-5). In this way, the average effect of each group is controlled for, regardless of the trend over the continuous covariates. In R, numeric vectors can be converted into factors using the function cut.

To me, this means that if you're expecting a constant change due to every change in the variable, then the continuous variable makes sense (for example, if change in percentage mapped is correlated with a consistent increase/decrease in gene expression). However, if you're expecting that the gene expression change is more obvious after X units of change (e.g. difference per 10 years in age, or every 0.5 change in pH), it would be better to bin the values accordingly (binary was my simplistic approach, but could be applicable if you want to assess based on whether they are above or below median/accepted cutoff). It really depends on how you think the covariate could be affecting your data.

Hope this is helpful. Thanks.

vasilislenis commented 5 years ago

Hi Oliver,

Thank you very much for your thorough explanation. I believe that the “cutting” approach is the more appropriate to me, but firstly I have to find the correlation between the variables changing and the expected expression values, a thing that I believe will not be so trivial, I’m afraid.

Many thanks, Vasilis.

On 5 Nov 2018, at 17:05, Oliver Tam notifications@github.com wrote:

Hi Vasilis,

I would recommend reading the DESeq2 vignette <x-msg://3/bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html> for more detailed information on how DESeq2 handles continuous covariates. In particular, I think this section is important to note:

Continuous covariates can be included in the design formula in exactly the same manner as factorial covariates, and then results for the continuous covariate can be extracted by specifying name. Continuous covariates might make sense in certain experiments, where a constant fold change might be expected for each unit of the covariate. However, in many cases, more meaningful results can be obtained by cutting continuous covariates into a factor defined over a small number of bins (e.g. 3-5). In this way, the average effect of each group is controlled for, regardless of the trend over the continuous covariates. In R, numeric vectors can be converted into factors using the function cut.

To me, this means that if you're expecting a constant change due to every change in the variable, then the continuous variable makes sense (for example, if change in percentage mapped is correlated with a consistent increase/decrease in gene expression). However, if you're expecting that the gene expression change is more obvious after X units of change (e.g. difference per 10 years in age, or every 0.5 change in pH), it would be better to bin the values accordingly. It really depends on how you think the covariate could be affecting your data.

Hope this is helpful. Thanks.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mhammell-laboratory/tetoolkit/issues/34#issuecomment-435953313, or mute the thread https://github.com/notifications/unsubscribe-auth/AGhlfxhfolNxqaSLNkqOOks_-aFZtg1cks5usG_RgaJpZM4X4Khb.