mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

Request for a small test dataset #66

Closed stephanflemming closed 4 years ago

stephanflemming commented 4 years ago

Hi,

I am working on Galaxy wrappers for TEtoolkit.

Find the PR here https://github.com/galaxyproject/tools-iuc/pull/2967

Can you provide a small (< 1MB) test dataset? Especially the provided annotation files (--TE,--GTF) are to big and cannot be implemented.

Thanks in advance and stay healthy, Stephan

olivertam commented 4 years ago

Hi Stephan,

I can provide testing for TEtranscripts and TEcounts at this point. TEpeaks is still in beta, and is not ready for integration into Galaxy. I have attached smaller GTF files for the purposes of Galaxy testing, though they may not show the same results as with the fully GTF. dm3_refGene_trimmed.gtf.gz dm3_rmsk_TE_trimmed.gtf.gz You should be able to test with the single-end test data files on our server, but let me know if you're having trouble, and I will try another way to transfer them to you (Github doesn't like me to attach BAM files).

Thanks for all your efforts.

stephanflemming commented 4 years ago

Hi,

a run with

TEtranscripts --sortByPos --mode multi --TE te.gtf --GTF gtf.gtf --project result -t treatment1.bam treatment2.bam -c control1.bam control2.bam

gets stuck in the "gene-wise dispersion estimates" step. Some log infos that may help:

# ARGUMENTS LIST:
# name = result
# treatment files = ['treatment1.bam', 'treatment2.bam']
# control files = ['control1.bam', 'control2.bam']
# GTF file = gtf.gtf 
# TE file = te.gtf 
# multi-mapper mode = multi 
# stranded = no
# differential analysis using DESeq2
# normalization = DESeq2_default
# FDR cutoff = 5.00e-02
# fold-change cutoff =  1.00
# read count cutoff = 1
# number of iteration = 100
# Alignments grouped by read ID = True

I used the testdata_SE dataset and the annotations you provided above.

Thanks in advance, Stephan

olivertam commented 4 years ago

Hi Stephan,

May I ask which version of DESeq2 are you using? The "gene-wise dispersion estimations" come after the quantification step, and associated with the DESeq2 package. To confirm that quantification is done, you should have a result.cntTable file, and a result_DESeq2.R file. I'll be happy to take a look at those two files to see if there are issues with them.

Thanks

stephanflemming commented 4 years ago

Hi Oliver,

we are getting closer :-)

When applying --DESeq the following error message appears

Error in parametricDispersionFit(means, disps) : 
  Parametric dispersion fit failed. Try a local fit and/or a pooled estimation. (See '?estimateDispersions')
Calls: estimateDispersions ... estimateAndFitDispersionsFromBaseMeansAndVariances -> parametricDispersionFit

and [...]_gene_TE_analysis.txt and [...]_result_sigdiff_gene_TE.txt are not created.

I am still on R 3.6.1, which is the latest version in conda (see https://anaconda.org/r/r-base/). If that's the reason, I will remove that option until the new version is available.

Thanks!

olivertam commented 4 years ago

Hi Stephen,

The error message from DESeq is typical for these small datasets, as the information is too spare for the algorithm to perform parametric dispersion. I'll be happy to take a look at the result.cntTable and result_DESeq2.R files and see if they are also stuck at the same step as you. Thanks.

stephanflemming commented 4 years ago

Hi,

even with the large test dataset an error occurs when trying --DESeq.

TEtranscripts -t treatment1.bam treatment2.bam -c control1.bam control2.bam --TE te.gtf --GTF gtf.gtf --project result --DESeq --n DESeq_default

Dataset is taken from the testdata_SE folder, I renamed the files, but cannot check the original names because the website is not responding.

Error in parametricDispersionFit(means, disps) : 
  Parametric dispersion fit failed. Try a local fit and/or a pooled estimation. (See '?estimateDispersions')
Calls: estimateDispersions ... estimateAndFitDispersionsFromBaseMeansAndVariances -> parametricDispersionFit
In addition: Warning message:
glm.fit: algorithm did not converge 
Execution halted

result.zip

Another point, is it possible that the bioconda recipe needs to be updated? See https://bioconda.github.io/recipes/tetranscripts/README.html.

Thank you, Stephan

olivertam commented 4 years ago

Hi Stephan,

I'm afraid that we did not generate the bioconda recipe. I believe it was @dpryan79 who created it, and you might be correct that the dependencies might not have been updated. For testing purposes, we do not recommend using DESeq due to the error that you have experiencing. Without seeing the intermediate files that you generated (e.g. XXXX.cntTable and the corresponding DESeq2 script, XXXX_DESeq2.R) from the testing, it would be difficult to determine where the issue might be originating from (e.g. R dependencies downsteam of quantification).

Thanks.

stephanflemming commented 4 years ago

Hi,

I added .cntTable and _DESeq2.R already to my last comment. Have a look at the result.zip. Sorry, I should have mentioned that.

Thanks for your help, Stephan

olivertam commented 4 years ago

Hi Stephan,

It looks like you sent the DESeq script, not DESeq2. If you need to use the DESeq mode, you can try the following change to the R script:

cds <- estimateDispersions(cds,method="per-condition")

to

cds <- estimateDispersions(cds,method="per-condition",fitType="local")

Perhaps I'm not fully understanding what exactly you require for the Galaxy wrapper. Could you explain the setup that you need? Are you setting up/testing an XML wrapper, or are you also bundling the tools into a package? I apologize if these questions make no sense, as I haven't worked with Galaxy since 2015.

Thanks.

stephanflemming commented 4 years ago

Hi Oliver,

I just create the XML wrapper files for TEtranscript (and TEpeaks). Internal tests are required and I try out each parameter to take care that all dependencies are available, implement default/min/max values, etc.

TEtranscripts -t treatment1.bam treatment2.bam -c control1.bam control2.bam --TE te.gtf --GTF gtf.gtf --project result_deseq2

gives the follwoing result_deseq2.zip whereas

TEtranscripts -t treatment1.bam treatment2.bam -c control1.bam control2.bam --TE te.gtf --GTF gtf.gtf --project result_deseq --DESeq with cds <- estimateDispersions(cds,method="per-condition",fitType="local") works without errors and gives result_deseq.zip

What happens here? And is there a possibility that I hand over the fitType argument via command line?

Thanks, Stephan

olivertam commented 4 years ago

Hi Stephan,

Thanks for the information. We retained some compatibility with DESeq mainly for legacy purposes (in cases where initial analyses were done with DESeq, and could not be re-done with DESeq2). Most of our users are actually using DESeq2, and we would recommend that the Galaxy wrapper requires DESeq2 instead of DESeq.

To address your question, estimateDispersions would fail on certain datasets (both biological and simulated), and the "fix" is to add the fitType="local" parameter. However, this is a troubleshooting approach, and typically not recommended by the authors of DESeq for datasets that do not require it (parametric dispersion estimation is more robust from run to run).

Based on your output, it looks like the DESeq2 test run completed without major issues (which I believe was the initial issue posted here). Did you encounter other problems with that test run?

Please let me know how else I can help.

Thank you for your efforts.

stephanflemming commented 4 years ago

Hi,

I removed the --DESeq option. The test data files you provided work without errors (with all other parameters). So, that problem is solved. The wrapper will be included and can be found (and used) here Galaxy tool

Some points are left:

I removed TEpeaks until a none-beta status is reached.

As mentioned before, there are two - in my opinion redundant - recipes https://github.com/bioconda/bioconda-recipes/blob/master/recipes/tetoolkit/meta.yaml https://github.com/bioconda/bioconda-recipes/blob/master/recipes/tetranscripts/meta.yaml Was TEtoolkit once just renamed to TEtranscripts? I guess the first recipe is then deprecated.

Does it make sense to run TEtranscripts with only one treatment and control file? Is this equivalent to a TEcount run with the same dataset? Or in other words: is a separate wrapper for TEcount required?

As a hint: the parameter fragmentLength is not listed in stdout.

Thanks, Stephan

olivertam commented 4 years ago

Hi Stephan,

Thanks for setting up the TEtranscripts wrapper. TEpeaks is still in beta, but we hope to release it in the near future.

There was a rename of TEToolkit to TEtranscripts, thus, the first recipe should be deprecated. This was to reduce confusion, as we had initially wanted to release multiple tools in that tool suite, but ended up only releasing TEtranscripts.

I personally think a separate wrapper for TEcount is beneficial, but perhaps harder to integrate into the Galaxy framework. The main functionality of TEtranscripts/TEcount is to quantify the gene and transposable element expression of an RNA-seq library, with TEtranscripts also automatically generating an Rscript for DESeq2 analysis. However, since TEtranscripts run each library serially, the computing time increases significantly with each additional sample.

In contrast, TEcount allows the user to quantify libraries in parallel (one TEcount run per library), and thus speeds up the quantification step. However, the user would need to create their own Rscript to perform the DESeq2 differential analysis. While this provides greater flexibility in terms of mixing and matching samples (especially when adding new samples to a differential analysis), I am not sure if the Galaxy framework is amenable to user-provided R scripts (I haven't explored it for a long time). Thus, while TEcount is very useful for analyses with large number of samples, we don't know if Galaxy is the best place for it. I'd be happy to discuss this more.

You are correct that the fragment length parameter is not listed. I think that was because that feature is not as meaningful for RNA-seq data as it is for ChIP-seq, and thus we decided against reporting it. We will revisit this idea internally to see if it's worth reporting in the log file.

Thanks.

stephanflemming commented 4 years ago

Hi,

the wrapper of TEtranscripts is available here.

Feel free to test it :-)

Stephan

olivertam commented 4 years ago

Thank you!