mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

dm6 TE GTF file #102

MiaPaw closed 2 years ago

MiaPaw commented 2 years ago

Hi,

I'd like to use TEtranscripts to compare TE expression between wild-type and mutant fly lines of Drosophila melanogaster. I found the TE GTF file for Drosophila on your website, but I have a few questions about it. Since the GTF file was created some time ago, I wonder which version of the dm6 genome would be the best choice as the reference genome. Can I align to the current release, r6.41, or would it be better to use a previous one?

My second question concerns the LTR transposon families in the TE GTF file. Is there a particular reason why, for most of them, the LTR sequences and the internal sequences are counted separately? What would be the best way to handle the two count values per LTR transposon? Would it be reasonable to add both values?

Thank you very much for your time.

olivertam commented 2 years ago

Hi,

Thank you for your interest in the software.

From what I understand, all r6.xx FlyBase releases are based on the same genome assembly (dm6), though please ensure that the chromosome names match between the genomic sequence and your gene/TE annotations. The changes between FlyBase releases revolve around gene models/predictions/annotations, so it is up to you which release you use for your gene annotations.
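As a quick way to check the chromosome names, here is a minimal sketch in Python. It is not part of TEtranscripts; the function names are made up for illustration, and the toy inputs stand in for your genome FASTA and TE/gene GTF files. It lists any sequence names that appear in the annotation but not in the genome (for example `chr2L` versus `2L`).

```python
import io


def fasta_names(handle):
    """Collect sequence names (first token after '>') from a FASTA stream."""
    return {
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and len(line.strip()) > 1
    }


def gtf_names(handle):
    """Collect first-column sequence names from a GTF stream, skipping comments."""
    names = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def missing_from_genome(fasta_handle, gtf_handle):
    """Return annotation sequence names that are absent from the genome FASTA."""
    return gtf_names(gtf_handle) - fasta_names(fasta_handle)


# Toy inputs (hypothetical): the genome uses '2L', the annotation uses 'chr2L'.
genome = io.StringIO(">2L some description\nACGT\n>X\nACGT\n")
te_gtf = io.StringIO('chr2L\tsource\texon\t1\t4\t.\t+\t.\tgene_id "x";\n')
print(sorted(missing_from_genome(genome, te_gtf)))  # ['chr2L']
```

With real files, you would pass `open(path)` handles instead of the `io.StringIO` objects. An empty result means every annotated sequence name is present in the genome.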

The TE annotations for dm6 are derived from RepeatMasker/Repbase. The LTR and internal sequences tend to be split up because different LTRs can go with different internal sequences, and they can also be found independently of each other (e.g. solo LTRs). Thus, it's not trivial to reassemble LTR elements without additional curation (which may require validating that the LTR and internal sequences are transcribed together). Therefore, we decided to retain the separation of annotations used by the prediction algorithms.

Since TEtranscripts aggregates counts from all annotated copies of a particular TE subfamily/element, the count for an LTR annotation could include reads both from LTRs that are part of a "complete" LTR TE and from solo LTRs. As such, we do not recommend adding the counts for the two annotations together. If you wish to do locus/copy-level quantification, you can try TElocal.
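To illustrate why adding the two values can mislead, here is a toy sketch in Python. The subfamily names, copy labels, and read counts are all invented, and the aggregation is a simplified stand-in for subfamily-level counting rather than the actual TEtranscripts code.

```python
from collections import defaultdict

# Hypothetical per-copy read counts: (subfamily, copy label, reads).
# One full-length element contributes two LTR copies and one internal copy;
# one solo LTR elsewhere in the genome contributes only to the LTR subfamily.
copies = [
    ("EXAMPLE_LTR", "full_copy_1_5prime", 40),
    ("EXAMPLE_LTR", "full_copy_1_3prime", 35),
    ("EXAMPLE_LTR", "solo_copy_1", 50),
    ("EXAMPLE_I", "full_copy_1_internal", 120),
]


def aggregate_by_subfamily(records):
    """Sum reads over all copies of each subfamily, ignoring genomic location."""
    totals = defaultdict(int)
    for subfamily, _copy, reads in records:
        totals[subfamily] += reads
    return dict(totals)


totals = aggregate_by_subfamily(copies)
print(totals)  # {'EXAMPLE_LTR': 125, 'EXAMPLE_I': 120}
```

In this toy case, 50 of the 125 LTR reads come from the solo LTR, so summing the LTR and internal totals (245) would overstate expression of the full-length element. The subfamily-level totals alone cannot tell you how large that solo-LTR share is, which is why copy-level quantification is needed to separate the two.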

If you are aware of a better curated TE annotation (e.g. one that correctly combines the LTR and internal sequences in the genome), we would be happy to try to generate a TE GTF from that.

Thanks.