mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

Difference between provided GTF files and the one downloaded from UCSC #42

Closed MengjunWu closed 5 years ago

MengjunWu commented 5 years ago

Hi,

When I compared the provided TE GTF file (mm10) with the latest one I downloaded from UCSC, the number of TE elements differed: the one you provide has 3725827, and the one from UCSC has 5147736.

Did you apply some special filtering, or is this just a version difference? If it is a version difference, could you update the mm10 file?

Thanks!

Best, Mengjun

olivertam commented 5 years ago

Hi Mengjun,

Thanks for your interest in TEtoolkit. When you download from UCSC, are you getting the RepeatMasker track? That is also our starting source of information, but as you correctly deduced, we do filter out some repetitive features from the RepeatMasker track. These include low complexity stretches (e.g. A-rich sequences), simple repeats (e.g. TG dinucleotide repeats), rRNA, scRNA, snRNA, srpRNA and tRNA. We feel that these are technically not transposable elements, so we chose not to quantify them in our analyses. We have retained Satellite sequences, though we are still debating whether they should be removed as well. If you require a version of the GTF that contains these other sequences, please let me know and I can try to generate one for you.

Thanks.
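For readers who want to approximate the filtering described above on their own RepeatMasker download, here is a minimal Python sketch. It is not the script used to build the provided GTF. It assumes a tab-separated dump of the UCSC `rmsk` table with the column order listed in `RMSK_COLUMNS`, and it assumes the `repClass` labels in `EXCLUDED_CLASSES` correspond to the categories named in the reply; both should be checked against the actual file. The function name `filter_te_rows`, the helper `make_row`, and the toy rows in `SAMPLE` are made up for illustration.

```python
# Assumed column order of a UCSC rmsk table dump; verify against your download.
RMSK_COLUMNS = [
    "bin", "swScore", "milliDiv", "milliDel", "milliIns",
    "genoName", "genoStart", "genoEnd", "genoLeft", "strand",
    "repName", "repClass", "repFamily", "repStart", "repEnd", "repLeft", "id",
]

# Assumed repClass labels for the categories listed in the reply.
# "Satellite" is deliberately absent, since the reply says it is retained.
EXCLUDED_CLASSES = {
    "Low_complexity", "Simple_repeat",
    "rRNA", "scRNA", "snRNA", "srpRNA", "tRNA",
}


def filter_te_rows(lines):
    """Yield rmsk rows (as dicts) whose repClass is not in EXCLUDED_CLASSES."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and header/comment lines.
        if not line or line.startswith("#"):
            continue
        row = dict(zip(RMSK_COLUMNS, line.split("\t")))
        if row.get("repClass") in EXCLUDED_CLASSES:
            continue
        yield row


def make_row(rep_name, rep_class, rep_family):
    """Build one toy rmsk line; all coordinates and scores are placeholders."""
    fields = ["585", "1000", "0", "0", "0", "chr1", "0", "100", "-100", "+",
              rep_name, rep_class, rep_family, "1", "100", "0", "1"]
    return "\t".join(fields)


# Toy input, not real annotation data.
SAMPLE = [
    make_row("L1Md_A", "LINE", "L1"),
    make_row("(TG)n", "Simple_repeat", "Simple_repeat"),
    make_row("A-rich", "Low_complexity", "Low_complexity"),
    make_row("tRNA-Ala-GCA", "tRNA", "tRNA"),
    make_row("GSAT_MM", "Satellite", "Satellite"),
]

kept = [row["repName"] for row in filter_te_rows(SAMPLE)]
print(kept)
```

On a real file, the same generator can be fed an open file handle, for example `filter_te_rows(open("rmsk.txt"))`, where `rmsk.txt` is a hypothetical local path. Counting the surviving rows gives a rough check against the element count of the provided GTF, though other differences (annotation version, additional processing) may also affect the total.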

MengjunWu commented 5 years ago

Thank you very much for the reply. We are currently only interested in transposable elements, so the provided GTF is sufficient :)