mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

TEcount: does it count intronic TE? #150

Closed by ddepierre 9 months ago

ddepierre commented 9 months ago

Hi,

The gene GTF that I use as input for --GTF is a full GTF with genes, transcripts, exons, non-coding, long non-coding... I was wondering whether everything overlapping any entry of my gene GTF is removed from TE read quantification, or whether there is a filter to avoid losing TEs that overlap lncRNAs or lie in introns?

It is unclear to me whether reads overlapping TEs in introns are counted as TE transcripts in TEcount. How can I be sure to include or exclude intronic TEs?

Thanks, David

olivertam commented 9 months ago

Hi David,

Thanks for your interest in the software. You are correct that there are a lot of lncRNA that could confound TE quantification. The approach that TEcount takes is as follows:

By default, intronic TE will be quantified, as introns are not considered genic annotations. If you want to exclude intronic TE, then you would need to remove those entries from the TE GTF prior to running TEcount. You could do the following (with bedtools):

$ awk '$3=="transcript"' [Gene GTF] > [Gene Transcript GTF]
$ intersectBed -v -a [TE GTF] -b [Gene Transcript GTF] > [no intronic TE GTF]
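
To make the effect of the `-v` filter concrete, here is a minimal sketch on toy data. It deliberately uses plain awk instead of bedtools so that it runs without extra tools; it is only an illustration of the "keep TE entries that overlap no transcript" logic, not a replacement for the bedtools command above. All coordinates, IDs and file names are made up, and strand is ignored (as bedtools does by default).

```shell
tmp=$(mktemp -d)

# Toy gene GTF: one transcript spanning chr1:100-900 (hypothetical values).
printf 'chr1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' > "$tmp/tx.gtf"

# Toy TE GTF: one TE inside the transcript, one intergenic, one on another chromosome.
printf 'chr1\ttoy\texon\t500\t600\t.\t+\t.\tgene_id "TE_A";\n'   > "$tmp/te.gtf"
printf 'chr1\ttoy\texon\t2000\t2100\t.\t+\t.\tgene_id "TE_B";\n' >> "$tmp/te.gtf"
printf 'chr2\ttoy\texon\t500\t600\t.\t+\t.\tgene_id "TE_C";\n'   >> "$tmp/te.gtf"

# Print only TE lines that overlap no transcript on the same chromosome.
awk 'BEGIN { FS = "\t" }
     NR == FNR { n++; chr[n] = $1; s[n] = $4 + 0; e[n] = $5 + 0; next }  # first file: transcripts
     { hit = 0
       for (i = 1; i <= n; i++)
         if (chr[i] == $1 && s[i] <= $5 + 0 && e[i] >= $4 + 0) { hit = 1; break }
       if (!hit) print }' "$tmp/tx.gtf" "$tmp/te.gtf" > "$tmp/te_no_overlap.gtf"

cat "$tmp/te_no_overlap.gtf"
```

In this toy case TE_A is dropped and TE_B and TE_C are kept. Note that the filter removes every TE overlapping a transcript span, exonic as well as intronic.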

Let me know if you have other questions. Thanks.

ddepierre commented 9 months ago

Hi Oliver, thanks for your super quick answer!

If I use as input a gene GTF containing only type='gene' (column 3), will TEcount exclude all the counts overlapping the full gene body, including introns? Or does TEcount base its filter on type='transcript'? (In that case I guess that nothing will be excluded.)

OK, I have to modify my alignment parameters then, because I wanted to test something quickly and didn't take the time to realign allowing multiple mapping.

How many multi-mapping alignments do you recommend allowing to better detect TEs? I am working with mouse total RNA-seq, paired-end, stranded.

But I still have signal on TEs with uniquely mapped reads only, so how would using only uniquely mapped reads affect TE differential expression analysis? Will I get more "old" TEs, because they have more mutations and so are more unique, but not too "old", because the oldest are silenced at some point? I am totally new to TEs, so please excuse me if these are dumb questions...

Thanks, David

olivertam commented 9 months ago

Hi,

TEcount works on the GTF lines with "exon", so you can take the lines of a gene GTF with type=="gene" or type=="transcript" and convert them to "exon" to exclude any intronic TE. Here is a way to do it:

$ awk 'BEGIN{FS="\t";OFS="\t"};$3=="gene"{$3="exon";print}' [Gene GTF] > [Gene model GTF]
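
As an illustration, here is the same one-liner applied to a toy gene GTF, so the effect is easy to see. The coordinates, IDs and file names are made up; the only point is that the original exon line is dropped and the gene line comes out relabelled as a single "exon" covering the whole gene body.

```shell
tmp=$(mktemp -d)

# Toy gene GTF (hypothetical): one gene spanning 100-900 and one of its exons.
printf 'chr1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n'  > "$tmp/genes.gtf"
printf 'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "G1";\n' >> "$tmp/genes.gtf"

# Keep only the "gene" lines and relabel column 3 as "exon".
awk 'BEGIN{FS="\t";OFS="\t"};$3=="gene"{$3="exon";print}' "$tmp/genes.gtf" > "$tmp/gene_model.gtf"

cat "$tmp/gene_model.gtf"
```

The output has a single line with "exon" in column 3 and coordinates 100-900. If your GTF has no "gene" lines, the same pattern with `$3=="transcript"` should behave analogously.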

We typically allow up to 100 alignments to the genome, but exclude things that map more than that (since those are likely low complexity sequences).
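
If you want to check how an existing alignment relates to that 100-alignment cut-off, one rough way is to look at the standard SAM `NH` tag (number of reported alignments for the read). The sketch below runs on a toy SAM file with made-up reads; it assumes your aligner writes `NH:i:` tags, and it treats records without the tag as unique, which is an assumption of this example and not something TEcount does. The limit itself is set at alignment time (in STAR, for instance, via `--outFilterMultimapNmax`).

```shell
tmp=$(mktemp -d)

# Toy SAM records (hypothetical reads); only the NH:i tag matters here.
printf 'r1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1\n'   > "$tmp/toy.sam"
printf 'r2\t0\tchr1\t500\t1\t50M\t*\t0\t0\t*\t*\tNH:i:37\n'   >> "$tmp/toy.sam"
printf 'r3\t0\tchr2\t900\t0\t50M\t*\t0\t0\t*\t*\tNH:i:250\n'  >> "$tmp/toy.sam"

# Classify each alignment record by its NH value relative to the cut-off.
summary=$(awk -F'\t' -v max=100 '
  /^@/ { next }                       # skip header lines
  { nh = 1                            # assumption: no NH tag means unique
    for (i = 12; i <= NF; i++) if ($i ~ /^NH:i:/) nh = substr($i, 6) + 0
    if (nh == 1) uniq++; else if (nh <= max) multi++; else over++ }
  END { printf "unique=%d multi=%d over_max=%d\n", uniq, multi, over }' "$tmp/toy.sam")

echo "$summary"
```

On a real BAM file you would first stream it to text (for example with `samtools view`) and pipe that into the same awk program. The counts are per alignment record, not per read.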

You will still see some signal with uniquely mapped reads, as there is a portion (up to 50%) that would align uniquely just to a TE (though that does include intronic TE). And yes, they will be biased towards those that have more alterations that make them more "mappable", and thus less likely to be young/active.

Thanks.

ddepierre commented 9 months ago

Ok thanks for your support!

It worked well, David