Finding as much as 68% of reads originating from TE elements?

aleighbrown commented 4 years ago

Hello,

Trying to reproduce Figure 1 from this on my own data https://link.springer.com/protocol/10.1007%2F978-1-4939-7710-9_11

And I've found something that seems somewhat..alarming making me think I've made an incorrect assumption somewhere about how the TECounts program is working.

I've run the TECounts program to get these counts files "sample.cntTable". For each sample there are 2 columns, gene/TE and a number. So combing samples we get a table like this for input into DESeq2

"

gene/TE	sample1	sample2
ENSG00000000457.14	16	26
ENSG00000000460.17	85	62
---
Zaphod2:hAT-Tip100:DNA	207	177
Zaphod3:hAT-Tip100:DNA	575	415

"

In order to count the reads which are coming from TE elements I'm simply using assumption that in the TEcounts output folder rows where the element start with ENS is a gene and otherwise is a TE.

te_table,te_element := ifelse(grepl("^ENS",gene.TE), "gene", "TE")]

Then doing simply a sum per sample of the counts from genes and the counts from TE I'm getting a range of 27% to 68% across my samples. Which seems way too high to make sense.

Are there double counts maybe? Eg a read that is overlapping a TE and a gene gets a count for both the gene and the TE?

olivertam commented 4 years ago

Hi,

For most of our samples of human cell/tissue, we find around 20-30% of reads to be originating from TE with the software. We haven't noticed too many RNAseq libraries where it's more than 50%, and there could be various factors that could contribute to it (maybe biological, maybe technical). Regarding double counting, this is something that the software accounts for. If the read is uniquely mapped, then it will be distributed to the gene if there is an overlap. If the read is ambiguously mapped, then it will be preferentially distributed to the TE (and thus should prevent double-counting). Without looking closer at the libraries (e.g. QC metrics, or types of "highly expressed TE", I cannot say whether the high TE contribution is biological or technical in origin. Please let me know if you have other questions, or if you want us to take a closer look.

Thanks.

olivertam commented 4 years ago

One of things that I would check is the counts for highly expressed "housekeeping" genes like GAPDH (ENSG00000111640) and ACTB (ENSG00000075624) and see what their counts are. They should be pretty well expressed (usually >15000 for a library with 20 million mapped reads). Another thing that I would double-check is what --stranded mode you are using. If your library is generated using the Illumina TruSeq kit, you will need the --stranded reverse parameter, otherwise you will be under-counting genes. Hopefully some of these suggestions might help.

Thanks.

aleighbrown commented 4 years ago

Thanks Oliver, looks like it was alright in terms of housekeeping gene etc, just expression was much higher than expected thanks for clarifying!

mhammell-laboratory / TEtranscripts

Finding as much as 68% of reads originating from TE elements? #56