mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

All ERV genes have high reads in tumor #170

Closed frankligy closed 5 months ago

frankligy commented 6 months ago

Hello,

Thanks so much for developing this wonderful tool. We've been applying TEcount to a set of tumor samples, focusing on quantifying ERV gene expression. I expected a subset of ERV genes to be highly expressed and the others to have zero or very low counts, because ERVs are usually not considered active in healthy tissue and can be selectively turned on in tumors due to epigenetic disruption.

But it turns out that all 594 ERV genes in your official TE GTF file receive a high number of reads. I further normalized by sequencing depth and by accumulated ERV gene length, and the normalized values are still high. It is possible that high ERV expression is simply characteristic of the cancer we are looking at, but I'd like your thoughts on how we should interpret the count results, given that those reads were assigned by the EM algorithm. Is there a count cutoff you would recommend to distinguish "truly" expressed ERV genes from other ERV genes that are not that abundant?
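
To make the normalization described above concrete, here is a minimal sketch of an RPKM-style calculation (reads per kilobase of accumulated length per million mapped reads). The feature names, counts, lengths, and library size are hypothetical placeholders, not values from this dataset, and `rpkm` is an illustrative helper rather than part of TEcount.

```python
# Toy sketch: normalize TE counts by accumulated length and sequencing depth.
# All names and numbers below are hypothetical placeholders.

def rpkm(count, accumulated_length_bp, total_mapped_reads):
    """Reads per kilobase of accumulated feature length per million mapped reads."""
    if accumulated_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    kilobases = accumulated_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return count / kilobases / millions

# Hypothetical ERV features: (raw count, accumulated length over all copies, in bp)
toy_features = {
    "ERV_many_copies": (50_000, 2_000_000),
    "ERV_few_copies": (500, 20_000),
}
total_mapped_reads = 20_000_000  # hypothetical library size

normalized = {
    name: rpkm(count, length, total_mapped_reads)
    for name, (count, length) in toy_features.items()
}

for name, value in normalized.items():
    print(f"{name}\t{value:.2f}")
```

In this toy case, two features whose raw counts differ 100-fold end up with the same normalized value, because the larger count is spread over 100 times more accumulated sequence.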

Thanks so much in advance, and happy to further clarify my question!

Best, Frank

olivertam commented 6 months ago

Hi Frank,

Thank you for your interest in the software.

This is a very interesting (and potentially confusing) aspect of TE "expression". The term "active" has largely been associated with either detection of TE subfamilies via qPCR/Northern, detection of ERV protein via antibody, or assessment of their retrotransposition activity. In that sense, it is probably correct that ERVs are not "active" in most tissues, in that they may not be generating functional proteins or performing retrotransposition.

However, it is still not abundantly clear (at least without in-depth long-read sequencing) whether this truly reflects the overall transcriptional activity of TEs genome-wide. What we have observed with our algorithm is that differential expression of these TEs across experimental conditions correlates with orthogonal assays such as qPCR, and thus this is the primary use of the software.

TL;DR: We have observed results like yours, and would recommend performing differential analysis against "healthy" tissue (or whatever you want to use as a "reference") to determine whether there are significant alterations in expression in the tumor samples, and potentially find ERVs that are worth further investigation. I would hesitate to set cutoffs, as each ERV might have a different number of insertions (either active or inactive), and thus a fixed cutoff might have unintended effects (e.g. selecting for ERVs with lots of copies in the genome that are each expressed at a low-ish level).
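
As a rough illustration of comparing tumor against a reference instead of applying an absolute cutoff, the sketch below computes depth-normalized counts (CPM) and a log2 fold change per feature. This is only a toy under stated assumptions: the feature names, counts, and library sizes are hypothetical, `cpm` and `log2_fold_change` are illustrative helpers rather than part of the software, and it does no statistical testing. A real analysis would use replicates and a dedicated differential expression package such as DESeq2.

```python
import math

# Toy sketch: tumor vs. reference comparison on depth-normalized counts.
# All names and numbers below are hypothetical placeholders.

def cpm(counts, library_size):
    """Counts per million mapped reads for each feature."""
    if library_size <= 0:
        raise ValueError("library size must be positive")
    return {feature: c * 1_000_000 / library_size for feature, c in counts.items()}

def log2_fold_change(tumor_cpm, reference_cpm, pseudocount=1.0):
    """log2((tumor + pseudocount) / (reference + pseudocount)) per feature."""
    return {
        feature: math.log2(
            (value + pseudocount) / (reference_cpm.get(feature, 0.0) + pseudocount)
        )
        for feature, value in tumor_cpm.items()
    }

tumor_counts = {"ERV_A": 300, "ERV_B": 1_000}     # hypothetical raw counts
reference_counts = {"ERV_A": 30, "ERV_B": 900}    # hypothetical raw counts

tumor_cpm = cpm(tumor_counts, library_size=10_000_000)
reference_cpm = cpm(reference_counts, library_size=9_000_000)
lfc = log2_fold_change(tumor_cpm, reference_cpm)

for feature, value in sorted(lfc.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature}\tlog2FC={value:.2f}")
```

Here both hypothetical features have substantial counts in the tumor sample, but only `ERV_A` changes relative to the reference, which is the kind of signal an absolute count threshold would not separate.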

Hope this is somewhat helpful. Please let me know if you have further questions or need clarifications.

Thanks.

frankligy commented 6 months ago

Thanks very much @olivertam, that's extremely helpful!

Best, Frank

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days