open reading frames in each TE element

frankligy commented 1 month ago

Hello,

I've been a big fan of TElocal, especially for its clear definitions of the complicated TE system. Coming from another field, I always feel a disconnect between mainstream TE review papers and the actual genomic locations of the elements they discuss. For example, there are well-defined "ORF1" and "ORF2" proteins from LINE (https://www.nature.com/articles/nrc.2017.35/figures/1), but I have never really understood where these two ORFs sit in the genome. Does each TElocal-defined LINE "TE transcript" (or "duplicate" in the name) have its own ORF1 and ORF2? Or are ORF1 and ORF2 something defined globally, when you concatenate all the LINE elements across the whole genome?

I know it's not an issue with TElocal itself, but since I've always used TElocal, I just want to connect the dots between the actual bioinformatics analysis and the mainstream TE terminology.

Thanks a lot in advance, Frank

olivertam commented 1 month ago

Hi,

ORF1 and ORF2 are part of the LINE, so a "full-length" LINE would have the sequences for both ORF1 and ORF2. However, LINE "replication" is not always perfect, and 5' truncation occurs frequently during the insertion process. As a result, not all LINE copies in the genome contain ORF1 and ORF2; whether they do depends on the extent of truncation. Furthermore, mutations in the ORF sequence itself can also make the ORF non-functional (if the copy is not under selection pressure to maintain activity, or is under negative selection pressure).

Long story short, each LINE copy (which we are calling a "transcript"/duplicate) could contain ORF1 and ORF2 if it is full-length and close to the consensus LINE sequence, but not all genomic copies will satisfy these criteria.
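
As a rough illustration of that last point (this is not part of TElocal, and `find_orfs` and `min_codons` are hypothetical names made up for the example), one could take the genomic sequence of a single LINE copy and scan it for intact open reading frames. The sketch below assumes the sequence is already oriented on the element's sense strand, treats an ORF as simply ATG-to-stop in one frame, and uses an arbitrary length cutoff, so it is a simplification rather than a proper annotation method:

```python
# Hypothetical helper for illustration only; not a TElocal function.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=100):
    """Return sorted (start, end, frame) tuples for forward-strand ORFs.

    An ORF here runs from the first ATG in a frame to the next in-frame
    stop codon. `end` is exclusive and includes the stop codon.
    `min_codons` counts codons from the ATG up to, but not including,
    the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close it and look for the next ATG
    return sorted(orfs)


# Toy sequences (made up, far shorter than any real LINE):
intact = "ATG" + "GCT" * 5 + "TAA"
truncated = intact[3:]  # 5' truncation removes the start codon
mutated = "ATG" + "GCT" * 2 + "TAA" + "GCT" * 3 + "TAA"  # premature stop

print(find_orfs(intact, min_codons=5))
print(find_orfs(truncated, min_codons=5))
print(find_orfs(mutated, min_codons=5))
```

With these toy inputs, only the intact sequence yields an ORF; the 5'-truncated and prematurely stopped versions yield none, which mirrors why many genomic LINE copies lack a usable ORF1 or ORF2.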

Please let me know if anything is unclear.

frankligy commented 1 month ago

Thanks very much for the clear explanation!

mhammell-laboratory / TElocal

open reading frames in each TE element #45