mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
213 stars 29 forks

full-length L1 #195

Closed BenxiaHu closed 2 weeks ago

BenxiaHu commented 1 month ago

Hello, thanks for developing these tools. I have one question: how can I extract full-length L1 elements from the GTF files you built (https://labshare.cshl.edu/shares/mhammelllab/www-data/TEtranscripts/TE_GTF/)?

Best,

olivertam commented 1 month ago

Hi,

It's not trivial, as you would need to obtain the RepeatMasker output from UCSC and parse the repStart, repEnd and repLeft fields to determine which copies are full-length. In a simplistic model, you can look for L1 copies that have a repStart of 1 and a repEnd of around 6 kb. However, this might differ from organism to organism and element to element; for example, we have found previously published "active" L1Hs elements to have a repStart of ~125.
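The simplistic model above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not a tested pipeline: it assumes a tab-delimited UCSC `rmsk` table in the standard 17-column order, and it assumes that for minus-strand hits the repeat start is stored in `repLeft` while `repStart` holds the negative "bases remaining" value. The function name `full_length_l1` and the thresholds `max_start=150` (chosen to tolerate a repStart of ~125) and `min_end=6000` are hypothetical and should be tuned per organism and subfamily. The demo rows are synthetic.

```python
import csv
import io

# Assumed column order of the UCSC rmsk table; verify against your download.
RMSK_COLUMNS = [
    "bin", "swScore", "milliDiv", "milliDel", "milliIns",
    "genoName", "genoStart", "genoEnd", "genoLeft", "strand",
    "repName", "repClass", "repFamily", "repStart", "repEnd", "repLeft", "id",
]


def full_length_l1(handle, max_start=150, min_end=6000):
    """Yield rmsk rows for L1 copies that span most of the consensus.

    max_start and min_end are illustrative thresholds, not published values.
    """
    reader = csv.DictReader(handle, fieldnames=RMSK_COLUMNS, delimiter="\t")
    for row in reader:
        if row["bin"].startswith("#"):  # skip a Table Browser header line
            continue
        if row["repFamily"] != "L1":
            continue
        # Assumption: on the minus strand, repLeft holds the start position
        # within the consensus and repStart holds -(bases remaining).
        if row["strand"] == "+":
            start = int(row["repStart"])
        else:
            start = int(row["repLeft"])
        end = int(row["repEnd"])
        if start <= max_start and end >= min_end:
            yield row


# Synthetic rows for illustration only (not real annotations).
demo = io.StringIO(
    "585\t30000\t10\t0\t0\tchr1\t1000\t7050\t-100\t+\tL1HS\tLINE\tL1\t1\t6050\t-5\t1\n"
    "585\t30000\t10\t0\t0\tchr1\t20000\t25980\t-100\t-\tL1PA2\tLINE\tL1\t-10\t6100\t120\t2\n"
    "585\t5000\t10\t0\t0\tchr1\t30000\t30900\t-100\t+\tL1PA3\tLINE\tL1\t5200\t6100\t-50\t3\n"
    "585\t2000\t10\t0\t0\tchr1\t40000\t40300\t-100\t+\tAluY\tSINE\tAlu\t1\t300\t-10\t4\n"
)
hits = [row["repName"] for row in full_length_l1(demo)]
```

The surviving rows carry `genoName`, `genoStart`, `genoEnd` and `strand`, which could then be matched back to entries in the TE GTF by coordinates.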

Thanks

BenxiaHu commented 1 month ago

Thanks. Is it possible to select full-length L1 based only on length, i.e. L1 copies longer than 6 kb? Best,

BenxiaHu commented 1 month ago

Based on this paper: "Genome-wide CRISPR–Cas9 screening for genes that control L1 expression in K562 cells"

Identification of L1s with 5′ UTR

To identify L1s with 5′ UTR, the sequence of all 1,001,410 L1s from RepeatMasker annotation was first extracted and entered into the makeblastdb program from BLASTn (version 2.11.0) to generate a DNA database. L1 consensus ORF1 protein sequence was searched against the L1 DNA database using tBLASTn (ref. 61). The L1s with ORF1-upstream sequence longer than 900 bp were regarded as L1s with 5′ UTR. Finally, a total number of 19,280 L1s were found and used in further analysis.

Do you think this method makes sense for extracting full-length L1?
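For reference, the final filtering step of the quoted method (ORF1-upstream sequence longer than 900 bp) might look something like the sketch below. It is not the paper's code. It assumes tBLASTn was run with the default tabular output (`-outfmt 6`, columns qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), that each L1 sequence was extracted in its sense orientation so that only forward hits are relevant, and that the ORF1 start can be extrapolated back from a partial hit by three nucleotides per missing residue. The function name `l1_with_5utr` is hypothetical, and the example hits are synthetic.

```python
def l1_with_5utr(blast_lines, min_upstream=900):
    """Return {sseqid: upstream_bp} for L1s whose ORF1 hit leaves
    more than min_upstream bp of sequence before the ORF1 start."""
    upstream_by_l1 = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        sseqid = fields[1]
        qstart = int(fields[6])
        sstart, send = int(fields[8]), int(fields[9])
        if sstart > send:
            # Hit on the reverse strand of the extracted sequence; ignored
            # under the sense-orientation assumption stated above.
            continue
        # Extrapolate to residue 1 of ORF1 (3 nt per amino acid), then count
        # the bases that precede it in the L1 sequence.
        orf1_start = sstart - 3 * (qstart - 1)
        upstream = orf1_start - 1
        # Keep the smallest upstream length per element (conservative choice).
        if sseqid not in upstream_by_l1 or upstream < upstream_by_l1[sseqid]:
            upstream_by_l1[sseqid] = upstream
    return {sid: up for sid, up in upstream_by_l1.items() if up > min_upstream}


# Synthetic tabular hits for illustration only.
example_hits = [
    "ORF1p\tL1_a\t98.0\t338\t5\t0\t1\t338\t909\t1922\t0.0\t650",
    "ORF1p\tL1_b\t95.0\t338\t10\t0\t1\t338\t201\t1214\t0.0\t600",
    "ORF1p\tL1_c\t90.0\t300\t20\t0\t1\t300\t2000\t1101\t1e-150\t500",
]
selected = l1_with_5utr(example_hits)
```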

olivertam commented 1 month ago

Hi,

You can certainly use the length of the L1 element in the GTF as a guide, though you have to be wary of truncated tandem concatenations (rather than intact full-length copies) contributing to the length. And yes, the method from that paper is certainly a good way to identify full-length L1.
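A length-only filter on the GTF could be sketched as below, as a first pass. Note the caveat above: length alone cannot tell an intact copy from a concatenation of truncated ones, so the output is a candidate list rather than a final answer. The sketch assumes the attribute column carries `transcript_id` and `family_id` keys (as in `gene_id "L1HS"; transcript_id "L1HS_dup1"; family_id "L1"; class_id "LINE";`); check this against your copy of the file. The name `long_l1_from_gtf` and the 6000 bp cutoff are illustrative, and the example lines are synthetic.

```python
import re

# Matches GTF attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def long_l1_from_gtf(lines, min_length=6000):
    """Yield (transcript_id, chrom, start, end, strand, length) for
    L1-family GTF entries longer than min_length bp."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        if attrs.get("family_id") != "L1":
            continue
        start, end = int(fields[3]), int(fields[4])
        length = end - start + 1  # GTF coordinates are 1-based, inclusive
        if length > min_length:
            yield (attrs.get("transcript_id"), fields[0], start, end,
                   fields[6], length)


# Synthetic GTF lines for illustration only.
example_gtf = [
    'chr1\thg38_rmsk\texon\t1001\t7050\t30000\t+\t.\t'
    'gene_id "L1HS"; transcript_id "L1HS_dup1"; family_id "L1"; class_id "LINE";',
    'chr1\thg38_rmsk\texon\t30001\t30900\t5000\t+\t.\t'
    'gene_id "L1PA3"; transcript_id "L1PA3_dup7"; family_id "L1"; class_id "LINE";',
    'chr1\thg38_rmsk\texon\t40001\t40300\t2000\t+\t.\t'
    'gene_id "AluY"; transcript_id "AluY_dup2"; family_id "Alu"; class_id "SINE";',
]
candidates = list(long_l1_from_gtf(example_gtf))
```

For a real file, the same function can be fed an open file handle, for example `long_l1_from_gtf(open(path))`, where `path` points at the downloaded GTF.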

Thanks.
