oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons; The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0

TSDs are not identical #174

Open ala98412 opened 3 months ago

ala98412 commented 3 months ago

Hi,

I am trying to determine the insertion target site of my LTR/Gypsy elements. Using 10:22687853..22691116_INT as an example, I assume that:

scaffold name = 10
start_position = 22687853
end_position = 22691116
intact = INT

According to the manual, target site duplications (TSD) should be the same at both the 5' and 3' ends. However, in my genome, the 5' TSD and 3' TSD are not the same. I have noticed some patterns, but I believe they should be identical.

Did I misunderstand something? Sorry for my naive question.

This is my Python3 script: seq[start_position - 1 - 5:start_position - 1], seq[end_position:end_position + 5]

Here is my result table:

5'TSD   3'TSD   Repeat_ID
TGACA   TGTAA   10:22687853..22691116_INT
TTACA   TGTCA   13:11754704..11757508_INT
TGACA   TGTAA   13:23094695..23096177_INT
TGACA   TGTAA   13:3475695..3477002_INT
TGACA   TGTAA   20:17957399..17960425_INT
CCACA   TGTAG   3:1605728..1611282_INT
CTACA   TGTGG   3:3264571..3270749_INT
TTACA   TGTGG   3:37194336..37198451_INT
TGACA   TGTCA   9:31434801..31436809_INT
TAACA   TGTAA   22:18312247..18314782_INT
TAACA   TGTCA   1:1966816..1970089_INT
TGACA   TGTAA   1:1249479..1253739_INT
TGACA   TGTCA   4:30995067..30999235_INT
TGACA   TGTCA   13:3164753..3167290_INT
CCACA   TGTAG   8:12600685..12607139_INT
TTACA   TGTCA   8:16012374..16016235_INT
CTACA   TGTAA   9:23014621..23021192_INT
GTACA   TGTGG   8:7106864..7111228_INT
AAACA   TGTTA   2:37860875..37866860_INT
TTACA   TGTCA   18:28467404..28470784_INT
TAACA   TGTCA   23:7707104..7709903_INT
GTACA   TGTGG   15:14569922..14574417_INT
ATACA   TGTTA   9:23918263..23924418_INT
TGACA   TGTCA   19:22608920..22612458_INT
CCACA   TGTAG   5:22942199..22948802_INT
TTACA   TGTAT   24:9397720..9401585_INT
CTTCA   TGTTG   3:40824394..40831408_INT
GTACA   TGTAG   6:34329395..34335570_INT
AAACA   TGTTA   2:37841501..37847533_INT
TTACA   TGTAA   12:17059040..17065533_INT
TAACA   TGTGA   1:21012603..21019343_INT
CAACA   TGTGG   3:2895580..2901609_INT
TGACA   TGTCA   17:26914882..26918649_INT
CTACA   TGTGG   17:1316150..1322707_INT
TGACA   TGTCA   5:3943017..3947283_INT
GCACA   TGTAT   13:30604096..30607812_INT
TGACA   TGTAA   7:19929823..19931152_INT
CTACA   TGTAA   7:9337970..9344181_INT
TTACA   TGTCA   1:32688615..32692545_INT
TTACA   TGTAG   5:764594..771458_INT
TTACA   ACTGT   16:22850998..22854184_INT
TTACA   TGTTA   2:15229994..15233935_INT
TTACA   TGTAA   10:12217720..12223558_INT
CTACA   AAGAT   4:6220653..6227592_INT
ACACA   TGTTA   10:5580976..5585395_INT
TGACA   TGTCA   3:7444737..7448764_INT
TGACA   TGTTA   2:12416092..12419661_INT
GCATG   AAAAA   23:8880995..8883358_INT
TGACA   TGTCA   10:12631883..12635627_INT
GCACA   TGTAC   8:7133855..7139321_INT
GGACA   TGTAG   20:22927000..22934272_INT
TTACA   TGTAA   17:2738792..2743075_INT
TTACA   TGTAA   8:12102948..12109802_INT
TAACA   TGTAA   8:17981899..17985045_INT
CTACA   TGTAA   7:11248909..11251237_INT
TTACA   TGTAA   6:20537116..20543571_INT
TTACA   TGTAG   17:14704587..14710662_INT
TGCCA   TGTTT   3:174059..179427_INT
TTACA   AGTGT   7:24423067..24429244_INT
TGACA   TGTCA   8:16576498..16579370_INT
CCACA   TGTAG   4:17854264..17857441_INT
TGACA   TGTCA   3:38208553..38212350_INT
CAACA   TGTCG   7:631159..636493_INT
ATCCA   TGTTA   1:1123017..1129006_INT
AGCCA   TGTAA   3:26336380..26348605_INT
TTACA   TGTCA   11:9892059..9893863_INT
CTACA   TGTGA   3:44276676..44282169_INT
CAACA   TGTAG   4:32696919..32701358_INT
TTACA   TGTCA   19:24017563..24019609_INT
TGACA   TGTCA   13:3224365..3227310_INT
TAACA   TGTCA   11:29824664..29826386_INT
CTACA   TGTGA   16:14646338..14652607_INT
TAACA   TGTCA   20:9902191..9906967_INT
CTACA   TGTTG   3:3130423..3134376_INT
TCACA   TGTAA   2:17612423..17618953_INT
TAACA   TGTCA   4:31576990..31580084_INT
TTACA   TGTGA   14:15475907..15481394_INT
TTACA   TGTTA   18:18699994..18703723_INT
TTACA   TGTAA   7:21620972..21626540_INT
TAACA   TGTCA   21:20045206..20048733_INT
TTACA   TGTAA   9:22109920..22116800_INT
TTACA   TGTGA   3:8418439..8424555_INT
CTACA   TGTTG   13:3964337..3967749_INT
TAACA   TGTCA   21:7028931..7030389_INT
TCACA   TGTAA   22:13751478..13754925_INT
TTACA   TGTAT   12:16121309..16125551_INT
TAACA   TTTGT   2:44761746..44767043_INT
CTACA   TGTAG   7:21033025..21039338_INT
TTACA   TGTGG   1:46125613..46129869_INT
CGACA   TGTTG   20:4090440..4099600_INT
TTACA   TGTAG   5:1168753..1174941_INT
TGACA   TGTAA   5:3743494..3747596_INT
TGACA   TGTAA   13:16891538..16894937_INT
TTACA   TGTGG   20:7399794..7403248_INT
TTACA   TGTAA   10:14750388..14756880_INT
TTACA   TGTAC   6:11915047..11920736_INT
TAACA   TGTTA   2:19450898..19452802_INT
TAACA   TGTAA   13:26158929..26163373_INT
CCACA   TGTAG   5:23026949..23031309_INT
CTACA   TGTAA   24:2312688..2318290_INT
CTTCA   TGTTG   2:11742344..11751068_INT
TAACA   TGTCA   8:15321620..15325915_INT
CAACA   TGAAG   6:13246814..13254187_INT
TGACA   TGTAA   5:23115747..23119460_INT
CTACA   TGTAG   15:14781989..14788545_INT
CCACA   TGTAG   6:32507841..32512069_INT
CAACA   TGTTA   13:23069820..23074981_INT
TCACA   TGTAG   15:14833322..14837442_INT
CAACA   TGTAG   3:3059958..3064368_INT
TTACA   TGTAA   6:11708287..11712617_INT
TGACA   TGTAA   12:19731181..19732808_INT
TCACA   TGTAA   11:13889865..13896492_INT
TTACA   TGTAA   14:15100784..15107507_INT
TGACA   TGTAA   4:16952220..16954047_INT

Best, Jui-Hung

oushujun commented 1 month ago

Hi Jui-Hung,

Sorry for the delayed response. "_INT" sequences are the internal sequences of LTR retrotransposons. TSDs flank the whole LTR element, so there won't be TSDs flanking "_INT" sequences. You will see that "_INT" sequences are flanked by CA and TG dinucleotides, because these are the terminal motifs of the LTR regions. An intact LTR element looks like: TSD-TG...(LTR)...CA----INT----TG...(LTR)...CA-TSD.
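
To illustrate the layout above, here is a minimal Python sketch on a made-up sequence. The helper name `flanks`, the toy TSD, and the toy LTR are hypothetical and exist only for this example; coordinates are assumed to be 1-based and inclusive, as in the script earlier in this thread. It shows that the 5 bp around the internal region end in CA and start with TG, while the 5 bp around the whole element are the identical TSD copies.

```python
def flanks(seq, start, end, k=5):
    """Return the k bases immediately upstream and downstream of the
    1-based, inclusive interval [start, end] on seq."""
    left = seq[max(0, start - 1 - k):start - 1]
    right = seq[end:end + k]
    return left, right


# Synthetic element: TSD-TG...(LTR)...CA----INT----TG...(LTR)...CA-TSD
tsd = "ACGTC"                    # toy 5 bp target site duplication
ltr = "TG" + "AAAAAA" + "CA"     # toy LTR with TG...CA termini
internal = "GGGGGGGGGG"          # toy internal ("_INT") region
element = ltr + internal + ltr
genome = "TTTTT" + tsd + element + tsd + "CCCCC"

# 1-based coordinates of the whole element and of its internal region
elem_start = 5 + len(tsd) + 1
elem_end = elem_start + len(element) - 1
int_start = elem_start + len(ltr)
int_end = int_start + len(internal) - 1

element_flanks = flanks(genome, elem_start, elem_end)    # the two TSD copies
internal_flanks = flanks(genome, int_start, int_end)     # LTR ends, not TSDs

print("element flanks: ", element_flanks)
print("internal flanks:", internal_flanks)
```

Applied to "_INT" coordinates, the same slicing returns the last bases of the 5' LTR and the first bases of the 3' LTR, which matches the ...CA / TG... pattern in the table above.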

Let me know if you have more questions! Shujun

ala98412 commented 1 month ago

Hi Oushujun,

Thank you for your reply.

I’m interested in studying target sites, similar to the research in this study (https://academic.oup.com/plcell/article/15/8/1771/6010085). Could I directly use the TSD positions from the pass list to search for patterns in the adjacent sequences?

Thank you.

Best, Jui-Hung

oushujun commented 1 month ago

Yes, of course!

Shujun
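
A rough sketch of how the TSD coordinates could be read from a pass list line and used to slice a sequence. The column order assumed here (element location, category, motif, TSD, 5' TSD interval, 3' TSD interval, ...) and the example row are assumptions made for illustration, so check them against the header of your own pass list file. The function names are hypothetical and are not part of LTR_retriever.

```python
import re

INTERVAL = re.compile(r"^(\d+)\.\.(\d+)$")


def parse_interval(field):
    """Turn 'start..end' into (start, end); return None if it does not match."""
    m = INTERVAL.match(field)
    return (int(m.group(1)), int(m.group(2))) if m else None


def parse_pass_line(line):
    """Parse one whitespace-separated pass list row (assumed column order)."""
    cols = line.split()
    chrom, _, loc = cols[0].rpartition(":")
    return {
        "chrom": chrom,
        "element": parse_interval(loc),
        "tsd_seq": cols[3].split(":", 1)[-1],   # e.g. 'TSD:ACGTC' -> 'ACGTC'
        "tsd5": parse_interval(cols[4]),         # 5' TSD interval
        "tsd3": parse_interval(cols[5]),         # 3' TSD interval
    }


def slice_1based(seq, start, end):
    """Extract the 1-based, inclusive interval [start, end] from seq."""
    return seq[start - 1:end]


# Synthetic row and sequence, made up for this example only
row = "chr1:11..40 pass motif:TGCA TSD:ACGTC 6..10 41..45 IN:21..30 1.0000 + Gypsy LTR 0"
genome = "TTTTT" + "ACGTC" + "TGAAAAAACA" + "GGGGGGGGGG" + "TGAAAAAACA" + "ACGTC" + "CCCCC"

rec = parse_pass_line(row)
tsd5 = slice_1based(genome, *rec["tsd5"])
tsd3 = slice_1based(genome, *rec["tsd3"])
print(rec["chrom"], tsd5, tsd3)
```

Rows where a TSD interval cannot be parsed come back as None from `parse_interval`, so they can be skipped before slicing.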
