parklab / xTea

Comprehensive TE insertion identification with WGS/WES data from multiple sequencing technologies

Very long length of L1 insertion #101

Closed sidi-yang closed 3 months ago

sidi-yang commented 3 months ago

Hi Simon,

This is the content of the VCF file for LINE1:

chr2 20308826 . . <INS:ME:LINE1> . PASS SVTYPE=INS:ME:LINE1;SVLEN=29295108;END=20308826;TSD=NULL;TSDLEN=-1;SUBTYPE=orphan_or_sibling_transduction;TD_SRC=chr2:29301158-29301332;STRAND=+;AF=0.18571428571428572;LCLIP=25;RCLIP=0;LDISC=18;RDISC=29;LPOLYA=11;RPOLYA=0;LRAWCLIP=13;RRAWCLIP=13;AF_CLIP=13;AF_FMAP=57;AF_DISC=59;AF_CONCORDNT=186;LDRC=0;LDNRC=0;RDRC=0;RDNRC=0;LCOV=83.805;RCOV=144.205;LD_AKR_RC=0;LD_AKR_NRC=0;RD_AKR_RC=0;RD_AKR_NRC=0;LC_CLUSTER=-1:-1;RC_CLUSTER=-1:-1;LD_CLUSTER=29301158:29301158;RD_CLUSTER=29301332:29301332;NINDEL=0;CLIP_LEN=49:4:54:51:2:43:28:37:41:63:32:35:60;INS_INV=Not-5prime-inversion;REF_REP=not_in_LINE1_copy;GENE_INFO=intron:ENSG00000055917.15:PUM2 GT ./.

I'm confused about the length of the insertion: it is very long (SVLEN=29295108) and has no TSD information (which I think is why POS and END are at the same breakpoint). The subtype of this insertion (SUBTYPE=orphan_or_sibling_transduction) is also confusing.

My question is: the source region is short (TD_SRC=chr2:29301158-29301332), so why is the length of the insertion 29295108 bp?

Could you please help me interpret this output? How should I explain it?

Thank you, Sidi

simoncchu commented 3 months ago

The length here is an error. I have noticed this before and will fix it in the next release. It is triggered mainly when the signal is confirmed on only one side of the insertion. In general, you can treat any insertion length longer than 15 kb as a wrong estimate. Note that this doesn't mean the insertion call is wrong, just that the length estimate is not accurate.
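
As a rough illustration of that rule of thumb, here is a minimal Python sketch that reads the INFO column of a record like the one above and reports whether its SVLEN can be trusted. It is not part of xTea: the helper names `parse_info` and `length_is_reliable` and the constant `MAX_PLAUSIBLE_LEN` are hypothetical, and the 15,000 bp cutoff assumes the "15k" threshold is meant in base pairs.

```python
# Assumed cutoff in bp, taken from the rule of thumb in this thread.
MAX_PLAUSIBLE_LEN = 15_000


def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a key -> value dict (bare flags map to True)."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def length_is_reliable(info: str, max_len: int = MAX_PLAUSIBLE_LEN) -> bool:
    """Return True only if SVLEN is present, numeric, and at most max_len."""
    svlen = parse_info(info).get("SVLEN")
    if svlen is None or svlen is True:
        return False
    try:
        return abs(int(svlen)) <= max_len
    except ValueError:
        return False


# Shortened INFO string from the record in this issue.
example_info = "SVTYPE=INS:ME:LINE1;SVLEN=29295108;END=20308826;TSD=NULL;TSDLEN=-1"
print(length_is_reliable(example_info))  # False: keep the call, ignore the length
```

A record that fails this check is still a candidate insertion. Only its SVLEN value should be disregarded in downstream analysis.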