parklab / xTea

Comprehensive TE insertion identification with WGS/WES data from multiple sequencing technics

output vcf #77

Closed Sun-Wu-Kong-666 closed 1 year ago

Sun-Wu-Kong-666 commented 1 year ago

This is the record in the VCF:

chr6 47345039 . A . PASS SVTYPE=INS:ME:ALU;SVLEN=402;END=47345049;TSD=+AAAGAGCCCC;TSDLEN=10;SUBTYPE=one_half_side_tprt_both;TD_SRC=not_transduction;STRAND=+;AF=1.0;LCLIP=2;RCLIP=1;LDISC=0;RDISC=2;LPOLYA=2;RPOLYA=0;LRAWCLIP=2;RRAWCLIP=2;AF_CLIP=2;AF_FMAP=0;AF_DISC=3;AF_CONCORDNT=0;LDRC=0;LDNRC=0;RDRC=2;RDNRC=0;LCOV=1.91;RCOV=1.09;LD_AKR_RC=0;LD_AKR_NRC=0;RD_AKR_RC=2;RD_AKR_NRC=0;LC_CLUSTER=460:460;RC_CLUSTER=58:58;LD_CLUSTER=-1:-1;RD_CLUSTER=262.0:273.0;NINDEL=0;CLIP_LEN=6:58;INS_INV=Not-5prime-inversion;REF_REP=not_in_Alu_copy;GENE_INFO=exon:ENST00000663906.1:3:ENSG00000287485 GT ./.

1. This Alu insertion is 402 bp. Since an Alu element is a long sequence, how can I find the precise Alu sequence that was inserted at this position (chr6:47345039)?

2. Alu elements can be classified into many subfamilies, as shown below, and in Fig. 2A-C of your Nature Communications paper the Alu/L1/SVA insertions are classified into subfamilies in detail. However, the VCF does not show the specific subfamily that the insertion belongs to, so how can I determine the exact subfamily inserted at this position?

Accession | Name | Classification | Clades | Description | Length
DF0000002 | AluY | Alu | Primates | AluY subfamily | 311
DF0000003 | AluSc | Alu | Primates | AluSc subfamily | 309
DF0000007 | AluJb | Alu | Primates | AluJb subfamily | 312
DF0000034 | AluJo | Alu | Primates | AluJo subfamily | 312
DF0000035 | AluJr | Alu | Primates | AluJr subfamily | 312
DF0000036 | AluJr4 | Alu | Primates | AluJr4 subfamily | 312
DF0000037 | AluSc5 | Alu | Primates | AluSc5 subfamily | 309
DF0000038 | AluSc8 | Alu | Primates | AluSc8 subfamily | 311
DF0000039 | AluSg | Alu | Primates | AluSg subfamily | 310
DF0000040 | AluSg4 | Alu | Primates | AluSg4 subfamily | 310
DF0000041 | AluSg7 | Alu | Primates | AluSg7 subfamily | 309
DF0000042 | AluSp | Alu | Primates | AluSp subfamily | 313
DF0000043 | AluSq | Alu | Primates | AluSq subfamily | 313
DF0000044 | AluSq10 | Alu | Primates | AluSq10 subfamily | 313
DF0000045 | AluSq2 | Alu | Primates | AluSq2 subfamily | 313
DF0000046 | AluSq4 | Alu | Primates | AluSq4 subfamily | 311
DF0000047 | AluSx | Alu | Primates | AluSx subfamily | 312
DF0000048 | AluSx1 | Alu | Primates | AluSx1 subfamily | 312
DF0000049 | AluSx3 | Alu | Primates | AluSx3 subfamily | 311
DF0000050 | AluSx4 | Alu | Primates | AluSx4 subfamily | 310
DF0000051 | AluSz | Alu | Primates | AluSz subfamily | 312
DF0000052 | AluSz6 | Alu | Primates | AluSz6 subfamily | 312
DF0000053 | AluYa5 | Alu | Hominidae | AluYa5 subfamily | 311
DF0000054 | AluYa8 | Alu | Hominidae | AluYa8 subfamily | 310
DF0000055 | AluYb8 | Alu | Hominidae | AluYb8 subfamily | 318
DF0000056 | AluYb9 | Alu | Hominidae | AluYb9 subfamily | 318
DF0000057 | AluYc | Alu | Primates | AluYc subfamily | 299
DF0000058 | AluYc3 | Alu | Primates | AluYc3 subfamily | 300
DF0000060 | AluYd8 | Alu | Hominidae | AluYd8 subfamily | 299
DF0000063 | AluYh9 | Alu | Hominidae | AluYh9 subfamily | 311
DF0000064 | AluYk11 | Alu | Hominidae | AluYk11 subfamily | 311
DF0000065 | AluYk12 | Alu | Hominidae | AluYk12 subfamily | 311
DF0000066 | AluYk4 | Alu | Hominidae | AluYk4 subfamily | 311
DF0000634 | AluYg6 | Alu | Hominidae | AluYg6 subfamily | 311
DF0001145 | AluYk3 | Alu | Hominoidea | AluYk3 subfamily | 311
DF0001154 | AluYm1 | Alu | Hominoidea | AluYm1 subfamily | 311
DF0001169 | AluYk2 | Alu | Hominoidea | AluYk2 subfamily | 311
DF0001174 | AluYe6 | Alu | Primates | AluYe6 subfamily | 310
DF0001197 | AluYi6 | Alu | Hominidae | AluYi6 subfamily | 311
DF0001240 | AluYe5 | Alu | Primates | AluYe5 subfamily | 310
DF0001316 | AluYi6_4d | Alu | Homo sapiens | AluYi6_4d subfamily | 311
DF0001317 | AluYf1 | Alu | Hominoidea | AluYf1 subfamily | 311
DF0001318 | AluYh3 | Alu | Hominidae | AluYh3 subfamily | 311
DF0001320 | AluYj4 | Alu | Homo sapiens | AluYj4 subfamily | 311
DF0001321 | AluYh7 | Alu | Homo sapiens | AluYh7 subfamily | 311
DF0290022 | Walusat | Unknown | Homo sapiens | Repetitive element found in the T2T assembly of the CHM13 human genome | 64

simoncchu commented 1 year ago

For your questions:

1. This Alu insertion is 402 bp. Since an Alu element is a long sequence, how can I find the precise Alu sequence that was inserted at this position (chr6:47345039)?

In theory, we could run a local assembly to reconstruct the two ends of an insertion, but this function is not exposed in the current release. I'll add it in the next release.

2. Alu elements can be classified into many subfamilies, as shown below, and in Fig. 2A-C of your Nature Communications paper the Alu/L1/SVA insertions are classified into subfamilies in detail. However, the VCF does not show the specific subfamily that the insertion belongs to, so how can I determine the exact subfamily inserted at this position?

The subfamily annotation in Fig. 2A-C comes from the benchmark, where the insertion sequences were extracted from long reads, not from short reads. Because of the limited insert size of short reads, most of the time we cannot reconstruct the complete insertion sequence from short reads.
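Although the subfamily is not reported, the record quoted in this thread does carry the TE family, the estimated insertion length, and the TSD in its INFO column. A minimal Python sketch for reading those fields is below. The helper name `parse_info` is hypothetical (it is not part of xTea), and the INFO string is a shortened copy of the one in the question; it only splits `key=value` pairs and does not interpret what each xTea tag means.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a dict.

    Hypothetical helper for illustration, not an xTea function.
    Keys without '=' (flag-style entries) are mapped to True.
    """
    fields = {}
    for item in info.strip().split(";"):
        if not item:
            continue
        # partition() splits on the first '=' only, so values may contain '='.
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# Shortened copy of the INFO column from the record quoted in this thread.
INFO = "SVTYPE=INS:ME:ALU;SVLEN=402;END=47345049;TSD=+AAAGAGCCCC;TSDLEN=10;STRAND=+"

fields = parse_info(INFO)
te_family = fields["SVTYPE"].split(":")[-1]  # last token of INS:ME:ALU
ins_len = int(fields["SVLEN"])               # estimated insertion length in bp

print(te_family, ins_len, fields["TSD"])
```

This only recovers what the caller already wrote to the VCF; it does not give the inserted sequence or the subfamily, which, as noted above, generally cannot be reconstructed from short reads.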

blaverty commented 3 months ago

Hi! Wondering if you added the function to output the insertion sequence? Thanks!