parklab / xTea

Comprehensive TE insertion identification with WGS/WES data from multiple sequencing technics

POS, SVLEN, and END are seemingly discordant in VCF output #72

Closed RyanVidegar-Laird closed 1 year ago

RyanVidegar-Laird commented 1 year ago

Hi, thanks for all of your work on this tool!

I've run xTea (with defaults) on 4 short-read WGS samples and am a bit confused about why the POS, END, and SVLEN values don't seem to agree in the VCF output. I would expect END = POS + SVLEN, yet that doesn't hold in any of my samples for Alu or L1 SVs. Is this an error? I'm new to working with SVs, so perhaps it's my misunderstanding.

Small output example:

awk '!/orphan/' ./xtea/out/sample-01_ALU.vcf | bcftools query -f'[%CHROM\t%POS\t%INFO/SVLEN\t%END\n]' - | shuf -n 5 | awk '{$5 = $4-$2}1' | column -t

CHR   POS        SVLEN  END        END-POS
chr6  49594200   269    49594216   16
chr9  96301414   276    96301428   14
chr8  19801418   292    19801427   9
chr2  64936633   274    64936649   16
chr8  127025867  378    127025879  12

simoncchu commented 1 year ago

These are insertions, which means the inserted sequences are absent from the reference genome. Here, SVLEN is the insertion length, so it is not expected to equal END - POS. In general, an insertion has a single breakpoint on the genome; for TE insertions, however, the two strands usually do not break at exactly the same location (search for target-site duplication in L1 retrotransposons to learn more). Thus, two breakpoints are reported here (POS and END). Hope this helps.
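
To make the distinction concrete, here is a minimal Python sketch that reuses the five rows from the question and treats END - POS and SVLEN as two independent quantities. The class name `InsertionCall`, the helper `parse_rows`, and the property `breakpoint_gap` are hypothetical names chosen for illustration and are not part of xTea. Reading the gap as the span between the two strand breakpoints follows the explanation above; the exact coordinate convention xTea uses is an assumption here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InsertionCall:
    """One TE insertion record, using the columns shown in the question."""
    chrom: str
    pos: int    # first reported breakpoint (VCF POS)
    svlen: int  # length of the inserted sequence (INFO/SVLEN)
    end: int    # second reported breakpoint (INFO/END)

    @property
    def breakpoint_gap(self) -> int:
        # Distance between the two reported breakpoints on the reference.
        # Assumption: this reflects the offset between the two strand
        # breaks (target-site duplication), not the insertion length.
        return self.end - self.pos


def parse_rows(text: str) -> list[InsertionCall]:
    """Parse whitespace-separated CHROM POS SVLEN END rows."""
    calls = []
    for line in text.strip().splitlines():
        chrom, pos, svlen, end = line.split()[:4]
        calls.append(InsertionCall(chrom, int(pos), int(svlen), int(end)))
    return calls


# Rows copied from the example output in the question.
ROWS = """
chr6 49594200 269 49594216
chr9 96301414 276 96301428
chr8 19801418 292 19801427
chr2 64936633 274 64936649
chr8 127025867 378 127025879
"""

calls = parse_rows(ROWS)
for call in calls:
    print(call.chrom, call.pos, "SVLEN =", call.svlen,
          "END - POS =", call.breakpoint_gap)
```

In all five records the gap between POS and END is between 9 and 16 bp, while SVLEN is several hundred bp, which is consistent with the two fields describing different things: a short interval on the reference versus the length of sequence that is not in the reference at all.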