monarch-initiative / SvAnna

Efficient and accurate pathogenicity prediction for coding and regulatory structural variants in long-read genome sequencing

Feature request: processing of VCFs from Delly #243

Open jessmewald opened 7 months ago

jessmewald commented 7 months ago

Hi there,

We would like to process VCF output from the Delly caller with SvAnna, if possible. Below is a subset of the errors we encounter:

    _____       ___
   / ___/_   __/   |  ____  ____  ____ _
   \__ \| | / / /| | / __ \/ __ \/ __ `/
  ___/ /| |/ / ___ |/ / / / / / / /_/ /
 /____/ |___/_/  |_/_/ /_/_/ /_/\__,_/

 Structural Variant Annotation and Analysis
                           :: v1.0.5-SNAPSHOT ::

 16:06:45.640 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Using 4 phenotype features supplied via CLI
 16:06:45.645 [main] INFO  o.m.svanna.cli.cmd.SvAnnaCommand - Spooling up SvAnna v1.0.5-SNAPSHOT using resources in /svanna_db_2304_hg38
 16:06:54.489 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Reading variants from `NA12878_hg38_pbmm2_delly.vcf.gz`
 16:06:54.584 [main] WARN  o.m.svanna.io.parse.VcfVariantParser - Invalid variant `chr1-10991221:(DUP00000246)`: Illegal DUP!changeLength:0. Should be > 0 given coordinates 1:10991222-10994549 -><DUP>
 16:06:54.605 [main] WARN  o.m.svanna.io.parse.VcfVariantParser - Invalid variant `chr1-33051198:(DUP00000472)`: Illegal DUP!changeLength:0. Should be > 0 given coordinates 1:33051199-64384519 -><DUP>
 16:06:54.606 [main] WARN  o.m.svanna.io.parse.VcfVariantParser - Invalid variant `chr1-34635820:(DEL00000486)`: Illegal DEL changeLength:0. Should be < 0 given coordinates  1:34635821-34646375 -><DEL>
 16:06:54.621 [main] WARN  o.m.svanna.io.parse.VcfVariantParser - Invalid variant `chr1-72300640:(DEL00000752)`: Illegal DEL changeLength:0. Should be < 0 given coordinates  1:72300641-72346156 -><DEL>
 16:06:54.622 [main] WARN  o.m.svanna.io.parse.VcfVariantParser - Invalid variant `chr1-73129298:(DUP00000758)`: Illegal DUP!changeLength:0. Should be > 0 given coordinates 1:73129299-155634876 -><DUP>
 16:06:54.622 [main] WARN  o.m.svanna.io.parse.VcfVariantParser - Invalid variant `chr1-73130124:(DEL00000759)`: Illegal DEL changeLength:0. Should be < 0 given coordinates  1:73130125-155627485 -><DEL>
 16:06:54.642 [main] WARN  o.m.svanna.io.parse.VcfVariantParser - Invalid variant `chr1-143184588:(DUP00001195)`: Illegal DUP!changeLength:0. Should be > 0 given coordinates 1:143184589-143202283 -><DUP>
 16:06:54.642 [main] WARN  o.m.svanna.io.parse.VcfVariantParser - Invalid variant `chr1-143184588:(DUP00001196)`: Illegal DUP!changeLength:0. Should be > 0 given coordinates 1:143184589-143207373 -><DUP>
 16:06:54.643 [main] WARN  o.m.svanna.io.parse.VcfVariantParser - Invalid variant `chr1-143184612:(DUP00001197)`: Illegal DUP!changeLength:0. Should be > 0 given coordinates 1:143184613-143200532 -><DUP>
 16:06:54.643 [main] WARN  o.m.svanna.io.parse.VcfVariantParser - Invalid variant `chr1-143184612:(DUP00001199)`: Illegal DUP!changeLength:0. Should be > 0 given coordinates 1:143184613-143221923 -><DUP>
 16:06:54.643 [main] WARN  o.m.svanna.io.parse.VcfVariantParser - Invalid variant `chr1-143184612:(DUP00001200)`: Illegal DUP!changeLength:0. Should be > 0 given coordinates 1:143184613-143241846 -><DUP>
 16:06:54.645 [main] WARN  o.m.svanna.io.parse.VcfVariantParser - Invalid variant `chr1-143191627:(DEL00001254)`: Illegal DEL changeLength:0. Should be < 0 given coordinates  1:143191628-143211409 -><DEL>
 16:06:54.648 [main] WARN  o.m.svanna.io.parse.VcfVariantParser - Invalid variant `chr1-143200193:(DUP00001327)`: Illegal DUP!changeLength:0. Should be > 0 given coordinates 1:143200194-143203550 -><DUP>
 16:06:54.650 [main] WARN  o.m.svanna.io.parse.VcfVariantParser - Invalid variant `chr1-143206014:(DUP00001359)`: Illegal DUP!changeLength:0. Should be > 0 given coordinates 1:143206015-143208676 -><DUP>
 .....
 16:06:55.589 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Read 32,785 variants
 16:06:55.589 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Filtering out the variants with reciprocal overlap >80.0% occurring in more than 1.0% probands
 16:06:55.589 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Filtering out the variants where ALT allele is supported by less than 3 reads
 16:07:18.997 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Prioritizing 32,785 variants on 2 threads
 16:07:19.017 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 1
 16:07:19.017 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 1
 16:07:19.017 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 1
 16:07:19.017 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 1
 16:07:19.372 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 1
 16:07:19.372 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 1
 16:07:19.373 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 1
 16:07:19.373 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 1
 16:07:19.373 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 1
 16:07:19.484 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 2
 16:07:19.484 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 2
 16:07:19.484 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 2
 16:07:19.484 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 2
 16:07:19.485 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 2
 16:07:19.485 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 2
 16:07:19.519 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 1
 16:07:19.519 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 1
 16:07:19.519 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 1
 16:07:19.519 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 1
 16:07:19.519 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 1
 16:07:19.519 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 1
 16:07:19.519 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 1
 16:07:19.519 [svanna-worker-2] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 1
 ...
 16:07:30.831 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Prioritization finished in 0m 11s (11,833 ms) processing on average 2,770.64 items/s
 16:07:30.832 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Writing out the results
 16:07:30.838 [main] INFO  o.m.s.cli.writer.vcf.VcfResultWriter - Writing VCF results into NA12878_hg38_delly_svanna.vcf.gz
 16:07:34.893 [main] INFO  o.m.s.c.w.t.TabularResultWriter - Writing tabular results into NA12878_hg38_delly_svanna.csv.gz
 16:07:35.677 [main] INFO  o.m.s.c.writer.html.HtmlResultWriter - Writing HTML results to NA12878_hg38_delly_svanna.html
 16:07:41.997 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - We're done, bye!

And the header + a few lines of calls from Delly are below:

 ##fileformat=VCFv4.2
 ##FILTER=<ID=PASS,Description="All filters passed">
 ##fileDate=20231201
 ##ALT=<ID=DEL,Description="Deletion">
 ##ALT=<ID=DUP,Description="Duplication">
 ##ALT=<ID=INV,Description="Inversion">
 ##ALT=<ID=BND,Description="Translocation">
 ##ALT=<ID=INS,Description="Insertion">
 ##FILTER=<ID=LowQual,Description="Poor quality and insufficient number of PEs and SRs.">
 ##INFO=<ID=CIEND,Number=2,Type=Integer,Description="PE confidence interval around END">
 ##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="PE confidence interval around POS">
 ##INFO=<ID=CHR2,Number=1,Type=String,Description="Chromosome for POS2 coordinate in case of an inter-chromosomal translocation">
 ##INFO=<ID=POS2,Number=1,Type=Integer,Description="Genomic position for CHR2 in case of an inter-chromosomal translocation">
 ##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
 ##INFO=<ID=PE,Number=1,Type=Integer,Description="Paired-end support of the structural variant">
 ##INFO=<ID=MAPQ,Number=1,Type=Integer,Description="Median mapping quality of paired-ends">
 ##INFO=<ID=SRMAPQ,Number=1,Type=Integer,Description="Median mapping quality of split-reads">
 ##INFO=<ID=SR,Number=1,Type=Integer,Description="Split-read support">
 ##INFO=<ID=SRQ,Number=1,Type=Float,Description="Split-read consensus alignment quality">
 ##INFO=<ID=CONSENSUS,Number=1,Type=String,Description="Split-read consensus sequence">
 ##INFO=<ID=CONSBP,Number=1,Type=Integer,Description="Consensus SV breakpoint position">
 ##INFO=<ID=CE,Number=1,Type=Float,Description="Consensus sequence entropy">
 ##INFO=<ID=CT,Number=1,Type=String,Description="Paired-end signature induced connection type">
 ##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Insertion length for SVTYPE=INS.">
 ##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
 ##INFO=<ID=PRECISE,Number=0,Type=Flag,Description="Precise structural variation">
 ##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
 ##INFO=<ID=SVMETHOD,Number=1,Type=String,Description="Type of approach used to detect SV">
 ##INFO=<ID=INSLEN,Number=1,Type=Integer,Description="Predicted length of the insertion">
 ##INFO=<ID=HOMLEN,Number=1,Type=Integer,Description="Predicted microhomology length using a max. edit distance of 2">
 ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
 ##FORMAT=<ID=GL,Number=G,Type=Float,Description="Log10-scaled genotype likelihoods for RR,RA,AA genotypes">
 ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
 ##FORMAT=<ID=FT,Number=1,Type=String,Description="Per-sample genotype filter">
 ##FORMAT=<ID=RC,Number=1,Type=Integer,Description="Raw high-quality read counts or base counts for the SV">
 ##FORMAT=<ID=RCL,Number=1,Type=Integer,Description="Raw high-quality read counts or base counts for the left control region">
 ##FORMAT=<ID=RCR,Number=1,Type=Integer,Description="Raw high-quality read counts or base counts for the right control region">
 ##FORMAT=<ID=RDCN,Number=1,Type=Integer,Description="Read-depth based copy-number estimate for autosomal sites">
 ##FORMAT=<ID=DR,Number=1,Type=Integer,Description="# high-quality reference pairs">
 ##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality variant pairs">
 ##FORMAT=<ID=RR,Number=1,Type=Integer,Description="# high-quality reference junction reads">
 ##FORMAT=<ID=RV,Number=1,Type=Integer,Description="# high-quality variant junction reads">
 ##reference=/projects/clia/clia-LRS/hg38_noalt/hg38.no_alt.fa
 ##contig=<ID=chr1,length=248956422>
 ##contig=<ID=chr10,length=133797422>
 ##contig=<ID=chr11,length=135086622>
 ##contig=<ID=chr11_KI270721v1_random,length=100316>
 ##contig=<ID=chr12,length=133275309>
 ##contig=<ID=chr13,length=114364328>
 ##contig=<ID=chr14,length=107043718>
 ##contig=<ID=chr14_GL000009v2_random,length=201709>
 ##contig=<ID=chr14_GL000225v1_random,length=211173>
 ##contig=<ID=chr14_KI270722v1_random,length=194050>
 ##contig=<ID=chr14_GL000194v1_random,length=191469>
 ##contig=<ID=chr14_KI270723v1_random,length=38115>
 ##contig=<ID=chr14_KI270724v1_random,length=39555>
 ##contig=<ID=chr14_KI270725v1_random,length=172810>
 ##contig=<ID=chr14_KI270726v1_random,length=43739>
 ##contig=<ID=chr15,length=101991189>
 ##contig=<ID=chr15_KI270727v1_random,length=448248>
 ##contig=<ID=chr16,length=90338345>
 ##contig=<ID=chr16_KI270728v1_random,length=1872759>
 ##contig=<ID=chr17,length=83257441>
 ##contig=<ID=chr17_GL000205v2_random,length=185591>
 ##contig=<ID=chr17_KI270729v1_random,length=280839>
 ##contig=<ID=chr17_KI270730v1_random,length=112551>
 ##contig=<ID=chr18,length=80373285>
 ##contig=<ID=chr19,length=58617616>
 ##contig=<ID=chr1_KI270706v1_random,length=175055>
 ##contig=<ID=chr1_KI270707v1_random,length=32032>
 ##contig=<ID=chr1_KI270708v1_random,length=127682>
 ##contig=<ID=chr1_KI270709v1_random,length=66860>
 ##contig=<ID=chr1_KI270710v1_random,length=40176>
 ##contig=<ID=chr1_KI270711v1_random,length=42210>
 ##contig=<ID=chr1_KI270712v1_random,length=176043>
 ##contig=<ID=chr1_KI270713v1_random,length=40745>
 ##contig=<ID=chr1_KI270714v1_random,length=41717>
 ##contig=<ID=chr2,length=242193529>
 ##contig=<ID=chr20,length=64444167>
 ##contig=<ID=chr21,length=46709983>
 ##contig=<ID=chr22,length=50818468>
 ##contig=<ID=chr22_KI270731v1_random,length=150754>
 ##contig=<ID=chr22_KI270732v1_random,length=41543>
 ##contig=<ID=chr22_KI270733v1_random,length=179772>
 ##contig=<ID=chr22_KI270734v1_random,length=165050>
 ##contig=<ID=chr22_KI270735v1_random,length=42811>
 ##contig=<ID=chr22_KI270736v1_random,length=181920>
 ##contig=<ID=chr22_KI270737v1_random,length=103838>
 ##contig=<ID=chr22_KI270738v1_random,length=99375>
 ##contig=<ID=chr22_KI270739v1_random,length=73985>
 ##contig=<ID=chr2_KI270715v1_random,length=161471>
 ##contig=<ID=chr2_KI270716v1_random,length=153799>
 ##contig=<ID=chr3,length=198295559>
 ##contig=<ID=chr3_GL000221v1_random,length=155397>
 ##contig=<ID=chr4,length=190214555>
 ##contig=<ID=chr4_GL000008v2_random,length=209709>
 ##contig=<ID=chr5,length=181538259>
 ##contig=<ID=chr5_GL000208v1_random,length=92689>
 ##contig=<ID=chr6,length=170805979>
 ##contig=<ID=chr7,length=159345973>
 ##contig=<ID=chr8,length=145138636>
 ##contig=<ID=chr9,length=138394717>
 ##contig=<ID=chr9_KI270717v1_random,length=40062>
 ##contig=<ID=chr9_KI270718v1_random,length=38054>
 ##contig=<ID=chr9_KI270719v1_random,length=176845>
 ##contig=<ID=chr9_KI270720v1_random,length=39050>
 ##contig=<ID=chrM,length=16569>
 ##contig=<ID=chrX,length=156040895>
 ##contig=<ID=chrY,length=57227415>
 ##contig=<ID=chrY_KI270740v1_random,length=37240>
 ##bcftools_viewVersion=1.9+htslib-1.9
 ##bcftools_viewCommand=view NA12878_hg38_pbmm2_delly.bcf; Date=Fri Dec  1 20:46:30 2023
 ##bcftools_viewCommand=view NA12878_hg38_pbmm2_delly.vcf.gz; Date=Mon Feb 26 16:20:51 2024
 #CHROM POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  UnnamedSample
 chr1   10862   INS00000000 G   GCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCATGCTAGCGCGTCCAGGGGAGGAGGCGTGGCA   67  LowQual PRECISE;SVTYPE=INS;SVMETHOD=EMBL.DELLYv1.1.7;END=10862;SVLEN=68;PE=0;MAPQ=0;CT=NtoN;CIPOS=-13,13;CIEND=-13,13;SRMAPQ=16;INSLEN=68;HOMLEN=15;SR=4;SRQ=0.931035;CONSENSUS=CTAACCCGAACCCGAACCCGAACCCGAACCCGAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCCAACCCCTAACCCTAACCCTAACCCTAACCCGAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCTCACCCCAACCCCCACCCCCACCCCCACCCTCAACCCTCAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCTCTACCCCCACCCCCACCCCCACCCCCACCCCAACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTACCACCGAAATCTGTGCAGAGGACAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCATGCTAGCGCGTCCAGGGGAGGAGGCGTGGCACAGGCGCAGAGACACATGCTAGCGCGCCCAGGGGAGGAGGCGTGGCGCAGGCGCAGAGAGGCGCGCCGTGCTGCCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGGTGGAGGCGTGGCGCAGGCGCAGAGACGCACGCCTACGGGCGGGGTTGGGGGGGGCGTGTGTTACAGGAGCAAAGTCGCACGGCGCCGGGCTGGGGGCGGGGGGGGGGGGGCGCCGTGCACGCGCAGAAACTCACGTCACGGCGGCGCGGCGCAGAGACGGGTGGAACCTCAGTAATCCGAAACGCCGGGATCGACAGCCCCTTGCTTGCAGCCGGGCACTACAGGACCCGCTTGCTCACGGTGCTGTGCCAGGGCGCCCCCTGCTGGCGACTAGGGCAACTGCAGGGCTCTCTTGCTTAGAGTGGTGGCCAGCGCCCCCTGCTGGCGCCGGGGCACTGCAGGGCCCTCTTGCTTACTGTATAGTGTGGGGCACGCCGCCTGCTGGCAGCTAGGGACATTGCAGGGTCCTCTTGCTCAAGGAGTAGTGGCAGCACGCCCGCCTGCTGGCAGCTGGGGACACTGCCGGGCCCTCTTGCTCCAACAGTAGTGGCGGATTATAGGGAAACACCCGGAGCATATGCTGTTTGGTCTCAGTAGGCTCCTAAATATGGGATTCCTGGGTTTAAAA
GTATAAAATAAATATGTTTAATTTGTTAACTGATTACCATCAGAATTGTACTGTTCTGTATCCCACCAGCAATGTCTAGGAATGCCTGTTTCTCCACAAAGTGTTTACTTTTGGATTTTTGCCAGTCTAACAGGTGAAGCCCTGGAGATTCTTATTAGTGATTTGGGCTGGGGCCTGGCCATGTGTATTTTTTTAAATTTCCACTGATGATTTTGCTGCATGGCCGGTGTTGAGAATGACTGCGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTATTTTCGTTAACTTGCCGTCAGC;CE=1.94211;CONSBP=969   GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV 0/1:-5.49588,0,-0.197838:4:LowQual:10342:15644:5302:2:0:0:0:4
 chr1   43177   DEL00000001 TAAATATGAAGAATATTATAAATCATATCAATAACCACAACATTCAAGCTGTCAGTTTGAATAGACAATGTAAATGACAAAACTACATACTCAACAAGATAACAGCAAACCAGCTTCGACAGCACGTTAAAGGGGTCATACAACATAATCGAGTAGAATTTATCTCTGAGATGCAAGAATGGTTCAAAATATGGAAACCAATAAATGTGATATGCCACACTAACAGAATAAAAAATAAAAATCATATTATCATCTCAATAGATGCAGAAAAAGCATTAACAAAAGTAAAC  T   48  LowQual PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv1.1.7;END=43466;PE=0;MAPQ=0;CT=3to5;CIPOS=-2,2;CIEND=-2,2;SRMAPQ=12;INSLEN=0;HOMLEN=1;SR=4;SRQ=0.978193;CONSENSUS=GCTGAATTACCCATGCAAAACCTTAATACTTGACACTTATCACTACTTTATTCAAGAGCCTATTGTTCTCAACTCTGCTCATTAATACTATGCTTGGAGTATACAGTAAGATAAGAAACATAAATAAGAAGTGTACATTTGTTTCTTCCTGTTTTCTTCTGGCTATTGGATCAATTACATCCCATCTTAAGCTGACCCCTGTGTAATTAATCAATATCCGTTTTAAGCAGCAATCCATAGTTGTGCAGAAATTAGAAAACTGACCCACACAGAAAAACTAATTGTGAGAACCAATATTACACTAAATTCATTTGACAATTCTCAGCAAAGTGCTGGGTTGATCTCTATTTATGCTTTTCTTAAACACACAAAATACAAAAGTTAACCCATATGGAATGCAATGGAGGAAATCAATGACATATCAGATCTAGAAACTAATCAATTAGCAATCAGGAAGGAGTTGCGGTAGGAAGTCTGTGCTGTTGAATGTACACTAATCAATGATTCCTTAAATTATTCACAATAAAAAAAAAGATTAGAATAGTTTTTTTTTAAAAAAAGCCCAGAAACTAATCTAAGTTTTGTCTGGTAATAAAGGTATATTTTCAAAAGAGAGGTAAATAGATCCACATACTGTGGAGGGAATAAAATACTTTTTGAAAAACAAACAACAAGTTGGATTTTTAGACACATAGAAATTGAATATGTACATTTATAAATATTTTTGGATTGAACTATTTCAAAATTATACCATAAAATAACTTGTAAAAATGTAGGCAAAATGTATATAATTATGGCATGAGGTATGCAACTTTAAGCAAGGAAGCAAAAGCAGAAACCATGAAAAAAGTCTAAATTTTACCATATTGAATTTAAATTTTCAAAAACAAAAATAAAGACAAAGTGGGAAAAATATGTATGCTTCATGTGTGACAAGCCACTGATATTTTATTCTTTCATAATAAGACATCAGATAAAACAAATTAGGAATAGAAGGAATGTACCGCAACACAATAAAGGCCATATATAACAAGCCCACAGCTAACATCATAATAGTAAAATCATCACAATGGTAAAAAAAATGAAAGCTTTTCCTCTAAGGTCAGAAATAATATAAAGGTTCCCACTCTTGCTATTTCTATTCCATATAGTACTAAAAGTCCTAGCCAGGACAATTAGACAAAATAAAAATAAAAACACCCAAATTGGAAAGATAGAAGCAAACTTTTCTGTTTACAGATAACATAATCTTATATGTAGAAACCCCTTAAAACTTCAGCAAAAAAAAAAAAAAAAAAAAACTACAGAGCTAGTAAATTCAGTGAAGTTGCAGAATACAAAATCAACATACAAAAATCAGTAGTGTCTCTATACACTAATAAGGACTTAACAGAGAAAGAAGTTAAGAAAACAATACCACTAACAATAGAATCCAAAAAATAAAATACTTAGGAATAAATTTTACCAAACATCTGTACACTAAAAACTATAAAACATTGAAAAAAGAAGTT
GAATAAGACACATATAAATAGAAAGCTATCTCATGTTAATAGATTAGAAAAAGTAATATTGTTAAGATGTCCTCACTACTTAAAGCAATTTATAGATCTAATGCATTTATTGCAATCTCTTCAAAATCCCAAAGGTATTTTTGACAGAAATAAAAAAAAAATTCTAAAATATGCATGAAACCACAAAAGACTGTGAATAGCTAAAGCAATCTTGAGCAAGATGAACAACACTGGAAGCATCACACTACCTTATTTCAAAATCTACTACAAAGCTATAGTGATCAAAGCAACATGATACTGTCATAAAAACACATAGATAAACCTATGGAATGGAATAAAGAGCACAGAAATAAGTCCACACATTTACATTCAATTGATTTTCAACAAC;CE=1.84833;CONSBP=950  GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV 1/1:-2.11242,-0.542892,0:6:LowQual:1006:872:1013:1:0:0:2:2
 chr1   54712   INS00000002 T   TTTTTTTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTC  14  LowQual PRECISE;SVTYPE=INS;SVMETHOD=EMBL.DELLYv1.1.7;END=54712;SVLEN=53;PE=0;MAPQ=0;CT=NtoN;CIPOS=-7,7;CIEND=-7,7;SRMAPQ=1;INSLEN=53;HOMLEN=8;SR=10;SRQ=0.987275;CONSENSUS=TAAATAAAATGTGAACTTAGGCAAATTATAAATTAATAAAGTATATTTTTAAAATTTCCATTTTAATTTCTGTTTAAATTAGAATAAGAAACAAAAACAACTATGTAATACGTGTGCAAAGCCCTGAACTGAGATTTGACTTTACCTTGAGCTTTGTCAGTTTACGATGCTATTTCAGTTTTGTGCTCAGATTTGAGTGATTGCAGGAAGAGAATAAATTTCTTTAATGCTGTCAAGACTTTAAATAGATACAGACAGAGCATTTTCACTTTTTCCTGCATCTCTATTATTCTAAAAATGAGAACATTCCAAAAGTCAATCATCCAAGTTTATTCTAAATAGATGTGTAGAAATAACAGTTGTTTCACAGGAGACTAATCGCCCAAGGATATGTGTTTAGAGGTACTGGTTTCTTAAATAAGGTTTTCTAGTCAGGCAAAAGATTCCCTGGAGCTTATGCATCTGTGGTTGATATTTTGGGATAAGAATAAAGCTAGAAATGGTGAGGCATATTCAATTTCATTGAAGATTTCTGCATTCAAAATAAAAACTCTATTGAAGTTACACATACTTTTTTCATGTATTTGTTTCTACTGCTTTGTAAATTATAACAGCTCAATTAAGAGAAACCGTACCTATGCTGTTTTGTCCTGTGACTCTCCAAGAACCTTCCTAAGTTATTCTACTTAATTGCTTTATCACTCATATGAATGGGAATTTCTTCTCTTAATTGCTGCTAATCTCCCCCATCTTCAAATACTCTACCGGGCTTCTGGAACACCACAGCTTCCTGGCTTTTTCTCCTACCTCCTGGGCAAGTCCTTCCCTGTGTCTTTTGTTGAGTGTTCCTCATCTGCTTAACCACCAATCAACCTATTGCCCCTAATTTGATCTTTGGCCTGTTTTCACTTAGATTCTATCCCTACGTATCACCCATTCCCACAGCTTTAATTACCATCTAAACACTAGGGGCTCTCAAACCTTCTATTTTTTTTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTCTTCCTCCTTTTCTTTCCTTTTCTTTCTTTCATTCTTTCTTTCTTTTTTAAGGGGCAGGGTCTCACTATGTTACTGAGGCTGGTCTCAAACTCCTGACCTCAAGCAATCTGTCTGCTTCAGCCTCCCAAGTAGCTGAGAATACAGGGACAAGCCATTGCACCTGACCCTGGTACTATTTCTTGAGTTCCTGATCCACAGATCTAACCTCCTACTTTCCTGGATGCCACACAAGATCTTCCACTCAACAAGTCTGCAACTAAACTAGCCTTCCTCTTTTCAAACCTACTCTTCTTTCAGTGTTCTCAGTCACAAAAATTTGTACCAACTAGTTACCTAGTTGCACAACCCAAAATCTGGGAAAAATAATAGATTTCTTTCTCCATAGTACCCCAAAATCAATAAATCATCAAGTCTTATTCTACCTTCCAAAGAGCCTTACATATGTTCCTTTATTTTCATCTGTAACACCACTATTCCTGTCTAAGCCTACCTATGTCATTTTTGGAAGAGAATATAGTCACCTATGTGATCTTCCCACTTAAAATCCTATTATCTATGCTTCAGTAAAAGAAAAAAAATTTTTAATCTAAGTATGTAATTCTTTTGCTAAAGACACTTCACATGCTTCTGTGCC
CTTAAACTGGTATGTTATCATGGTATAGTAGGCCATCCAAGACCTGGCTTCCTTCCTTTTTTTCAGTCTCAGAGAATAACGTACTCTTTCCCTGCAACTCCAGATCCAATTTGGTTTTCTTTTACTTGCCTGGAAACTTCAAATTCTATCAACTCTGGGGCTTTCCACTAGCTAATCATTTTGTATACAATATTTGTCCTTCATGTTTTGCCTCTTAACATCTCAGCTTTCAGTTTCATCATTTTACCAGGGAGGCCTCCCAGAACCTGAGTCCAGAAGAGTTCCTTCCATTGTATATTCCTCTAGCACTACCTAT;CE=1.90976;CONSBP=989  GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV 0/0:0,-3.28138,-23.3388:33:PASS:10976:22465:11489:2:0:0:7:8
 chr1   66534   INS00000003 T   TATATATTATATAAATATAATATATATAATATATATTATATAAATATAATATATATAATATAATATATATTATATAAATATAATATATATTTTATTATATAATATAATATATATAATATAATATAAATTATATAAATATAATATATATTTTATTATATAATATAATATATATTATATAATATAATATATTTTATTATATAAATATATATTATATTCTATATAATATAATATATATTTTATTATATAATATATATTATATATTTATAGAATATAATATATATTTTATTATATAATATATATTATATAATATAATATATATTATATTTATATATAACATATATTATTATATAAAATATGTACTATATATTATATAAA 26  LowQual PRECISE;SVTYPE=INS;SVMETHOD=EMBL.DELLYv1.1.7;END=66534;SVLEN=374;PE=0;MAPQ=0;CT=NtoN;CIPOS=-7,7;CIEND=-7,7;SRMAPQ=5;INSLEN=374;HOMLEN=7;SR=5;SRQ=0.944791;CONSENSUS=TTTTGCTGTGATTCTTTAAAAAGCACCTTTAGACTTAGTGAGATAGCAAAAATATCCAAATAGGCCAAAAAATTGTGGCAATGTCCTCTCACTCAGGAAAATTCTGTGTGTTTTCTCTAATGGCCAAGGGAAAACTTGTGAGACTATAAAAGTTAGTCTCAGTACACAAAGCTCAGACTGGCTATTCCCAGATCTCTTCAGGTACATCTAGTCCATTCATAAAGGGCTTTTAATTAACCAAGTGGTTTACTAAAAAGGACAATTCACTACATATTATTCTCTTACAGTTTTTATGCTTCATTCTGTGAAAATTGCTGTAGTCTCTTCCAGTTATGAAGAAGGTAGGTGGAAACAAAGATAAAACACATATATTAGAAGAATGAATGAAATTGTAGCATTTTATTGACAATGAGATGATTCTATTAGTAGGAATCTATTCTGCATAATTCCATTTTGTGTTTACCTTCTGGAAAAATGAAAGGATTCTGTATGGTTAACTTAAATACTTAGAGGAATTAATATGAATAATGTTAGCAAGAATAACCCTTGTTATAAGTATTATGCCGGCAACAATTGTCGAGTCCTCCTCCTCACTCTTCTGGGCTAATTTGTTCTTTTCTCCCCATTTAATAGTCCTTTGCCCCATCTTTCCCCAGGTCCGGTGTTTTCTTACCCACCTCCTTCCCTCCTTTTTATAATACCAGTGAAACTTGGTTTGGAGCATTTCTTTCACATAAAGGTACAAATCATACTGCTAGAGTTGTGAGGATTTTTAGAGCTTTTGAAAGAATAAACTCATTTTAAAAACAGGAAAGCTAAGGCCCAGAGATTTTTAAATGATATTCCCATGATCACACTGTGAATTTGTGCCAGAACCCAAATGCCTACTCCCATCTCACTGAGACTTACTATAAGGACATAAGGCATTTTTATATATATATATTATATATACTATATATTTATATATATTACATATTATATATATAATATTATATATATAATATATATTATATTATATATATAATATATATAATATAATATATTATATATATTATATATATAATATATATAATATATTATATATATTATATATATAATATATATAATATATATAATATAATATATTTTATATATATATATTATATAATATATATATATTTTATATATATATTATATAATATAATATATATATTATATATATAATATATATATAATATAATATAATATAATATATTATATTATATAATATATGATATAAATATAATATATATTTTATTATATAATATAATATATATAATATAATATATATTATATAAATATAATATATATAATATATATTATATAAATATAATATATATAATATAATATATATTATATAAATATAATATATATTTTATTATATAATATAATATATAT
AATATAATATAAATTATATAAATATAATATATATTTTATTATATAATATAATATATATTATATAATATAATATATTTTATTATATAAATATATATTATATTCTATATAATATAATATATATTTTATTATATAATATATATTATATATTTATAGAATATAATATATATTTTATTATATAATATATATTATATAATATAATATATATTATATTTATATATAACATATATTATTATATAAAATATGTACTATATATTATATAAATATATTTATATATTATATAAATATATTTATATATTATATAAATATATATATTATATAAATATATTTATATATTATATAAATATATATATTATATATAATTCTAATGGTTGAATTCCAAGAATAATCTATGGCATGAAAGATTTTACCTGTCAACAGTGGCTGGCTCTTCATGGTTGCTACAATGAGTGTGTAAGATTCTGAAGGACTCCTTTAATAAGCCTAAACTTAATGTTCAACTTAGAATAAATACAATTCTTCTAAATTTTTTTGAATAATTTTTGAAAAGTCAGAAATGAGCTTTGAAAGAATTATGGTGGTGAAGGATCCCCTCAGCAGCACAAATTCAGGAGAGAGATGTCTTAACTACGTTAGCAAGAAATTCCTTTTGCTAAAGAATAGCATTCCTGAATTCTTACTAACAGCCATGATAGAAAGTCTTTTGCTACAGATGAGAACCCTCGGGTCAACCTCATCCTTGGCATATTTCATGTGAAGATATAACTTCAAGATTGTCCTTGCCTATCAATGAAATGAATTAATTTTATGTCAATGCATATTTAAGGTCTATTCTAAATTGCACACTTTGATTCAAAAGAAACAGTCCAACCAACCAGTCAGGACAGAAATTATCTCACAATAAAAATCCTATCATTTGTACTGTCAATGATTAGTATGATTATATTTATTACCGTGCTAAG;CE=1.773;CONSBP=1303    GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV 0/0:0,-0.480512,-9.23288:6:LowQual:7991:16482:8491:2:0:0:6:5

Let us know if this would be possible, and what additional information you need from us. Thanks!

ielis commented 7 months ago

Hi @jessmewald

I looked into this. However, based on the errors, there is probably little I can do, because I think the VCF does not follow the VCF 4.2 specification.

The VCF seems to have some issues: some SVLEN fields appear to be wrong. For instance, based on the log line

Invalid variant `chr1-10991221:(DUP00000246)`: Illegal DUP!changeLength:0. Should be > 0 given coordinates 1:10991222-10994549 -><DUP>

I expect the VCF to contain a symbolic duplication with the identifier DUP00000246 that has SVLEN=0 in the INFO field. This looks odd, since the coordinates 1:10,991,222-10,994,549 indicate a ~3.3 kb duplication. Therefore, the field should be something like SVLEN=3328 (10,994,549 - 10,991,222 + 1).

This is because the definition of the SVLEN info field includes the following:

##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">

One value for each ALT allele. Longer ALT alleles (e.g. insertions) have positive values, shorter ALT alleles (e.g. deletions) have negative values.

So, again, SVLEN should be positive for a duplication, where the ALT allele is longer.
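To make the sign convention concrete, here is a minimal Python sketch of the kind of consistency check implied by the definition above. The function name `svlen_is_consistent` is made up for illustration, and this is not SvAnna's actual validation code; it only restates the rule that longer ALT alleles get a positive SVLEN and shorter ones a negative SVLEN.

```python
def svlen_is_consistent(svtype: str, svlen: int) -> bool:
    """Check the sign of SVLEN against the SV type (illustrative sketch only)."""
    if svtype in ("DUP", "INS"):
        # ALT allele is longer than REF, so the length difference must be positive.
        return svlen > 0
    if svtype == "DEL":
        # ALT allele is shorter than REF, so the length difference must be negative.
        return svlen < 0
    # Other types (e.g. INV, BND) are not checked in this sketch.
    return True
```

Under this rule, a DUP or DEL record with a length of 0 fails the check, which matches the "Should be > 0" and "Should be < 0" messages in the log.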

Moreover, Delly seems to use the SVLEN field for another purpose, just to store the length of an insertion:

##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Insertion length for SVTYPE=INS.">

However, SVLEN is a reserved VCF field, so it should be used for its intended purpose: to store the length difference for all symbolic variants, not only for insertions, with meaningless values left for the other variant types.

I am not sure that the SvAnna code base is the place to fix these errors. Hopefully, the Delly authors will fix this bug and produce valid VCF files.

So, to fix this in the short term, you'll probably need to write a Python script that sets the SVLEN field to a correct value calculated from the coordinates, and run the script as part of your pipeline, right after Delly variant calling. It should be possible to calculate the length from the POS and END fields for all symbolic variants except for INS. I can help with checking the script; I've been staring at variant coordinates long enough to develop some skills. Please let me know if I can help.
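As a starting point, here is a minimal sketch of such a script, using only the Python standard library. It makes several assumptions that you should check against your data: the input is an uncompressed, tab-separated VCF; only symbolic `<DEL>` and `<DUP>` records are rewritten; SVLEN is taken as END - POS, negated for deletions; and INS, INV, BND and sequence-resolved records are passed through unchanged. The names `fix_svlen` and `rewrite` are made up for this example, and the SVLEN header line is not modified.

```python
# Sign of SVLEN for the symbolic ALT alleles handled by this sketch:
# negative when ALT is shorter than REF, positive when it is longer.
SIGN = {"<DEL>": -1, "<DUP>": 1}


def fix_svlen(line: str) -> str:
    """Return a VCF line with SVLEN recomputed from POS and END.

    Header lines and records other than symbolic <DEL>/<DUP> are returned unchanged.
    """
    if line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8 or fields[4] not in SIGN:
        return line
    info = fields[7].split(";")
    # Flags such as PRECISE have no '=', so they are skipped when building the lookup.
    keyvals = dict(item.split("=", 1) for item in info if "=" in item)
    if "END" not in keyvals:
        return line
    svlen = SIGN[fields[4]] * (int(keyvals["END"]) - int(fields[1]))
    # Drop any existing SVLEN entry and append the recomputed one.
    info = [item for item in info if not item.startswith("SVLEN=")]
    info.append(f"SVLEN={svlen}")
    fields[7] = ";".join(info)
    return "\t".join(fields) + newline


def rewrite(src, dst) -> None:
    """Copy a VCF from one text stream to another, fixing SVLEN on the way."""
    for line in src:
        dst.write(fix_svlen(line))
```

One way to use it would be to call `rewrite(sys.stdin, sys.stdout)` from a small wrapper and place it between `bcftools view` and `bgzip` in the pipeline. For the DUP00000246 record from the log (POS 10991221, END 10994549), this would write SVLEN=3328.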