monarch-initiative / SvAnna

Efficient and accurate pathogenicity prediction for coding and regulatory structural variants in long-read genome sequencing
32 stars 4 forks source link

Priotization output in HTML vs TSV report #239

Closed AidanGCr closed 7 months ago

AidanGCr commented 10 months ago

Hello,

I am running SvAnna for prioritization of short-read variants from MetaSV output. While I am getting an output in both the noted formats, the HTML format is not prioritizing any variants that are scored in the TSV format, some of which have no noted failed filter in the TSV file. This said, all of my variants are failing "low alt allele count" column in the HTML output. This field is, to my knowledge, undocumented, and after reading the paper/searching the docs, I couldn't find a definitive answer to what this corresponds to. Would this be why none of the SV's are being prioritized in the HTML?

Thanks, Aidan

ielis commented 10 months ago

Hi @AidanGCr thanks a lot for raising this issue.

My best guess is that the issue is caused by the VCF format. SV callers report the read depth in an inconsistent way. Some callers store the depth into AD field, others use DR or DV. We developed SvAnna to work with pbsv, Sniffles, and svim (see Table S1 of SvAnna manuscript).

To further troubleshoot this issue, I would like to ask you for an excerpt of your VCF file. E.g. first 500 lines should be enough. Can you please attach the VCF to this issue? First, I would like to replicate the issue on my end and then work on the solution.

Thanks & all the best, Daniel

AidanGCr commented 10 months ago

Hi Daniel,

Thank you for the super prompt response! I had suspected that it may be an issue with the VCF depth annotation fields (or lack thereof), as one thing I left out is the number of warnings I am getting. For example, checking the job output (I am on an HPC using slurm) with...

cat slurm-19441329.out | grep "SUPPORT" | head -10

yields...

13:46:33.992 [main] WARN  o.m.s.i.p.VariantCallAttributeParser - Unable to cast `SUPPORT` attribute: [11, 9]
13:46:33.998 [main] WARN  o.m.s.i.p.VariantCallAttributeParser - Unable to cast `SUPPORT` attribute: [6, 4]
13:46:34.009 [main] WARN  o.m.s.i.p.VariantCallAttributeParser - Unable to cast `SUPPORT` attribute: [13, 12]
13:46:34.013 [main] WARN  o.m.s.i.p.VariantCallAttributeParser - Unable to cast `SUPPORT` attribute: [5, 8]
13:46:34.027 [main] WARN  o.m.s.i.p.VariantCallAttributeParser - Unable to cast `SUPPORT` attribute: [12, 9]
13:46:34.040 [main] WARN  o.m.s.i.p.VariantCallAttributeParser - Unable to cast `SUPPORT` attribute: [11, 11]
13:46:34.046 [main] WARN  o.m.s.i.p.VariantCallAttributeParser - Unable to cast `SUPPORT` attribute: [17, 17]
13:46:34.053 [main] WARN  o.m.s.i.p.VariantCallAttributeParser - Unable to cast `SUPPORT` attribute: [3, 2]
13:46:34.056 [main] WARN  o.m.s.i.p.VariantCallAttributeParser - Unable to cast `SUPPORT` attribute: [4, 2]
13:46:34.071 [main] WARN  o.m.s.i.p.VariantCallAttributeParser - Unable to cast `SUPPORT` attribute: [13, 10]

Also wc -l on the grep yields 45648 such lines. This I believe directly implicates the allelic depths portion of the if/else block in VariantCallAttributeParser. I'm not sure how you are parsing the VCF header in this code (my bad, I haven't looked into it too much), however, I will attach here the header. I am not sure I can give variant data as these are in-house sequenced clinical germline calls, however, I will check with my PI when I see them today. The header (as observed with bcftools view -h:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=/cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh38/genome/Homo_sapiens.GRCh38.fa
##fileDate=2023-01-18
##source=/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/bin/run_metasv.py --boost_sc --filter_gaps --keep_standard_contigs --mean_read_length 150 --num_threads 8 --reference /cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh38/genome/Homo_sapiens.GRCh38.fa --lumpy_vcf output/Sample_ID/Sample_ID.lumpy.germline.noBND.vcf.gz --manta_vcf /scratch/ruili/OncoBr_WGS/SV_analysis/output/Sample_ID/Sample_ID.manta.germline.noBND.vcf.gz --cnvkit_vcf SVariants/Sample_ID/Sample_ID.cnvkit.germline.vcf.gz --wham_vcf SVariants/Sample_ID/Sample_ID.wham.germline.vcf.gz --gatk_vcf alignment/Sample_ID/Sample_ID.hc.flt.vcf.gz --breakseq_vcf SVariants/Sample_ID/Sample_ID.breakseq.germline.vcf.gz --bam alignment/Sample_ID/Sample_ID.sorted.dup.recal.bam --sample Sample_ID --spades spades.py --age age_align --workdir SVariants/ensemble/Sample_ID/rawMetaSV --outdir SVariants/ensemble/Sample_ID --isize_mean 325 --isize_sd 50
##INFO=<ID=BKPTID,Number=.,Type=String,Description="ID of the assembled alternate allele in the assembly file">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=SVCNV,Number=0,Type=Flag,Description="Structural variation or copy number variation">
##INFO=<ID=MEINFO,Number=4,Type=String,Description="Mobile element info of the form NAME,START,END,POLARITY">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=NORMAL_COUNT,Number=1,Type=Integer,Description="Number of normal reads supporting reference">
##INFO=<ID=NUM_READS,Number=1,Type=Integer,Description="Number of reads supporting event">
##INFO=<ID=SCORE,Number=1,Type=Integer,Description="Score reported by tool">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">
##INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Sequence of base pair identical micro-homology at event breakpoints">
##INFO=<ID=METRANS,Number=4,Type=String,Description="Mobile element transduction info of the form CHR,START,END,POLARITY">
##INFO=<ID=DGVID,Number=1,Type=String,Description="ID of this element in Database of Genomic Variation">
##INFO=<ID=DBVARID,Number=1,Type=String,Description="ID of this element in DBVAR">
##INFO=<ID=DBRIPID,Number=1,Type=String,Description="ID of this element in DBRIP">
##INFO=<ID=MATEID,Number=.,Type=String,Description="ID of mate breakends">
##INFO=<ID=PARID,Number=1,Type=String,Description="ID of partner breakend">
##INFO=<ID=EVENT,Number=1,Type=String,Description="ID of event associated to breakend">
##INFO=<ID=CILEN,Number=2,Type=Integer,Description="Confidence interval around the length of the inserted material between breakends">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth of segment containing breakend">
##INFO=<ID=DPADJ,Number=.,Type=Integer,Description="Read Depth of adjacency">
##INFO=<ID=CN,Number=1,Type=Integer,Description="Copy number of segment containing breakend">
##INFO=<ID=CNADJ,Number=.,Type=Integer,Description="Copy number of adjacency">
##INFO=<ID=CICN,Number=2,Type=Integer,Description="Confidence interval around copy number for the segment">
##INFO=<ID=CICNADJ,Number=.,Type=Integer,Description="Confidence interval around copy number for the adjacency">
##INFO=<ID=SVTOOL,Number=1,Type=String,Description="Tool used to generate the SV">
##INFO=<ID=SOURCES,Number=.,Type=String,Description="List of original raw SV calls as Toolname:Start:End:Size">
##INFO=<ID=NUM_SVMETHODS,Number=1,Type=Integer,Description="Number of methods supporting the event">
##INFO=<ID=NUM_SVTOOLS,Number=1,Type=Integer,Description="Number of SV callers supporting the event">
##INFO=<ID=BA_FLANK_PERCENT,Number=1,Type=Float,Description="Breakpoint assembly: percent of total length in flanks">
##INFO=<ID=BA_NFRAGS,Number=1,Type=Float,Description="Breakpoint assembly: Number of fragments in the alignment">
##INFO=<ID=BA_NUM_ALT,Number=1,Type=Float,Description="Breakpoint assembly: Number of alternative regions">
##INFO=<ID=BA_NUM_BP,Number=1,Type=Float,Description="Breakpoint assembly: Number of breakpoints">
##INFO=<ID=BA_NUM_GOOD_REC,Number=1,Type=Float,Description="Breakpoint assembly: Number of good records in assembly">
##INFO=<ID=BA_PERCENT_MATCH,Number=1,Type=Float,Description="Breakpoint assembly: percent of matching bases in alignment">
##INFO=<ID=BA_BP_SCORE,Number=1,Type=Float,Description="Breakpoint assembly: breakpoint score">
##INFO=<ID=VT,Number=1,Type=String,Description="indicates what type of variant the line represents">
##INFO=<ID=SVMETHOD,Number=.,Type=String,Description="Type of approach used to detect SV: RP (read pair), RD (read depth), SR (split read), JT (junction) or AS (assembly)">
##INFO=<ID=CHR2,Number=1,Type=String,Description="Chromosome for END coordinate in case of a translocation">
##INFO=<ID=POS2,Number=1,Type=String,Description="The new position of the translocated or duplicated region. Used of ITX, CTX, and IDP SVs">
##INFO=<ID=BD_CHR1,Number=1,Type=String,Description="BreakDancer: chr1 field">
##INFO=<ID=BD_POS1,Number=1,Type=Integer,Description="BreakDancer: pos1 field">
##INFO=<ID=BD_ORI1,Number=1,Type=String,Description="BreakDancer: orientation1 field">
##INFO=<ID=BD_CHR2,Number=1,Type=String,Description="BreakDancer: chr2 field">
##INFO=<ID=BD_POS2,Number=1,Type=Integer,Description="BreakDancer: pos2 field">
##INFO=<ID=BD_ORI2,Number=1,Type=String,Description="BreakDancer: orientation2 field">
##INFO=<ID=BD_SCORE,Number=1,Type=Float,Description="BreakDancer: score field">
##INFO=<ID=BD_SUPPORTING_READ_PAIRS,Number=1,Type=Integer,Description="BreakDancer: num_Reads field">
##INFO=<ID=BS_START,Number=1,Type=Integer,Description="BreakSeq: start field">
##INFO=<ID=BS_END,Number=1,Type=Integer,Description="BreakSeq: end field">
##INFO=<ID=BS_SCORE,Number=1,Type=Integer,Description="BreakSeq: score field">
##INFO=<ID=BS_FILTER,Number=1,Type=String,Description="BreakSeq: filter field">
##INFO=<ID=BS_ABC_COUNTS,Number=3,Type=Integer,Description="BreakSeq: counts for various breakpoint types">
##INFO=<ID=BS_PE_COUNT,Number=1,Type=Integer,Description="BreakSeq: paired support for the SV">
##INFO=<ID=CN_NORMALIZED_RD,Number=1,Type=Float,Description="CNVnator: Normalized RD">
##INFO=<ID=CN_EVAL1,Number=1,Type=Float,Description="CNVnator: e-val by t-test">
##INFO=<ID=CN_EVAL2,Number=1,Type=Float,Description="CNVnator: e-val by Gaussian tail">
##INFO=<ID=CN_EVAL3,Number=1,Type=Float,Description="CNVnator: e-val by t-test (middle)">
##INFO=<ID=CN_EVAL4,Number=1,Type=Float,Description="CNVnator: e-val by Gaussian tail (middle)">
##INFO=<ID=CN_Q0,Number=1,Type=Float,Description="CNVnator: Fraction of reads with 0 mapping quality">
##INFO=<ID=PD_NUM_NT_ADDED,Number=.,Type=Integer,Description="Pindel: number of nucleotides added">
##INFO=<ID=PD_NT_ADDED,Number=.,Type=String,Description="Pindel: sequenced of nucleotides added">
##INFO=<ID=PD_BP_RANGE_START,Number=1,Type=Integer,Description="Pindel: breakpoint range start">
##INFO=<ID=PD_BP_RANGE_END,Number=1,Type=Integer,Description="Pindel: breakpoint range end">
##INFO=<ID=PD_READ_SUPP,Number=1,Type=Integer,Description="Pindel: number of reads supporting the SV">
##INFO=<ID=PD_UNIQ_READ_SUPP,Number=1,Type=Integer,Description="Pindel: number of uniquely mapped reads supporting the SV">
##INFO=<ID=PD_UP_READ_SUPP,Number=1,Type=Integer,Description="Pindel: number of up reads supporting the SV">
##INFO=<ID=PD_UP_UNIQ_READ_SUPP,Number=1,Type=Integer,Description="Pindel: number of up uniquely mapped reads supporting the SV">
##INFO=<ID=PD_DOWN_READ_SUPP,Number=1,Type=Integer,Description="Pindel: number of down reads supporting the SV">
##INFO=<ID=PD_DOWN_UNIQ_READ_SUPP,Number=1,Type=Integer,Description="Pindel: number of down uniquely mapped reads supporting the SV">
##INFO=<ID=PD_SIMPLE_SCORE,Number=1,Type=Integer,Description="Pindel: simple score for the SV">
##INFO=<ID=PD_SUM_MAPQ,Number=1,Type=Integer,Description="Pindel: sum of mapq of the supporting reads">
##INFO=<ID=PD_NUM_SAMPLE,Number=1,Type=Integer,Description="Pindel: number of samples">
##INFO=<ID=PD_NUM_SAMPLE_SUPP,Number=1,Type=Integer,Description="Pindel: number of supporting samples">
##INFO=<ID=PD_NUM_SAMPLE_UNIQ_SUPP,Number=1,Type=Integer,Description="Pindel: number of samples with unique supporting reads">
##INFO=<ID=PD_HOMLEN,Number=1,Type=Integer,Description="Pindel: length of sequence homology around the breakpoints">
##INFO=<ID=PD_HOMSEQ,Number=.,Type=String,Description="Pindel: homologous sequence around the breakpoints">
##INFO=<ID=SC_READ_SUPPORT,Number=.,Type=Integer,Description="Soft-Clip: Number of soft-clip supporting reads">
##INFO=<ID=SC_NEIGH_SUPPORT,Number=.,Type=Integer,Description="Soft-Clip: Number of reads near the soft-clip location that support SV event">
##INFO=<ID=SC_COVERAGE,Number=.,Type=Integer,Description="Soft-Clip: Read coverage in the soft-clip location">
##INFO=<ID=SC_LOCATIONS,Number=.,Type=Integer,Description="Soft-Clip: Locations of soft-clips for the SV">
##INFO=<ID=IDP_INTERVALS,Number=.,Type=String,Description="Intervals used to merge into interspersed duplication">
##INFO=<ID=ITX_INTERVALS,Number=.,Type=String,Description="Intervals used to merge into intra-chromosomal translocation">
##INFO=<ID=CTX_INTERVALS,Number=.,Type=String,Description="Intervals used to merge into inter-chromosomal translocation">
##INFO=<ID=INSERTION_SEQUENCE,Number=1,Type=String,Description="Sequence inserted for long insertions">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number genotype for imprecise events">
##FORMAT=<ID=CNQ,Number=1,Type=Float,Description="Copy number genotype quality for imprecise events">
##FORMAT=<ID=CNL,Number=.,Type=Float,Description="Copy number genotype likelihood for imprecise events">
##FORMAT=<ID=NQ,Number=1,Type=Integer,Description="Phred style probability score that the variant is novel with respect to the genome's ancestor">
##FORMAT=<ID=HAP,Number=1,Type=Integer,Description="Unique haplotype identifier">
##FORMAT=<ID=AHAP,Number=1,Type=Integer,Description="Unique identifier of ancestral haplotype">
##FILTER=<ID=LowQual,Description="Low Quality">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=DEL:ME:L1,Description="Deletion of L1 element">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=DUP:TANDEM,Description="Tandem Duplication">
##ALT=<ID=IDP,Description="Interspersed Duplication">
##ALT=<ID=INS,Description="Insertion of novel sequence">
##ALT=<ID=INS:ME:ALU,Description="Insertion of ALU element">
##ALT=<ID=INS:ME:L1,Description="Insertion of L1 element">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=CNV,Description="Copy number variable region">
##ALT=<ID=ITX,Description="Intra-chromosomal translocation">
##ALT=<ID=CTX,Description="Inter-chromosomal translocation">
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>
##contig=<ID=chr3,length=198295559>
##contig=<ID=chr4,length=190214555>
##contig=<ID=chr5,length=181538259>
##contig=<ID=chr6,length=170805979>
##contig=<ID=chr7,length=159345973>
##contig=<ID=chr8,length=145138636>
##contig=<ID=chr9,length=138394717>
##contig=<ID=chr10,length=133797422>
##contig=<ID=chr11,length=135086622>
##contig=<ID=chr12,length=133275309>
##contig=<ID=chr13,length=114364328>
##contig=<ID=chr14,length=107043718>
##contig=<ID=chr15,length=101991189>
##contig=<ID=chr16,length=90338345>
##contig=<ID=chr17,length=83257441>
##contig=<ID=chr18,length=80373285>
##contig=<ID=chr19,length=58617616>
##contig=<ID=chr20,length=64444167>
##contig=<ID=chr21,length=46709983>
##contig=<ID=chr22,length=50818468>
##contig=<ID=chrX,length=156040895>
##contig=<ID=chrY,length=57227415>
##contig=<ID=chrM,length=16569>
##contig=<ID=chr1_KI270706v1_random,length=175055>
##contig=<ID=chr1_KI270707v1_random,length=32032>
##contig=<ID=chr1_KI270708v1_random,length=127682>
##contig=<ID=chr1_KI270709v1_random,length=66860>
##contig=<ID=chr1_KI270710v1_random,length=40176>
##contig=<ID=chr1_KI270711v1_random,length=42210>
##contig=<ID=chr1_KI270712v1_random,length=176043>
##contig=<ID=chr1_KI270713v1_random,length=40745>
##contig=<ID=chr1_KI270714v1_random,length=41717>
##contig=<ID=chr2_KI270715v1_random,length=161471>
##contig=<ID=chr2_KI270716v1_random,length=153799>
##contig=<ID=chr3_GL000221v1_random,length=155397>
##contig=<ID=chr4_GL000008v2_random,length=209709>
##contig=<ID=chr5_GL000208v1_random,length=92689>
##contig=<ID=chr9_KI270717v1_random,length=40062>
##contig=<ID=chr9_KI270718v1_random,length=38054>
##contig=<ID=chr9_KI270719v1_random,length=176845>
##contig=<ID=chr9_KI270720v1_random,length=39050>
##contig=<ID=chr11_KI270721v1_random,length=100316>
##contig=<ID=chr14_GL000009v2_random,length=201709>
##contig=<ID=chr14_GL000225v1_random,length=211173>
##contig=<ID=chr14_KI270722v1_random,length=194050>
##contig=<ID=chr14_GL000194v1_random,length=191469>
##contig=<ID=chr14_KI270723v1_random,length=38115>
##contig=<ID=chr14_KI270724v1_random,length=39555>
##contig=<ID=chr14_KI270725v1_random,length=172810>
##contig=<ID=chr14_KI270726v1_random,length=43739>
##contig=<ID=chr15_KI270727v1_random,length=448248>
##contig=<ID=chr16_KI270728v1_random,length=1872759>
##contig=<ID=chr17_GL000205v2_random,length=185591>
##contig=<ID=chr17_KI270729v1_random,length=280839>
##contig=<ID=chr17_KI270730v1_random,length=112551>
##contig=<ID=chr22_KI270731v1_random,length=150754>
##contig=<ID=chr22_KI270732v1_random,length=41543>
##contig=<ID=chr22_KI270733v1_random,length=179772>
##contig=<ID=chr22_KI270734v1_random,length=165050>
##contig=<ID=chr22_KI270735v1_random,length=42811>
##contig=<ID=chr22_KI270736v1_random,length=181920>
##contig=<ID=chr22_KI270737v1_random,length=103838>
##contig=<ID=chr22_KI270738v1_random,length=99375>
##contig=<ID=chr22_KI270739v1_random,length=73985>
##contig=<ID=chrY_KI270740v1_random,length=37240>
##contig=<ID=chrUn_KI270302v1,length=2274>
##contig=<ID=chrUn_KI270304v1,length=2165>
##contig=<ID=chrUn_KI270303v1,length=1942>
##contig=<ID=chrUn_KI270305v1,length=1472>
##contig=<ID=chrUn_KI270322v1,length=21476>
##contig=<ID=chrUn_KI270320v1,length=4416>
##contig=<ID=chrUn_KI270310v1,length=1201>
##contig=<ID=chrUn_KI270316v1,length=1444>
##contig=<ID=chrUn_KI270315v1,length=2276>
##contig=<ID=chrUn_KI270312v1,length=998>
##contig=<ID=chrUn_KI270311v1,length=12399>
##contig=<ID=chrUn_KI270317v1,length=37690>
##contig=<ID=chrUn_KI270412v1,length=1179>
##contig=<ID=chrUn_KI270411v1,length=2646>
##contig=<ID=chrUn_KI270414v1,length=2489>
##contig=<ID=chrUn_KI270419v1,length=1029>
##contig=<ID=chrUn_KI270418v1,length=2145>
##contig=<ID=chrUn_KI270420v1,length=2321>
##contig=<ID=chrUn_KI270424v1,length=2140>
##contig=<ID=chrUn_KI270417v1,length=2043>
##contig=<ID=chrUn_KI270422v1,length=1445>
##contig=<ID=chrUn_KI270423v1,length=981>
##contig=<ID=chrUn_KI270425v1,length=1884>
##contig=<ID=chrUn_KI270429v1,length=1361>
##contig=<ID=chrUn_KI270442v1,length=392061>
##contig=<ID=chrUn_KI270466v1,length=1233>
##contig=<ID=chrUn_KI270465v1,length=1774>
##contig=<ID=chrUn_KI270467v1,length=3920>
##contig=<ID=chrUn_KI270435v1,length=92983>
##contig=<ID=chrUn_KI270438v1,length=112505>
##contig=<ID=chrUn_KI270468v1,length=4055>
##contig=<ID=chrUn_KI270510v1,length=2415>
##contig=<ID=chrUn_KI270509v1,length=2318>
##contig=<ID=chrUn_KI270518v1,length=2186>
##contig=<ID=chrUn_KI270508v1,length=1951>
##contig=<ID=chrUn_KI270516v1,length=1300>
##contig=<ID=chrUn_KI270512v1,length=22689>
##contig=<ID=chrUn_KI270519v1,length=138126>
##contig=<ID=chrUn_KI270522v1,length=5674>
##contig=<ID=chrUn_KI270511v1,length=8127>
##contig=<ID=chrUn_KI270515v1,length=6361>
##contig=<ID=chrUn_KI270507v1,length=5353>
##contig=<ID=chrUn_KI270517v1,length=3253>
##contig=<ID=chrUn_KI270529v1,length=1899>
##contig=<ID=chrUn_KI270528v1,length=2983>
##contig=<ID=chrUn_KI270530v1,length=2168>
##contig=<ID=chrUn_KI270539v1,length=993>
##contig=<ID=chrUn_KI270538v1,length=91309>
##contig=<ID=chrUn_KI270544v1,length=1202>
##contig=<ID=chrUn_KI270548v1,length=1599>
##contig=<ID=chrUn_KI270583v1,length=1400>
##contig=<ID=chrUn_KI270587v1,length=2969>
##contig=<ID=chrUn_KI270580v1,length=1553>
##contig=<ID=chrUn_KI270581v1,length=7046>
##contig=<ID=chrUn_KI270579v1,length=31033>
##contig=<ID=chrUn_KI270589v1,length=44474>
##contig=<ID=chrUn_KI270590v1,length=4685>
##contig=<ID=chrUn_KI270584v1,length=4513>
##contig=<ID=chrUn_KI270582v1,length=6504>
##contig=<ID=chrUn_KI270588v1,length=6158>
##contig=<ID=chrUn_KI270593v1,length=3041>
##contig=<ID=chrUn_KI270591v1,length=5796>
##contig=<ID=chrUn_KI270330v1,length=1652>
##contig=<ID=chrUn_KI270329v1,length=1040>
##contig=<ID=chrUn_KI270334v1,length=1368>
##contig=<ID=chrUn_KI270333v1,length=2699>
##contig=<ID=chrUn_KI270335v1,length=1048>
##contig=<ID=chrUn_KI270338v1,length=1428>
##contig=<ID=chrUn_KI270340v1,length=1428>
##contig=<ID=chrUn_KI270336v1,length=1026>
##contig=<ID=chrUn_KI270337v1,length=1121>
##contig=<ID=chrUn_KI270363v1,length=1803>
##contig=<ID=chrUn_KI270364v1,length=2855>
##contig=<ID=chrUn_KI270362v1,length=3530>
##contig=<ID=chrUn_KI270366v1,length=8320>
##contig=<ID=chrUn_KI270378v1,length=1048>
##contig=<ID=chrUn_KI270379v1,length=1045>
##contig=<ID=chrUn_KI270389v1,length=1298>
##contig=<ID=chrUn_KI270390v1,length=2387>
##contig=<ID=chrUn_KI270387v1,length=1537>
##contig=<ID=chrUn_KI270395v1,length=1143>
##contig=<ID=chrUn_KI270396v1,length=1880>
##contig=<ID=chrUn_KI270388v1,length=1216>
##contig=<ID=chrUn_KI270394v1,length=970>
##contig=<ID=chrUn_KI270386v1,length=1788>
##contig=<ID=chrUn_KI270391v1,length=1484>
##contig=<ID=chrUn_KI270383v1,length=1750>
##contig=<ID=chrUn_KI270393v1,length=1308>
##contig=<ID=chrUn_KI270384v1,length=1658>
##contig=<ID=chrUn_KI270392v1,length=971>
##contig=<ID=chrUn_KI270381v1,length=1930>
##contig=<ID=chrUn_KI270385v1,length=990>
##contig=<ID=chrUn_KI270382v1,length=4215>
##contig=<ID=chrUn_KI270376v1,length=1136>
##contig=<ID=chrUn_KI270374v1,length=2656>
##contig=<ID=chrUn_KI270372v1,length=1650>
##contig=<ID=chrUn_KI270373v1,length=1451>
##contig=<ID=chrUn_KI270375v1,length=2378>
##contig=<ID=chrUn_KI270371v1,length=2805>
##contig=<ID=chrUn_KI270448v1,length=7992>
##contig=<ID=chrUn_KI270521v1,length=7642>
##contig=<ID=chrUn_GL000195v1,length=182896>
##contig=<ID=chrUn_GL000219v1,length=179198>
##contig=<ID=chrUn_GL000220v1,length=161802>
##contig=<ID=chrUn_GL000224v1,length=179693>
##contig=<ID=chrUn_KI270741v1,length=157432>
##contig=<ID=chrUn_GL000226v1,length=15008>
##contig=<ID=chrUn_GL000213v1,length=164239>
##contig=<ID=chrUn_KI270743v1,length=210658>
##contig=<ID=chrUn_KI270744v1,length=168472>
##contig=<ID=chrUn_KI270745v1,length=41891>
##contig=<ID=chrUn_KI270746v1,length=66486>
##contig=<ID=chrUn_KI270747v1,length=198735>
##contig=<ID=chrUn_KI270748v1,length=93321>
##contig=<ID=chrUn_KI270749v1,length=158759>
##contig=<ID=chrUn_KI270750v1,length=148850>
##contig=<ID=chrUn_KI270751v1,length=150742>
##contig=<ID=chrUn_KI270752v1,length=27745>
##contig=<ID=chrUn_KI270753v1,length=62944>
##contig=<ID=chrUn_KI270754v1,length=40191>
##contig=<ID=chrUn_KI270755v1,length=36723>
##contig=<ID=chrUn_KI270756v1,length=79590>
##contig=<ID=chrUn_KI270757v1,length=71251>
##contig=<ID=chrUn_GL000214v1,length=137718>
##contig=<ID=chrUn_KI270742v1,length=186739>
##contig=<ID=chrUn_GL000216v2,length=176608>
##contig=<ID=chrUn_GL000218v1,length=161147>
##contig=<ID=chrEBV,length=171823>
##SnpEffVersion="4.3k (build 2017-03-29 17:16), by Pablo Cingolani"
##SnpEffCmd="SnpEff  -i vcf -o vcf -csvStats SVariants/ensemble/Sample_ID/Sample_ID.metasv.snpeff.vcf.stats.csv -stats SVariants/ensemble/Sample_ID/Sample_ID.metasv.snpeff.vcf.stats.html hg38 SVariants/ensemble/Sample_ID/variants.vcf.gz "
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO'">
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
##bcftools_viewVersion=1.11+htslib-1.11
##bcftools_viewCommand=view -h Sample_ID.metasv.snpeff.vcf; Date=Wed Nov 29 07:40:58 2023
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample_ID

Hopefully, this will help somewhat. From looking at the header, it seems that there are none of the parsed-for fields in the header such as AD, DR, and DV, however, the SUPPORT key is being identified for some reason. I was under the impression that if a VCF header did not have these fields, the variant would pass the filter and be considered in prioritization. Is this assessment incorrect?

Lastly, I wanted to show more concretely the nature of the problem. The HTML report for a single sample:

Screenshot from 2023-11-29 10-28-32

Note that then it goes directly onto the "About" section, skipping prioritization. The TSV output (first 30 variants) gives these values for the "failed filters" and "psv" columns:

| failed_filters | psv              |
|----------------|------------------|
| freq           | 124.23           |
| filter;freq    | 90.349090909091  |
| filter         | 68.6754919274339 |
| freq           | 15.4858138741546 |
| filter         | 13.4825941874319 |
| freq           | 10               |
| freq           | 10               |
| freq           | 9                |
| freq           | 8.77601591638016 |
| freq           | 8.68741258741259 |
| filter         | 8.1964108454124  |
|                | 8                |
| freq           | 7                |
| filter         | 6.49414088852115 |
| filter         | 6.33840289157163 |
| filter;freq    | 6                |
|                | 6                |
| filter         | 6                |
| freq           | 5.71270338011424 |
| filter         | 5.65235240172232 |
| filter         | 5.64097137351207 |
| filter         | 5.51605484522768 |
| filter         | 5.48579381982279 |
| filter         | 5.4              |
| filter         | 5.20485686756036 |
| filter         | 5.13501421352313 |
| filter         | 5.11592210204163 |
| filter         | 5.0932438155873  |
| filter         | 5                |
|                | 5                |

Since a few have no failed filters, I expected them to be in the HTML output and visualized.

Sorry for the long response!

Thanks, Aidan

ielis commented 10 months ago

Hi @AidanGCr

I think there is an issue on both sides here: in your VCFs and in the code for generating HTML reports in SvAnna.

First, your issue. The SUPPORT field is not declared in the VCF header, yet it seems to be present in many variant lines, as pointed out by your grep run.

The logging statements

Unable to cast `SUPPORT` attribute: [11, 9]

are there because a variant has a SUPPORT INFO field but its value is not a simple int but an int array. SvAnna expects to find the SUPPORT INFO field in VCFs produced by svim SV caller. However, your VCF seems to have an int array there, which is currently unparsable. In result, all variants are treated as if the read depth was unavailable.

This brings me to the problem on SvAnna's side. As of now, the variants that lack read depth are written out into tabular and VCF formats but not into HTML. I think this should actually be changed and SvAnna should write out the variants that passed the filters (implying not failing the coverage filter since a variant cannot fail the coverage filter if the coverage data is N/A).

I think I can fix the HTML issue and push the fix to a GitHub branch. Then I can either send you a distribution ZIP file with the fixed code, or you can build it on your own, if you know how to do that.

Second, do you actually want to filter based on the read depth?

Please let me know, thanks!

ielis commented 7 months ago

Hey @AidanGCr pls reopen if this is still an issue.