monarch-initiative / SvAnna

Efficient and accurate pathogenicity prediction for coding and regulatory structural variants in long-read genome sequencing
32 stars 4 forks source link

Compatibility with phased assembly variant caller? #237

Open jessmewald opened 10 months ago

jessmewald commented 10 months ago

Hi there, thanks for developing such an easy to use tool. We would like to run SvAnna with VCF outputs from PAV. It seems that SvAnna is truncating the vcfs in a way that is not obvious. We are running SvAnna 1.0.3 with PAV 2.2.4 (also developed at JAX).

We've tried a variety of PAV VCFs, and the SvAnna error is consistent among them. We get the following error:

17:36:31.652 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Using 6 phenotype features supplied via CLI
17:36:31.658 [main] INFO  o.m.svanna.cli.cmd.SvAnnaCommand - Spooling up SvAnna v1.0.3 using resources in /svanna_db_2304_hg38
17:36:41.046 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Reading variants from `/projects/clia/clia-LRS/scripts_jme/test/vcf/pav_NA12878.vcf.gz`
17:36:41.114 [main] WARN  o.m.svanna.io.parse.VcfVariantParser - Invalid VCF record: `Line 185: there aren't enough columns for line chr1   183407  chr1-183407-SNV-A-G A   G   .   .   ID=chr1-183407-SNV-A-G;SVTYPE=SNV;TIG (we expected 9 tokens, and saw 8 )`: `chr1    183407  chr1-183407-SNV-A-G A   G   .   .   ID=chr1-183407-SNV-A-G;SVTYPE=SNV;TIG`
17:36:41.115 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Read 185 variants
17:36:41.115 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Filtering out the variants with reciprocal overlap >80.0% occurring in more than 1.0% probands
17:36:41.115 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Filtering out the variants where ALT allele is supported by less than 3 reads
17:36:48.016 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Prioritizing 185 variants on 2 threads
17:36:48.076 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Prioritization finished in 0m 0s (57 ms) processing on average 3,245.61 items/s
17:36:48.077 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Writing out the results
17:36:48.087 [main] INFO  o.m.s.c.writer.html.HtmlResultWriter - Writing HTML results to /projects/clia/clia-LRS/scripts_jme/test/svanna/pav_NA12878_hg38_pbmm2_svanna.html
17:36:48.410 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - We're done, bye!

For every VCF generated with PAV, SvAnna will stop reading the file at a line that it reports to not have enough columns. In all cases, the line it fails to read does contain the same number of columns, in the same format, as every other line.

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20230724
##source=PAV 2.2.4
##reference=file://hg38.fa
##contig=<ID=chr1,length=248956422,md5=2648ae1bacce4ec4b6cf337dcae37816>
##contig=<ID=chr10,length=133797422,md5=907112d17fcb73bcab1ed1c72b97ce68>
##contig=<ID=chr11,length=135086622,md5=1511375dc2dd1b633af8cf439ae90cec>
##contig=<ID=chr11_KI270721v1_random,length=100316,md5=9654b5d3f36845bb9d19a6dbd15d2f22>
##contig=<ID=chr12,length=133275309,md5=e81e16d3f44337034695a29b97708fce>
##contig=<ID=chr13,length=114364328,md5=17dab79b963ccd8e7377cef59a54fe1c>
##contig=<ID=chr14,length=107043718,md5=acbd9552c059d9b403e75ed26c1ce5bc>
##contig=<ID=chr14_GL000009v2_random,length=201709,md5=862f555045546733591ff7ab15bcecbe>
##contig=<ID=chr14_GL000225v1_random,length=211173,md5=63945c3e6962f28ffd469719a747e73c>
##contig=<ID=chr14_KI270722v1_random,length=194050,md5=51f46c9093929e6edc3b4dfd50d803fc>
##contig=<ID=chr14_GL000194v1_random,length=191469,md5=6ac8f815bf8e845bb3031b73f812c012>
##contig=<ID=chr14_KI270723v1_random,length=38115,md5=74a4b480675592095fb0c577c515b5df>
##contig=<ID=chr14_KI270724v1_random,length=39555,md5=c3fcb15dddf45f91ef7d94e2623ce13b>
##contig=<ID=chr14_KI270725v1_random,length=172810,md5=edc6402e58396b90b8738a5e37bf773d>
##contig=<ID=chr14_KI270726v1_random,length=43739,md5=fbe54a3197e2b469ccb2f4b161cfbe86>
##contig=<ID=chr15,length=101991189,md5=f036bd11158407596ca6bf3581454706>
##contig=<ID=chr15_KI270727v1_random,length=448248,md5=84fe18a7bf03f3b7fc76cbac8eb583f1>
##contig=<ID=chr16,length=90338345,md5=24e7cabfba3548a2bb4dff582b9ee870>
##contig=<ID=chr16_KI270728v1_random,length=1872759,md5=369ff74cf36683b3066a2ca929d9c40d>
##contig=<ID=chr17,length=83257441,md5=a8499ca51d6fb77332c2d242923994eb>
##contig=<ID=chr17_GL000205v2_random,length=185591,md5=458e71cd53dd1df4083dc7983a6c82c4>
##contig=<ID=chr17_KI270729v1_random,length=280839,md5=2756f6ee4f5780acce31e995443508b6>
##contig=<ID=chr17_KI270730v1_random,length=112551,md5=48f98ede8e28a06d241ab2e946c15e07>
##contig=<ID=chr18,length=80373285,md5=11eeaa801f6b0e2e36a1138616b8ee9a>
##contig=<ID=chr19,length=58617616,md5=b0eba2c7bb5c953d1e06a508b5e487de>
##contig=<ID=chr1_KI270706v1_random,length=175055,md5=62def1a794b3e18192863d187af956e6>
##contig=<ID=chr1_KI270707v1_random,length=32032,md5=78135804eb15220565483b7cdd02f3be>
##contig=<ID=chr1_KI270708v1_random,length=127682,md5=1e95e047b98ed92148dd84d6c037158c>
##contig=<ID=chr1_KI270709v1_random,length=66860,md5=4e2db2933ea96aee8dab54af60ecb37d>
##contig=<ID=chr1_KI270710v1_random,length=40176,md5=9949f776680c6214512ee738ac5da289>
.........(header lines removed for brevity).........
##INFO=<ID=ID,Number=1,Type=String,Description="Variant ID">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Variant type">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Variant length">
##INFO=<ID=TIG_REGION,Number=.,Type=String,Description="Contig region where variant was found (one per alt with h1 before h2 for homozygous calls)">
##INFO=<ID=QUERY_STRAND,Number=.,Type=String,Description="Strand of variant in the contig relative to the reference (order follows TIG_REGION)">
##INFO=<ID=INNER_REF,Number=.,Type=String,Description="Inversion inner breakpoint in reference coordinates (order follows TIG_REGION)">
##INFO=<ID=INNER_TIG,Number=.,Type=String,Description="Inversion inner breakpoint in contig coordinates (order follows TIG_REGION)">
##INFO=<ID=HOM_REF,Number=.,Type=String,Description="Perfect breakpoint homology (SV sequence vs reference). Format 'X,Y' where X homology upstream, and Y is homology downstream. Homology vs reference is often better for DEL.">
##INFO=<ID=HOM_TIG,Number=.,Type=String,Description="Perfect breakpoint homology (SV sequence vs contig). Format 'X,Y' where X homology upstream, and Y is homology downstream. Homology vs contig is often better for INS.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.9+htslib-1.9
##bcftools_viewCommand=view pav_NA12878.vcf.gz; Date=Tue Nov 21 14:56:37 2023
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12878
chr1    182946  chr1-182946-SNV-G-A     G       A       .       .       ID=chr1-182946-SNV-G-A;SVTYPE=SNV;TIG_REGION=h1tg003033l:10814-10814;QUERY_STRAND=+     GT      1|.
chr1    182952  chr1-182952-SNV-A-G     A       G       .       .       ID=chr1-182952-SNV-A-G;SVTYPE=SNV;TIG_REGION=h1tg003033l:10820-10820;QUERY_STRAND=+     GT      1|.
chr1    183025  chr1-183025-SNV-C-G     C       G       .       .       ID=chr1-183025-SNV-C-G;SVTYPE=SNV;TIG_REGION=h1tg003033l:10893-10893;QUERY_STRAND=+     GT      1|.
chr1    183256  chr1-183256-SNV-A-G     A       G       .       .       ID=chr1-183256-SNV-A-G;SVTYPE=SNV;TIG_REGION=h1tg003033l:11124-11124;QUERY_STRAND=+     GT      1|.
chr1    183258  chr1-183258-SNV-A-G     A       G       .       .       ID=chr1-183258-SNV-A-G;SVTYPE=SNV;TIG_REGION=h1tg003033l:11126-11126;QUERY_STRAND=+     GT      1|.
chr1    183259  chr1-183259-SNV-G-A     G       A       .       .       ID=chr1-183259-SNV-G-A;SVTYPE=SNV;TIG_REGION=h1tg003033l:11127-11127;QUERY_STRAND=+     GT      1|.
chr1    183291  chr1-183291-SNV-A-G     A       G       .       .       ID=chr1-183291-SNV-A-G;SVTYPE=SNV;TIG_REGION=h1tg003033l:11159-11159;QUERY_STRAND=+     GT      1|.
chr1    183293  chr1-183293-SNV-G-T     G       T       .       .       ID=chr1-183293-SNV-G-T;SVTYPE=SNV;TIG_REGION=h1tg003033l:11161-11161;QUERY_STRAND=+     GT      1|.
chr1    183302  chr1-183302-SNV-A-G     A       G       .       .       ID=chr1-183302-SNV-A-G;SVTYPE=SNV;TIG_REGION=h1tg003033l:11170-11170;QUERY_STRAND=+     GT      1|.
chr1    183318  chr1-183318-SNV-A-G     A       G       .       .       ID=chr1-183318-SNV-A-G;SVTYPE=SNV;TIG_REGION=h1tg003033l:11186-11186;QUERY_STRAND=+     GT      1|.
chr1    183326  chr1-183326-SNV-C-G     C       G       .       .       ID=chr1-183326-SNV-C-G;SVTYPE=SNV;TIG_REGION=h1tg003033l:11194-11194;QUERY_STRAND=+     GT      1|.
chr1    183374  chr1-183374-SNV-C-G     C       G       .       .       ID=chr1-183374-SNV-C-G;SVTYPE=SNV;TIG_REGION=h1tg003033l:11242-11242;QUERY_STRAND=+     GT      1|.
chr1    183407  chr1-183407-SNV-A-G     A       G       .       .       ID=chr1-183407-SNV-A-G;SVTYPE=SNV;TIG_REGION=h1tg003033l:11275-11275;QUERY_STRAND=+     GT      1|.
chr1    183458  chr1-183458-SNV-C-G     C       G       .       .       ID=chr1-183458-SNV-C-G;SVTYPE=SNV;TIG_REGION=h1tg003033l:11326-11326;QUERY_STRAND=+     GT      1|.
chr1    183467  chr1-183467-SNV-T-C     T       C       .       .       ID=chr1-183467-SNV-T-C;SVTYPE=SNV;TIG_REGION=h1tg003033l:11335-11335;QUERY_STRAND=+     GT      1|.
chr1    183598  chr1-183598-SNV-C-G     C       G       .       .       ID=chr1-183598-SNV-C-G;SVTYPE=SNV;TIG_REGION=h1tg003033l:11466-11466;QUERY_STRAND=+     GT      1|.
chr1    183640  chr1-183640-SNV-T-C     T       C       .       .       ID=chr1-183640-SNV-T-C;SVTYPE=SNV;TIG_REGION=h1tg003033l:11508-11508;QUERY_STRAND=+     GT      1|.
chr1    183657  chr1-183657-SNV-T-C     T       C       .       .       ID=chr1-183657-SNV-T-C;SVTYPE=SNV;TIG_REGION=h1tg003033l:11525-11525;QUERY_STRAND=+     GT      1|.
chr1    183677  chr1-183677-SNV-G-A     G       A       .       .       ID=chr1-183677-SNV-G-A;SVTYPE=SNV;TIG_REGION=h1tg003033l:11545-11545;QUERY_STRAND=+     GT      1|.
chr1    183679  chr1-183679-SNV-A-G     A       G       .       .       ID=chr1-183679-SNV-A-G;SVTYPE=SNV;TIG_REGION=h1tg003033l:11547-11547;QUERY_STRAND=+     GT      1|.
chr1    183821  chr1-183821-SNV-T-G     T       G       .       .       ID=chr1-183821-SNV-T-G;SVTYPE=SNV;TIG_REGION=h1tg003033l:11689-11689;QUERY_STRAND=+     GT      1|.
chr1    183880  chr1-183880-SNV-G-A     G       A       .       .       ID=chr1-183880-SNV-G-A;SVTYPE=SNV;TIG_REGION=h1tg003033l:11748-11748;QUERY_STRAND=+     GT      1|.
chr1    183919  chr1-183919-SNV-T-G     T       G       .       .       ID=chr1-183919-SNV-T-G;SVTYPE=SNV;TIG_REGION=h1tg003033l:11787-11787;QUERY_STRAND=+     GT      1|.
chr1    183943  chr1-183943-SNV-A-G     A       G       .       .       ID=chr1-183943-SNV-A-G;SVTYPE=SNV;TIG_REGION=h1tg003033l:11811-11811;QUERY_STRAND=+     GT      1|.
chr1    184027  chr1-184027-SNV-T-C     T       C       .       .       ID=chr1-184027-SNV-T-C;SVTYPE=SNV;TIG_REGION=h1tg003033l:11895-11895;QUERY_STRAND=+     GT      1|.
chr1    184175  chr1-184176-INS-2       C       CAG     .       .       ID=chr1-184176-INS-2;SVTYPE=INS;SVLEN=2;TIG_REGION=h1tg003033l:12044-12045;QUERY_STRAND=+;HOM_REF=0,0;HOM_TIG=0,0       GT      1|.
chr1    184201  chr1-184201-SNV-C-T     C       T       .       .       ID=chr1-184201-SNV-C-T;SVTYPE=SNV;TIG_REGION=h1tg003033l:12071-12071;QUERY_STRAND=+     GT      1|.

Any help you can offer to trouble shoot this error would be greatly appreciated! Thanks for your time.

ielis commented 10 months ago

Hi @jessmewald

this type of I/O-related issues are strange because SvAnna uses HtsJDK library to read VCF file, and HtsJDK is developed by the authors/maintainers of VCF format. So, this looks like an issue with the input VCF file on the first glance. Are you 100% sure the white spaces are consistent? Sometimes a VCF file contains spaces (`) instead of TAB characters (\t`).

Some text editors have an option for showing white space characters. Then, a VCF file can look like this: image

Note presence of small gray dots between the columns

A valid VCF should use \t character (ASCII 09) as a column delimiter: image

Note the presence of the gray arrows instead of dots

Can you please verify that your VCF, especially the offending line, uses the right column delimiters?

jessmewald commented 10 months ago

Thanks for your response! The offending line, and all other lines, seem to be tab delimited appropriately.

image

ielis commented 10 months ago

Yes, the lines on the screenshot look OK.

Can you please share the file /projects/clia/clia-LRS/scripts_jme/test/vcf/pav_NA12878.vcf.gz? I'd like to replicate the bug on my side and then dig dipper. It looks like a test file, so, hopefully, there are not privacy concerns here..

jessmewald commented 10 months ago

Heres a link to the file on onedrive. Let me know if you can't access.

ielis commented 10 months ago

Hi @jessmewald I think this is not related to the source of the VCF (PAV) but it's an issue of VCF compression. The file you shared worked OK when uncompressed.

I made a new release with a fix. Can you please try it out and let me know if you run into any issues?

Thanks a lot and all the best.. :)

jessmewald commented 10 months ago

I wanted to update you on what I have learned so far. The files are no longer truncated, and the processing proceeds as I would expect. There are 80 or so warnings during processing of 5 million alleles. I've included an example of each warning below, although I don't think these warnings are driving our inability to produce a result.

21:06:45.264 [svanna-worker-1] WARN  o.m.s.c.p.additive.Projections - Unexpected query `GeneDefault{id=GeneIdentifierDefault{accession='ENSG00000247746.5', symbol='USP51', hgncId='HGNC:23086', ncbiGeneId='null'}, location=GenomicRegion{contig=23, strand=-, coordinateSystem=ZERO_BASED, start=100551047, end=100556280}, transcripts=[CodingTranscriptDefault{id=TranscriptIdentifierDefault{accession='ENST00000500968.4', symbol='USP51-201', ccdsId='CCDS14370.1'}, location=GenomicRegion{contig=23, strand=-, coordinateSystem=ZERO_BASED, start=100551047, end=100556280}, exons=[Coordinates{coordinateSystem=ZERO_BASED, start=100551047, end=100551120}, Coordinates{coordinateSystem=ZERO_BASED, start=100551530, end=100551727}, Coordinates{coordinateSystem=ZERO_BASED, start=100551907, end=100556280}], cdsCoordinates=Coordinates{coordinateSystem=ZERO_BASED, start=100551956, end=100554092}}, TranscriptDefault{id=TranscriptIdentifierDefault{accession='ENST00000586165.1', symbol='USP51-202', ccdsId='null'}, location=GenomicRegion{contig=23, strand=-, coordinateSystem=ZERO_BASED, start=100552078, end=100554092}, exons=[Coordinates{coordinateSystem=ZERO_BASED, start=100552078, end=100552202}, Coordinates{coordinateSystem=ZERO_BASED, start=100552925, end=100554092}]}]}`
21:23:21.411 [svanna-worker-1] WARN  o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 306
21:24:33.847 [svanna-worker-2] WARN  o.m.s.c.p.additive.Projections - Unexpected end event `SNV`

I've tried to process several samples, including cases with legitimate HPO terms that we would expect to see variants associated with. These same samples do produce prioritized variants when annotated with Exomiser. Neither the legitimate cases nor NA12878 and NA24385 with forced HPO terms produce any results. All variants result as "Low Alt Allele Count". image

I tried to adjust --min-read-support to see if that allowed any variants to pass, but it the did not change the results. My suspicion is that the field SvAnna tries to parse is missing from the PAV vcfs (perhaps DP or AD?). I did set the read support to zero to try to force it, but it did not produce a result. Thank you for your time and thoughts on this!

ielis commented 10 months ago

Hi @jessmewald thanks a lot for the update.

I think you're right about the DP and AD fields and this is actually a bug on SvAnna's side, where a variants without coverage information don't make it into the HTML report. This happens even if you tweak --min-read-support option.

I think I have a fix and I pushed the code to report-unfiltered-variants-in-html branch of the repo. Do you think you can test if it fixes the issue?

You would need to build SvAnna from sources, as described in the docs here with a little difference. You would need to switch the branch before building (see the extra line below).

So, something like this should work:

git clone https://github.com/TheJacksonLaboratory/SvAnna
cd SvAnna
git checkout report-unfiltered-variants-in-html && git pull  # <-- the extra line
./mvnw package

After the build, you should get a distribution ZIP in the svanna-cli/target folder which you can use to see if the problem persists. I expect that the patched code will place more most of the Low alt allele count variants into Pass.

Regarding the other issues - by default, issues with weird variants or genes are logged and the variants are ignored, in order to finish the analysis. It's hard to know how serious these issues are without additional context. Please let me know if you suspect something odd is going on.

jessmewald commented 10 months ago

Thanks Daniel! We are now able to annotate and prioritize the PAV VCFs with SvAnna using the updated code in the report-unfiltered-variants-in-html branch.