Open jessmewald opened 10 months ago
Hi @jessmewald
These types of I/O-related issues are strange because SvAnna uses the HtsJDK library to read VCF files, and HtsJDK is developed by the authors/maintainers of the VCF format. So, at first glance, this looks like an issue with the input VCF file. Are you 100% sure the white spaces are consistent? Sometimes a VCF file contains spaces (` `) instead of TAB characters (`\t`).
Some text editors have an option for showing white space characters. Then, a VCF file can look like this:
Note the presence of small gray dots between the columns
A valid VCF should use the `\t` character (ASCII 09) as a column delimiter:
Note the presence of the gray arrows instead of dots
Can you please verify that your VCF, especially the offending line, uses the right column delimiters?
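If an editor with visible white space is not at hand, the same check can be scripted. This is only a minimal sketch and not part of SvAnna: `find_bad_lines` and `open_vcf` are hypothetical helper names, and the only assumption is that every VCF data line has at least the 8 mandatory tab-separated columns.

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip-compressed VCF as text (hypothetical helper)."""
    opener = gzip.open if path.endswith(".gz") else open
    return opener(path, "rt")


def find_bad_lines(lines, min_columns=8):
    """Yield (line_number, n_columns) for data lines that have fewer than
    `min_columns` tab-separated fields. Header lines starting with '#'
    and blank lines are skipped."""
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        n_columns = len(line.rstrip("\n").split("\t"))
        if n_columns < min_columns:
            yield number, n_columns
```

Passing the handle returned by `open_vcf` to `find_bad_lines` and printing the results lists the suspicious lines. A line that is delimited with spaces shows up with a column count of 1.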
Thanks for your response! The offending line, and all other lines, seem to be tab delimited appropriately.
Yes, the lines in the screenshot look OK.
Can you please share the file `/projects/clia/clia-LRS/scripts_jme/test/vcf/pav_NA12878.vcf.gz`? I'd like to replicate the bug on my side and then dig deeper. It looks like a test file, so, hopefully, there are no privacy concerns here.
Here's a link to the file on OneDrive. Let me know if you can't access it.
Hi @jessmewald I think this is not related to the source of the VCF (PAV) but it's an issue of VCF compression. The file you shared worked OK when uncompressed.
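For anyone who wants to reproduce the "works when uncompressed" observation, here is a minimal sketch that decompresses a `.vcf.gz` into a plain `.vcf`. The name `decompress_vcf` is hypothetical. It relies only on Python's standard `gzip` module, which also reads files made of several concatenated gzip members.

```python
import gzip
import shutil


def decompress_vcf(src, dst):
    """Decompress the gzip-compressed VCF at `src` into a plain-text VCF
    at `dst` and return `dst` (hypothetical helper)."""
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as plain:
        # Stream in chunks so large VCFs do not have to fit in memory.
        shutil.copyfileobj(compressed, plain)
    return dst
```

The resulting plain VCF can then be passed to SvAnna in place of the compressed file.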
I made a new release with a fix. Can you please try it out and let me know if you run into any issues?
Thanks a lot and all the best.. :)
I wanted to update you on what I have learned so far. The files are no longer truncated, and the processing proceeds as I would expect. There are 80 or so warnings during processing of 5 million alleles. I've included an example of each warning below, although I don't think these warnings are driving our inability to produce a result.
21:06:45.264 [svanna-worker-1] WARN o.m.s.c.p.additive.Projections - Unexpected query `GeneDefault{id=GeneIdentifierDefault{accession='ENSG00000247746.5', symbol='USP51', hgncId='HGNC:23086', ncbiGeneId='null'}, location=GenomicRegion{contig=23, strand=-, coordinateSystem=ZERO_BASED, start=100551047, end=100556280}, transcripts=[CodingTranscriptDefault{id=TranscriptIdentifierDefault{accession='ENST00000500968.4', symbol='USP51-201', ccdsId='CCDS14370.1'}, location=GenomicRegion{contig=23, strand=-, coordinateSystem=ZERO_BASED, start=100551047, end=100556280}, exons=[Coordinates{coordinateSystem=ZERO_BASED, start=100551047, end=100551120}, Coordinates{coordinateSystem=ZERO_BASED, start=100551530, end=100551727}, Coordinates{coordinateSystem=ZERO_BASED, start=100551907, end=100556280}], cdsCoordinates=Coordinates{coordinateSystem=ZERO_BASED, start=100551956, end=100554092}}, TranscriptDefault{id=TranscriptIdentifierDefault{accession='ENST00000586165.1', symbol='USP51-202', ccdsId='null'}, location=GenomicRegion{contig=23, strand=-, coordinateSystem=ZERO_BASED, start=100552078, end=100554092}, exons=[Coordinates{coordinateSystem=ZERO_BASED, start=100552078, end=100552202}, Coordinates{coordinateSystem=ZERO_BASED, start=100552925, end=100554092}]}]}`
21:23:21.411 [svanna-worker-1] WARN o.m.s.c.p.a.i.GeneSequenceImpactCalculator - Bad insertion with nonzero length 306
21:24:33.847 [svanna-worker-2] WARN o.m.s.c.p.additive.Projections - Unexpected end event `SNV`
I've tried to process several samples, including cases with legitimate HPO terms that we would expect to see variants associated with. These same samples do produce prioritized variants when annotated with Exomiser. Neither the legitimate cases nor NA12878 and NA24385 with forced HPO terms produce any results. All variants result as "Low Alt Allele Count".
I tried to adjust `--min-read-support` to see if that allowed any variants to pass, but it did not change the results. My suspicion is that the field SvAnna tries to parse is missing from the PAV VCFs (perhaps DP or AD?). I did set the read support to zero to try to force it, but it did not produce a result. Thank you for your time and thoughts on this!
Hi @jessmewald thanks a lot for the update.
I think you're right about the `DP` and `AD` fields, and this is actually a bug on SvAnna's side, where variants without coverage information don't make it into the HTML report. This happens even if you tweak the `--min-read-support` option.
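To confirm whether a given VCF carries coverage information at all, one can list the keys used in its FORMAT column. This is a minimal sketch, independent of SvAnna's own parsing; `format_keys` is a hypothetical helper name, and it assumes the standard VCF layout where FORMAT is the 9th tab-separated column with colon-separated keys.

```python
def format_keys(lines):
    """Return the set of FORMAT keys (e.g. GT, DP, AD) seen across all
    data lines. Lines without a FORMAT column contribute nothing."""
    keys = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 8 and fields[8]:
            keys.update(fields[8].split(":"))
    return keys
```

If neither `DP` nor `AD` is in the returned set, the file has no per-sample depth fields, which matches the situation suspected here.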
I think I have a fix and I pushed the code to the `report-unfiltered-variants-in-html` branch of the repo. Do you think you can test if it fixes the issue?
You would need to build SvAnna from sources, as described in the docs, with one small difference: you would need to switch the branch before building (see the extra line below).
So, something like this should work:
git clone https://github.com/TheJacksonLaboratory/SvAnna
cd SvAnna
git checkout report-unfiltered-variants-in-html && git pull # <-- the extra line
./mvnw package
After the build, you should get a distribution ZIP in the `svanna-cli/target` folder, which you can use to see if the problem persists. I expect that the patched code will place most of the `Low alt allele count` variants into `Pass`.
Regarding the other issues - by default, issues with weird variants or genes are logged and the variants are ignored, in order to finish the analysis. It's hard to know how serious these issues are without additional context. Please let me know if you suspect something odd is going on.
Thanks Daniel! We are now able to annotate and prioritize the PAV VCFs with SvAnna using the updated code in the `report-unfiltered-variants-in-html` branch.
Hi there, thanks for developing such an easy-to-use tool. We would like to run SvAnna with VCF outputs from PAV. It seems that SvAnna is truncating the VCFs in a way that is not obvious. We are running SvAnna 1.0.3 with PAV 2.2.4 (also developed at JAX).
We've tried a variety of PAV VCFs, and the SvAnna error is consistent among them. We get the following error:
For every VCF generated with PAV, SvAnna stops reading the file at a line that it reports as not having enough columns. In all cases, the line it fails to read contains the same number of columns, in the same format, as every other line.
Any help you can offer to troubleshoot this error would be greatly appreciated! Thanks for your time.