monarch-initiative / SvAnna

Efficient and accurate pathogenicity prediction for coding and regulatory structural variants in long-read genome sequencing
34 stars 4 forks source link

Read 0 Variants from bgzip VCF (bgzip version >= 1.17 or bcftools version >= 1.16) #235

Closed james-lawlor closed 1 year ago

james-lawlor commented 1 year ago

Hi, I've noticed a problem where bgzip-compressed input files created with certain versions of htslib/bgzip or bcftools cause SvAnna to read 0 variants (and produce output files with 0 variants).

For example, using the example.vcf file included with SvAnna, sorting and compressing with bcftools v1.15 works as expected, but bcftools v1.16 reads 0 variants. I have also encountered this issue with bgzip v1.17 and v1.18. (bgzip v1.16 works as expected but bcftools v1.16 does not).

Partially, this seems due to the "text mode for bgzip" change implemented here https://github.com/samtools/htslib/releases/tag/1.17 as disabling the bgzip blocks at line break by adding --binary to the bgzip command solves the issue. However, I'm not sure why this also occurs with bcftools compressed output in v1.16.

(I was able to replicate this both with different versions of htslib/bcftools installed via conda and that I downloaded and compiled locally.)

Example:

bcftools v1.15.1 compressed output:

$ bcftools sort -o bcf_15.vcf.gz example.vcf
$ tabix bcf_15.vcf.gz
$ java -jar /cluster/lab/gcooper/software/svanna-cli-1.0.3/svanna-cli-1.0.3.jar prioritize --data-directory /cluster/lab/gcooper/software/svanna-db/2304_hg38_svanna/ -t HP:0008330 --vcf bcf_15.vcf.gz --output-format vcf,tsv,html

   _____       ___
  / ___/_   __/   |  ____  ____  ____ _
  \__ \| | / / /| | / __ \/ __ \/ __ `/
 ___/ /| |/ / ___ |/ / / / / / / /_/ /
/____/ |___/_/  |_/_/ /_/_/ /_/\__,_/

Structural Variant Annotation and Analysis
                          :: v1.0.3 ::

14:39:02.022 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Using 1 phenotype features supplied via CLI
14:39:02.030 [main] INFO  o.m.svanna.cli.cmd.SvAnnaCommand - Spooling up SvAnna v1.0.3 using resources in /cluster/lab/gcooper/software/svanna-db/2304_hg38_svanna
14:39:12.047 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Reading variants from `bcf_15.vcf.gz`
14:39:12.116 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Read 8 variants
14:39:12.116 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Filtering out the variants with reciprocal overlap >80.0% occurring in more than 1.0% probands
14:39:12.116 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Filtering out the variants where ALT allele is supported by less than 3 reads
14:39:18.442 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Prioritizing 8 variants on 2 threads
14:39:18.559 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Prioritization finished in 0m 0s (115 ms) processing on average 69.57 items/s
14:39:18.559 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Writing out the results
14:39:18.562 [main] INFO  o.m.s.c.w.t.TabularResultWriter - Writing tabular results into /cluster/home/jlawlor/test_svanna/bcf_15.SVANNA.tsv.gz
14:39:18.602 [main] INFO  o.m.s.c.writer.html.HtmlResultWriter - Writing HTML results to /cluster/home/jlawlor/test_svanna/bcf_15.SVANNA.html
14:39:19.305 [main] INFO  o.m.s.cli.writer.vcf.VcfResultWriter - Writing VCF results into /cluster/home/jlawlor/test_svanna/bcf_15.SVANNA.vcf.gz
14:39:19.333 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - We're done, bye!

bcftools v1.16 compressed output:


$ bcftools sort -o bcf_16.vcf.gz example.vcf
$ tabix bcf_16.vcf.gz
$ java -jar /cluster/lab/gcooper/software/svanna-cli-1.0.3/svanna-cli-1.0.3.jar prioritize --data-directory /cluster/lab/gcooper/software/svanna-db/2304_hg38_svanna/ -t HP:0008330 --vcf bcf_16.vcf.gz --output-format vcf,tsv,html

   _____       ___
  / ___/_   __/   |  ____  ____  ____ _
  \__ \| | / / /| | / __ \/ __ \/ __ `/
 ___/ /| |/ / ___ |/ / / / / / / /_/ /
/____/ |___/_/  |_/_/ /_/_/ /_/\__,_/

Structural Variant Annotation and Analysis
                          :: v1.0.3 ::

14:35:49.175 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Using 1 phenotype features supplied via CLI
14:35:49.183 [main] INFO  o.m.svanna.cli.cmd.SvAnnaCommand - Spooling up SvAnna v1.0.3 using resources in /cluster/lab/gcooper/software/svanna-db/2304_hg38_svanna
14:35:58.690 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Reading variants from `bcf_16.vcf.gz`
14:35:58.747 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Read 0 variants
14:35:58.748 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Filtering out the variants with reciprocal overlap >80.0% occurring in more than 1.0% probands
14:35:58.748 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Filtering out the variants where ALT allele is supported by less than 3 reads
14:36:04.356 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Prioritizing 0 variants on 2 threads
14:36:04.360 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Prioritization finished in 0m 0s (3 ms) processing on average 0 items/s
14:36:04.361 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - Writing out the results
14:36:04.364 [main] INFO  o.m.s.c.w.t.TabularResultWriter - Writing tabular results into /cluster/home/jlawlor/test_svanna/bcf_16.SVANNA.tsv.gz
14:36:04.396 [main] INFO  o.m.s.c.writer.html.HtmlResultWriter - Writing HTML results to /cluster/home/jlawlor/test_svanna/bcf_16.SVANNA.html
14:36:04.697 [main] INFO  o.m.s.cli.writer.vcf.VcfResultWriter - Writing VCF results into /cluster/home/jlawlor/test_svanna/bcf_16.SVANNA.vcf.gz
14:36:04.721 [main] INFO  o.m.svanna.cli.cmd.PrioritizeCommand - We're done, bye!
james-lawlor commented 1 year ago

bcf_15.vcf.gz bcf_16.vcf.gz example_16.vcf.gz example_17.vcf.gz example_17.SVANNA.vcf.gz example_16.SVANNA.vcf.gz bcf_16.SVANNA.vcf.gz bcf_15.SVANNA.vcf.gz

Here are some example files and svanna output: bcf_15 and bcf_16 - as above

example_16 - sorted and piped to bgzip v 1.16, was read by svanna example_17 - sorted and piped to bgzip v 1.17, was not read by svanna

ielis commented 1 year ago

Hi @james-lawlor thanks for pointing out the issue in such a detailed way!

First of all, I replicated your issue with the :arrow_up: files on my end. Thanks to your description, I checked if there is anything wrong with the way how SvAnna handles gzipped files and there indeed was an issue.

SvAnna uses Apache commons-compress library to read gzipped VCF files. This is due to some SV callers producing invalid VCF files (at least during the time of main development). I do not fully understand the details of decompression, but it looks like the recent htslib change can be adressed by adjusting an option in commons-compress. After changing the GzipCompressorInputStream to decompressed concatenated files, the example_17.vcf.gz works OK. I will include the changes in the next release.

If the issue is time-sensitive, you can either build the app from source on the dependency-update branch, or I can build you a release ZIP and share here.

Thanks again for reporting the bug and please let me know if there are any other issues.

ielis commented 1 year ago

The issue should be fixed with the latest release v1.0.4. Please reopen if the latest release does not resolve it on your end.

Thanks a lot and all the best.. :)