harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License
63 stars 30 forks source link

bcftools-1.12 loses intervals at the sort_gather_vcfs step #106

Closed osipovarev closed 9 months ago

osipovarev commented 1 year ago

I noticed some of the intervals are missing from my final vcf file and I tracked at which step exactly the intervals are being dropped and it is the sort_gather_vcf step. I tried the bcftools concat command outside the pipeline, and it seems the version of bcftools affects the result:

$ conda activate bcftools_test
$ bcftools -v
bcftools 1.12
$ bcftools concat -D -a -Ou filtered_L0001.vcf.gz filtered_L0796.vcf.gz  | bcftools sort -T  -Oz -o  concat_test.12.vcf  - 
$ wc -l concat_test.10.vcf concat_test.12.vcf
    58440 concat_test.12.vcf

=> NO gap interval in the concat_test.12.vcf

changing bcftools version to 1.10:

$ conda activate broodp
$ bcftools -v
bcftools 1.10
$ bcftools concat -D -a -Ou filtered_L0001.vcf.gz filtered_L0796.vcf.gz  | bcftools sort -T  -Oz -o  concat_test.10.vcf  - 
$ wc -l concat_test.10.vcf
    69081 concat_test.10.vcf

=> YES, the gap interval is in the concat_test.10.vcf

In both cases the log files are identical and have no errors:

Checking the headers and starting positions of 2 files
Writing to -Oz
Merging 1 temporary files
Cleaning
Done

Here I attach two input vcf files and the gap interval is: CM051112 7946416 8384147 filtered_L0796.vcf.gz filtered_L0001.vcf.gz

tsackton commented 1 year ago

This is probably related to https://github.com/samtools/htslib/issues/1623