samtools / htslib

C library for high-throughput sequencing data formats

Filter line always added to the VCF header when opening a file for writing #1782

Closed nh13 closed 1 month ago

nh13 commented 1 month ago

See title.

This comes from using pysam:

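from pysam import VariantFile, VariantHeader

# Open a VCF for writing with an otherwise empty header, then close it.
vcf_header = VariantHeader()
vcf = VariantFile("test.vcf", "w", header=vcf_header)
vcf.close()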

I would have expected the header to contain the fileformat line and nothing else.

But we always get the following filter line:

##FILTER=<ID=PASS,Description="All filters passed">

My best guess is that it comes from bcf_hdr_init.

The following is a work-around:

vcf_header = VariantHeader()
vcf = VariantFile("test.vcf", "w", header=vcf_header)
vcf.header.filters.remove_header("PASS")
vcf.close()

Any reason why we always need to add this filter (the ship may have sailed, but curiosity killed the cat)?

jmarshall commented 1 month ago

I believe this is the HTSlib implementation's way of ensuring BCF's requirement that PASS be encoded as 0. (See the VCF spec §6.2.1, “Dictionary of strings”.)
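To illustrate the point about the dictionary of strings, here is a minimal pure-Python sketch of how FILTER/INFO/FORMAT IDs could be mapped to integer indices with PASS fixed at 0. It is not HTSlib's code: the function name build_string_dictionary and the sample header lines are made up for illustration, and it assumes indices follow order of appearance while ignoring any IDX fields in the header.

```python
import re

PASS_ID = "PASS"
# Captures the ID of FILTER, INFO and FORMAT header lines.
_ID_LINE = re.compile(r"^##(?:FILTER|INFO|FORMAT)=<ID=([^,>]+)")


def build_string_dictionary(header_lines):
    """Map each FILTER/INFO/FORMAT ID to an integer index, with PASS at 0.

    Hypothetical helper for illustration only; IDX fields are ignored.
    """
    dictionary = {PASS_ID: 0}  # PASS is always the first entry
    for line in header_lines:
        match = _ID_LINE.match(line)
        if match:
            # setdefault keeps the existing index if the ID was already seen
            dictionary.setdefault(match.group(1), len(dictionary))
    return dictionary


# Hypothetical header without an explicit PASS line.
header = [
    "##fileformat=VCFv4.2",
    '##FILTER=<ID=q10,Description="Quality below 10">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
]
indices = build_string_dictionary(header)
print(indices)  # {'PASS': 0, 'q10': 1, 'DP': 2}
```

PASS gets index 0 whether or not the header declares it, which is why a writer that records the line explicitly never disagrees with a reader that assumes it.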

pd3 commented 1 month ago

That is exactly right. HTSlib makes the implicit PASS filter explicit to prevent problems, so this is working as intended.
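As a rough illustration of what "making the implicit PASS filter explicit" amounts to, the sketch below adds the PASS line to a list of header lines when it is missing. It is an assumption-laden stand-in, not HTSlib's implementation: ensure_pass_filter is a hypothetical helper, and placing the line directly after the fileformat line is a choice made here for the example.

```python
PASS_LINE = '##FILTER=<ID=PASS,Description="All filters passed">'


def ensure_pass_filter(header_lines):
    """Return a copy of header_lines that contains an explicit PASS FILTER line.

    Hypothetical helper for illustration; not part of HTSlib or pysam.
    """
    result = list(header_lines)
    if any(line.startswith("##FILTER=<ID=PASS,") for line in result):
        return result  # already explicit, nothing to add
    # Assumption: put PASS right after the fileformat line when there is one.
    position = 1 if result and result[0].startswith("##fileformat=") else 0
    result.insert(position, PASS_LINE)
    return result


explicit = ensure_pass_filter(["##fileformat=VCFv4.2"])
print("\n".join(explicit))
```

Calling the helper on its own output changes nothing, so the line is added at most once.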

nh13 commented 1 month ago

Thank you, this is super helpful, especially the link to §6.2.1!