samtools / htslib

C library for high-throughput sequencing data formats
Other
805 stars 446 forks source link

`bcftools view` with region filter drops variants at POS=0 (telomeres) #1571

Open colin-nolan opened 1 year ago

colin-nolan commented 1 year ago

Given a bgzipped VCF that contains a telomere indicated by position 0 (as per the VCF spec):

$ zcat example.vcf.gz | grep -v '#'
22      0       ID1     C       G       .       PASS    .       GT      1/0
22      1       ID2     C       G       .       PASS    .       GT      1/0

(Disclaimer: I know next to nothing about telomeres, and have really only ended up here due to invalid, non-int positions getting converted to 0, as discussed in https://github.com/samtools/htslib/issues/1570. My example undoubtedly makes no realistic sense!)

A simple bcftools view works as expected:

$ bcftools view -H  example.vcf.gz
22      0       ID1     C       G       .       PASS    .       GT      1/0
22      1       ID2     C       G       .       PASS    .       GT      1/0

Applying a region filter however unexpectidly drops the telomere (albeit with a warning referring to a flag that I don't think exists in this context?):

$ bcftools view -H --regions 22 example.vcf.gz
[W::tbx_parse1] Coordinate <= 0 detected. Did you forget to use the -0 option?
22      1       ID2     C       G       .       PASS    .       GT      1/0
$ echo $?
0

It seems like the telomere should be included with this filter, given its chr; and I haven't encountered anywhere in the docs that indicates the observed behaviour. However, there certainly could be some biological reason why that doesn't make sense!

The way it gets dropped can cause an interesting issue when considred with https://github.com/samtools/htslib/issues/1570, as it means variants where (for whatever reason...) POS gets corruped, get converted to 0 and then dropped when a pipe goes on to have a region filter!

pd3 commented 1 year ago

What is the version of bcftools you are using? I am unable to reproduce with the latest

colin-nolan commented 1 year ago

I'm seeing this with a build of the latest version (BCFtools fe98a6b54536246ce650ad5fde0b75e99bb22fa0; htslib 70df0479771fa28e927c01de4c5ca95f1e63ebba), on Ubuntu 22.04 with:

cd htslib
autoreconf --install
./configure
make

cd ../bcftools
autoreconf
./configure
make

Also the same with 1.17.

$ bcftools view -H --regions 22  example.vcf.gz
[W::tbx_parse1] Coordinate <= 0 detected. Did you forget to use the -0 option?
22      1       ID2     C       G       .       PASS    .       GT      1/0
$ bcftools --version
bcftools 1.17-4-gfe98a6b
Using htslib 1.17-1-g70df047
Copyright (C) 2023 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
pd3 commented 1 year ago

Just to confirm, I was able to reproduce the problem now

$ echo -e "##fileformat=VCFv4.2\n##contig=<ID=22>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n22\t0\tID0\tC\tG\t.\tPASS\t.\n22\t1\tID1\tC\tG\t.\tPASS\t." | bcftools view -o rmme.vcf.gz

$ bcftools index -f rmme.vcf.gz
[W::tbx_parse1] Coordinate <= 0 detected. Did you forget to use the -0 option?

$ bcftools --version
bcftools 1.17-2-gdb8d6573
Using htslib 1.17-1-g70df047

The BCF format works

$ echo -e "##fileformat=VCFv4.2\n##contig=<ID=22>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n22\t0\tID0\tC\tG\t.\tPASS\t.\n22\t1\tID1\tC\tG\t.\tPASS\t." | bcftools view -o rmme.bcf

$ bcftools index -f rmme.bcf
pd3 commented 4 months ago

I don't think this has been completed

colin-nolan commented 4 months ago

I don't think this has been completed

Apologies! I spent a few minutes yesterday follow up my old PRs/issues - but I was clearly too trigger happy here when I saw that commits had been added referencing this ticket.