Open jwimberl opened 2 years ago
It's been pointed out to me immediately that this is expected behavior as the length of the reference means that it overlaps the right flanking region in the third BED file, as opposed to my assumption that the VCF indexing would be strictly on the indicated starting position of the variant.
Correct, VCF indices, like BAM, have overlap calculations rather than point sources, so you can find all variants that overlap a region.
In bcftools, if you use a combination of a broad region (-R
, which uses indices) and a target list (-T
, which is a streaming filter) then the duplicates will be removed. Bcftools also has a --regions-overlap
options to control pos vs overlap decisions, but I've never used it so I don't know how appropriate it is.
Edit: just remembered this is tabix, which likely doesn't have the same options, but bcftools will - obviously - also view VCF files. :-)
I'm not really clear on whether multi-region iterators work in general or just for reads in BAM/CRAM files — e.g., there's an hts_itr_regions()
and a sam_itr_regions()
but no bcf_itr_regions()
or tbx_itr_regions()
and bcftools doesn't appear to use anything of the sort.
Anyway, if they do work in general, then this issue would be solved by tabix
having an option to use multi-region iterators. And if they don't, then perhaps one day they should. :smile:
Thank you both for your helpful suggestions -- using bcftools in place of tabix is an option for me, and I've confirmed that specifying the BED format regions file together with --regions-overlap record
is equivalent to the running the tabix query and removing duplicates.
Using tabix version 1.9 on CentOS 7
I am querying a public indexed dataset (gnomad annotations from the Broad institute) using tabix together with a regions file in BED format, and found that that in some cases a matching row may be returned multiple time. Two cases in which no duplicates of a particular annotation at position 1007743 are returned are when the BED file contains a single region (
region.bed
)or a disjoint region on the left (
region_flL.bed
):However, adding a disjoint region on the right leads to the issue (
region_flR.bed
):The following script reproduces the issue (assuming AWS credentials have been loaded into the appropriate environment variables):
Output: