samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
649 stars 240 forks source link

--regions/--targets does not return overlapping insertion when overlap is set to (1|2) #2028

Closed bryce-turner closed 10 months ago

bryce-turner commented 10 months ago

Using 1.18 for this example, but this behavior also exists in 1.17.

Using the following example.vcf:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##contig=<ID=chr13,length=114364328>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr13   10  .   A   ACCAAACT    .   PASS    DP=100

If I understand the documentation correctly we would expect bcftools view --targets chr13:11-20 --targets-overlap 1 example.vcf to return the record starting at POS 10 since the end of the ALT is within chr13:11-20. But instead the output is empty:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##contig=<ID=chr13,length=114364328>
##bcftools_viewVersion=1.18+htslib-1.18
##bcftools_viewCommand=view --targets chr13:11-20 --targets-overlap 1 example.vcf; Date=Thu Oct 26 16:36:18 2023
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO

I've also tested this with --regions (after converting to vcf.gz and indexing of course) and had the same behavior:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##contig=<ID=chr13,length=114364328>
##bcftools_viewVersion=1.18+htslib-1.18
##bcftools_viewCommand=view -o example.vcf.gz -Oz example.vcf; Date=Thu Oct 26 16:44:38 2023
##bcftools_viewCommand=view --regions chr13:11-20 --regions-overlap 1 example.vcf.gz; Date=Thu Oct 26 16:45:15 2023
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO

Similarly tested with --regions-overlap 2:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##contig=<ID=chr13,length=114364328>
##bcftools_viewVersion=1.18+htslib-1.18
##bcftools_viewCommand=view -o example.vcf.gz -Oz example.vcf; Date=Thu Oct 26 16:44:38 2023
##bcftools_viewCommand=view --regions chr13:11-20 --regions-overlap 2 example.vcf.gz; Date=Thu Oct 26 16:49:51 2023
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO

Is this behavior expected? Also note that this is just a simple example, I can provide a more complex "real" example if needed.

If this behavior is expected, are there any methods for performing a similar process? Say we wanted to filter to variants that specifically overlapped the edge of an exon.

Let me know if there are any clarifications needed from my end!

bryce-turner commented 10 months ago

After some more testing, it looks like in this case we will require the END INFO field. For example, the following vcf works as expected:

##fileformat=VCFv4.4
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the longest variant described in this record">
##contig=<ID=chr13,length=114364328>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr13   10      .       A       ACCCCCCCCC      .       .       END=19
$ bcftools view --targets chr13:18-50 --targets-overlap 1 another_example.vcf
##fileformat=VCFv4.4
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the longest variant described in this record">
##contig=<ID=chr13,length=114364328>
##bcftools_viewVersion=1.18+htslib-1.18
##bcftools_viewCommand=view --targets chr13:18-50 --targets-overlap 1 another_example.vcf; Date=Fri Oct 27 16:51:59 2023
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr13   10      .       A       ACCCCCCCCC      .       .       END=19

Also worth noting the change in VCFv4.4:

  • Redefined END as the end position of the longest ALT allele. Note that END remains Number = 1.

Apologies for not digging deeper before submitting the issue!

jmarshall commented 10 months ago

Your variant is an insertion of several bases between reference positions 10 and 11. It does not affect the base at position 10 itself nor does it affect the base at position 11 itself. And it certainly doesn't affect the bases over at reference position 19 or so, so adding END=19 would not be an accurate representation of your variant.

Requesting chr13:11-20 does not return this variant with any setting of --targets-overlap because position 11 does not touch the variant.

The proper way to ask for a variant between reference bases 10 and 11 is to use an interval that includes both 10 and 11, i.e., includes the bases on both sides of the insertion you're interested in. Intuitively, such intervals are the ones that include the area between reference bases 10 and 11 — where your insertion is. An interval that starts at 11, or one that ends at 10, does not include this area.

So --targets chr13:10-11 --targets-overlap 2 (or any wider interval that contains both 10 and 11) will return this variant. (Modes 0 and 1 for --targets-overlap are, as per the documentation, more generous and less accurate: they would also return this variant for intervals that only include 10.)

bryce-turner commented 10 months ago

I appreciate the additional feedback! It sounds like the most correct solution is simply to expand the interval, instead of adding an inaccuracy to the variant record.