samtools / htslib

C library for high-throughput sequencing data formats
Other
804 stars 446 forks source link

Allow repeated calls of bcf_sr_set_regions #1624

Closed pd3 closed 1 year ago

pd3 commented 1 year ago

and make repeated bcf_sr_seek()+next_line() calls consistent.

Resolves #1623 and https://github.com/samtools/bcftools/issues/1918

pd3 commented 1 year ago

In past we tried to use the indexes to print some stats and realized they don't carry the same information. I don't remember the details, but it had something to do with contig names not being part of the tbi index, I think.

Therefore for this issue I concluded in the end that the return status cannot be made fully equivalent for both and looked for a different solution.

jkbonfield commented 1 year ago

If you are proposing that it is correct that BCF and VCF have different return values for certain situations, then they should be documented. I can see no information on the expected return values in this PR.

Personally though, I think they should be massaged to be equal as far as is possible, with the potential (documented) caveat for VCF files that have no header. They aren't the norm from what I can see.

daviesrob commented 1 year ago

I thought that might be the case. Although it may be possible to get more consistency by adding this change:

diff --git a/synced_bcf_reader.c b/synced_bcf_reader.c
index 702f260..fc48c03 100644
--- a/synced_bcf_reader.c
+++ b/synced_bcf_reader.c
@@ -521,7 +521,8 @@ static int _reader_seek(bcf_sr_t *reader, const char *seq, hts_pos_t start, hts_
     if ( reader->tbx_idx )
     {
         int tid = tbx_name2id(reader->tbx_idx, seq);
-        if ( tid==-1 ) return -1;    // the sequence not present in this file
+        if ( tid==-1 )               // the sequence not present in this index
+            return reader->header && bcf_hdr_name2id(reader->header, seq) >= 0 ? 0 : -1;
         reader->itr = tbx_itr_queryi(reader->tbx_idx,tid,start,end+1);
     }
     else

So as long as the header has ##contig entries it will behave the same as with BCF, and if it doesn't have any then it'll be no worse off than before. If it returns at this point, reader->itr will be NULL, but that should be OK.

Here's the test program output with that change:

/tmp/test_synced_reader /tmp/test.vcf.gz
seek: 0
      1
seek: 0
      1
seek: 0
      1
daviesrob commented 1 year ago

I expanded the original test a bit, which showed up another issue. The updated VCF file is:

##fileformat=VCFv4.2
##reference=file://some/path/human_g1k_v37.fasta
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  a
17  1   .   T   G   .   .   .   GT  0/0
19  2   .   C   A   .   .   .   GT  0/0
19  3   .   G   C   .   .   .   GT  0/0
19  4   .   T   C   .   .   .   GT  0/1
19  5   .   A   C   .   .   .   GT  1/1

and the test program now looks like this:

#include <stdio.h> 
#include <stdarg.h> 
#include "htslib/synced_bcf_reader.h"

void error(const char *format, ...) 
{
    va_list ap;
    va_start(ap, format);
    vfprintf(stderr, format, ap);
    va_end(ap);
    exit(-1);
}

void print_next_line(bcf_srs_t *sr)
{
    int count = bcf_sr_next_line(sr);
    bcf1_t *line = count > 0 ? bcf_sr_get_line(sr, 0) : NULL;
    fprintf(stderr, "count %d\n", count);
    if (line) {
      fprintf(stderr, "  tid %d pos %"PRIhts_pos"\n", line->rid, line->pos);
    } else {
      fprintf(stderr, "  NULL\n");
    }
}

int main(int argc, char *argv[])
{   
    bcf_srs_t *sr = bcf_sr_init();
    sr->require_index = 1;
    if ( bcf_sr_add_reader(sr,argv[1])!=1 ) error("Unable to read %s\n",argv[1]);
    fprintf(stderr,"seek: %d\n",bcf_sr_seek(sr,"17",0));
    print_next_line(sr);
    fprintf(stderr,"seek: %d\n",bcf_sr_seek(sr,"18",0));
    print_next_line(sr);
    fprintf(stderr,"seek: %d\n",bcf_sr_seek(sr,"19",0));
    print_next_line(sr);
    fprintf(stderr,"seek: %d\n",bcf_sr_seek(sr,"20",0));
    print_next_line(sr);
    bcf_sr_destroy(sr);
    return 0;
}

So it tries to access a missing chromosome 18, which is no longer at the start of the file, and it also simulates fetching the next line and printing out the tid and position for the record returned. Running it on the original (test.vcf.gz, test.bcf) and new (test2.vcf.gz, test2.bcf) files gives the following:

./test_synced_reader test.vcf.gz
seek: -1
count 1
 tid 2 pos 1
seek: -1
count 1
 tid 2 pos 1
seek: 0
count 1
 tid 2 pos 1
seek: -1
count 1
 tid 2 pos 1

./test_synced_reader test.bcf
seek: 0
count 1
 tid 2 pos 1
seek: 0
count 1
 tid 2 pos 1
seek: 0
count 1
 tid 2 pos 1
seek: -1
count 1
 tid 2 pos 1

./test_synced_reader test2.vcf.gz
seek: 0
count 1
 tid 0 pos 0
seek: -1
count 1
 tid 0 pos 0
seek: 0
count 1
 tid 2 pos 1
seek: -1
count 1
 tid 0 pos 0

./test_synced_reader test2.bcf
seek: 0
count 1
 tid 0 pos 0
seek: 0
count 1
 tid 2 pos 1
seek: 0
count 1
 tid 2 pos 1
seek: 0
count 0
  NULL

So when bcf_sr_seek() can't find any data for the requested seq, on bcf it starts reading from the next reference, but on vcf.gz it goes back to the start of the file, which probably isn't ideal. I don't think it's actually easily possible to make them behave in exactly the same way due to differences in the indexes. It is possible to make the vcf.gz one return no data for absent sequences though with the following change:

diff --git a/synced_bcf_reader.c b/synced_bcf_reader.c
index 702f260e..a6f901a8 100644
--- a/synced_bcf_reader.c
+++ b/synced_bcf_reader.c
@@ -872,7 +872,12 @@ int bcf_sr_seek(bcf_srs_t *readers, const char *seq, hts_pos_t pos)
     //  - find the requested iseq (stored in the seq_hash)
     //  - position regions to the requested position (bcf_sr_regions_overlap)
     bcf_sr_seek_start(readers);
-    if ( khash_str2int_get(readers->regions->seq_hash, seq, &i)>=0 ) readers->regions->iseq = i;
+    if ( khash_str2int_get(readers->regions->seq_hash, seq, &i)>=0 ) {
+        readers->regions->iseq = i;
+    } else {
+        // Ensure we don't try to continue on from a non-existent reference seq
+        readers->regions->iseq = -1;
+    }
     _bcf_sr_regions_overlap(readers->regions, seq, pos, pos, 0);

     for (i=0; i<readers->nreaders; i++)

The result is then:

./test_synced_reader test.vcf.gz
seek: -1
count 0
  NULL
seek: -1
count 0
  NULL
seek: 0
count 1
 tid 2 pos 1
seek: -1
count 0
  NULL

./test_synced_reader test.bcf
seek: 0
count 1
 tid 2 pos 1
seek: 0
count 1
 tid 2 pos 1
seek: 0
count 1
 tid 2 pos 1
seek: -1
count 0
  NULL

./test_synced_reader test2.vcf.gz
seek: 0
count 1
 tid 0 pos 0
seek: -1
count 0
  NULL
seek: 0
count 1
 tid 2 pos 1
seek: -1
count 0
  NULL

./test_synced_reader test2.bcf
seek: 0
count 1
 tid 0 pos 0
seek: 0
count 1
 tid 2 pos 1
seek: 0
count 1
 tid 2 pos 1
seek: 0
count 0
  NULL

So when bcf_sr_seek() fails, bcf_sr_next_line() more reliably returns no data. I decided that trying to get exactly the same results from BCF and VCF isn't going to be easy, so I haven't included the other patch above. Both HTSlib and BCFtools are happy when running make check with this additional change. @pd3 Do you think this looks reasonable?

pd3 commented 1 year ago

Following up on our discussion offline, the bcf_sr_seek() call is currently used only in consensus.c, its use now deprecated by https://github.com/samtools/bcftools/pull/1937. And in vcfconcat.c https://github.com/samtools/bcftools/blob/4d0d262b020386b940ec7f60f636097fa1d4cc3a/vcfconcat.c#L501.

There I don't immediately see if it can be replaced with the updated bcf_sr_set_regions() function or not. It is used only in the phased concatenation mode, see the tests with the -l, --ligate option https://github.com/samtools/bcftools/blob/4d0d262b020386b940ec7f60f636097fa1d4cc3a/test/test.pl#L712-L719

daviesrob commented 1 year ago

Thanks. I'll see if I can come up with something that exercises the missing chromosome case a bit better. Given the prefect overlap requirement for the --ligate option, would it be correct that you'd drop parts where one of the files didn't have data for one of the chromosomes, unless using --ligate-force? Also, should you be able to swap the two input files, and get the same result?

pd3 commented 1 year ago

When two chunks overlap, i.e. share a genomic region, within that region all sites found in one file must be present also in the other or else they are dropped. I believe the program makes the assumption that the VCFs come in the correct order. However, it shouldn't be difficult to reorder to make it robust against swapping of two input files.

daviesrob commented 1 year ago

The file ordering certainly makes a difference for vcf. Here's a couple of test inputs, derived from concat.4.a.vcf and concat.4.b.vcf](https://github.com/samtools/bcftools/blob/develop/test/concat.4.b.vcf):

concat.6.a.vcf:

##fileformat=VCFv4.2
##contig=<ID=chr1>
##contig=<ID=chr2>
##contig=<ID=chr3>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
chr1    1   .   A   C   .   .   .   GT  0|1
chr1    2   .   C   G   .   .   .   GT  0/1
chr2    1   .   A   C   .   .   .   GT  0|1
chr2    2   .   C   G   .   .   .   GT  0/1
chr3    1   .   A   C   .   .   .   GT  0|1
chr3    2   .   C   G   .   .   .   GT  0/1

concat6.b.vcf:

##fileformat=VCFv4.2
##contig=<ID=chr1>
##contig=<ID=chr2>
##contig=<ID=chr3>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
chr1    1   .   A   C   .   .   .   GT  1|0
chr1    2   .   C   G   .   .   .   GT  0/1
chr1    3   .   G   T   .   .   .   GT  0|1
chr3    1   .   A   C   .   .   .   GT  1|0
chr3    2   .   C   G   .   .   .   GT  0/1
chr3    3   .   G   T   .   .   .   GT  0|1

Trying bcftools concat -l both ways round, I get:

./bcftools concat -l /tmp/concat/concat.6.a.vcf.gz /tmp/concat/concat.6.b.vcf.gz
Checking the headers and starting positions of 2 files
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1>
##contig=<ID=chr2>
##contig=<ID=chr3>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.17-53-g4d0d262b+htslib-1.17-39-g7f69840c
##bcftools_viewCommand=view -Oz -o /tmp/concat/concat.6.a.vcf.gz /tmp/concat/concat.6.a.vcf; Date=Thu Jun 15 14:52:15 2023
##FORMAT=<ID=PQ,Number=1,Type=Integer,Description="Phasing Quality (bigger is better)">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase Set">
##bcftools_concatVersion=1.17-53-g4d0d262b+htslib-1.17-30-g3e2190c9-dirty
##bcftools_concatCommand=concat -l /tmp/concat/concat.6.a.vcf.gz /tmp/concat/concat.6.b.vcf.gz; Date=Fri Jun 16 11:59:00 2023
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
chr1    1   .   A   C   .   .   .   GT:PS   0|1:1
chr1    2   .   C   G   .   .   .   GT:PQ:PS    0/1:99:1
chr1    3   .   G   T   .   .   .   GT:PS   1|0:1
chr3    1   .   A   C   .   .   .   GT:PS   0|1:1
chr3    2   .   C   G   .   .   .   GT:PS   0/1:1
chr3    3   .   G   T   .   .   .   GT:PS   1|0:1

./bcftools concat -l /tmp/concat/concat.6.b.vcf.gz /tmp/concat/concat.6.a.vcf.gz
Checking the headers and starting positions of 2 files
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1>
##contig=<ID=chr2>
##contig=<ID=chr3>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.17-53-g4d0d262b+htslib-1.17-39-g7f69840c
##bcftools_viewCommand=view -Oz -o /tmp/concat/concat.6.b.vcf.gz /tmp/concat/concat.6.b.vcf; Date=Thu Jun 15 14:52:21 2023
##FORMAT=<ID=PQ,Number=1,Type=Integer,Description="Phasing Quality (bigger is better)">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase Set">
##bcftools_concatVersion=1.17-53-g4d0d262b+htslib-1.17-30-g3e2190c9-dirty
##bcftools_concatCommand=concat -l /tmp/concat/concat.6.b.vcf.gz /tmp/concat/concat.6.a.vcf.gz; Date=Fri Jun 16 11:59:28 2023
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
chr1    1   .   A   C   .   .   .   GT:PS   1|0:1
chr1    2   .   C   G   .   .   .   GT:PS   0/1:1
chr1    3   .   G   T   .   .   .   GT:PS   1|0:1
chr3    1   .   A   C   .   .   .   GT:PS   0|1:1
chr3    2   .   C   G   .   .   .   GT:PS   0/1:1
chr3    3   .   G   T   .   .   .   GT:PS   0|1:1
chr2    1   .   A   C   .   .   .   GT:PS   0|1:1
chr2    2   .   C   G   .   .   .   GT:PS   0/1:1

So the second is incorrectly ordered compared to the header, and also includes chr2 which the first left out.

I think this comes from how the references are added to the synced reader regions list. Switching to bcftools concat -a, which basically just pulls reads from the synced reader, gives (with a few headers removed to make it easier to read):

./bcftools concat -a /tmp/concat/concat.6.a.vcf.gz /tmp/concat/concat.6.b.vcf.gz
Checking the headers and starting positions of 2 files
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1>
##contig=<ID=chr2>
##contig=<ID=chr3>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
chr1    1   .   A   C   .   .   .   GT  0|1
chr1    1   .   A   C   .   .   .   GT  1|0
chr1    2   .   C   G   .   .   .   GT  0/1
chr1    2   .   C   G   .   .   .   GT  0/1
chr1    3   .   G   T   .   .   .   GT  0|1
chr2    1   .   A   C   .   .   .   GT  0|1
chr2    2   .   C   G   .   .   .   GT  0/1
chr3    1   .   A   C   .   .   .   GT  0|1
chr3    1   .   A   C   .   .   .   GT  1|0
chr3    2   .   C   G   .   .   .   GT  0/1
chr3    2   .   C   G   .   .   .   GT  0/1
chr3    3   .   G   T   .   .   .   GT  0|1

./bcftools concat -a /tmp/concat/concat.6.b.vcf.gz /tmp/concat/concat.6.a.vcf.gz
Checking the headers and starting positions of 2 files
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1>
##contig=<ID=chr2>
##contig=<ID=chr3>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
chr1    1   .   A   C   .   .   .   GT  1|0
chr1    1   .   A   C   .   .   .   GT  0|1
chr1    2   .   C   G   .   .   .   GT  0/1
chr1    2   .   C   G   .   .   .   GT  0/1
chr1    3   .   G   T   .   .   .   GT  0|1
chr3    1   .   A   C   .   .   .   GT  1|0
chr3    1   .   A   C   .   .   .   GT  0|1
chr3    2   .   C   G   .   .   .   GT  0/1
chr3    2   .   C   G   .   .   .   GT  0/1
chr3    3   .   G   T   .   .   .   GT  0|1
chr2    1   .   A   C   .   .   .   GT  0|1
chr2    2   .   C   G   .   .   .   GT  0/1

While if you try with bcf, you'll find that the ordering is the same as the first case for both ways round. The code that adds new entries to the regions list is in bcf_sr_add_reader(). For the vcf files, it's using tbx_seqnames() to get a list of names out of the index and merging them in with any it already has. The tabix style index only lists sequences mentioned in the variant records, which explains the difference. For the a,b ordering the first file lists all of the names and it works as you'd expect. For the b,a ordering, concat.6.b.vcf.gz only adds chr1 and chr3; then when concat.6.a.vcf.gz is added to the reader it finds then chr1 and chr3 are already present, but chr2 is new so it gets added at the end.

I think it would help to get a more consistent output if the reader always used bcf_hdr_seqnames() to add names from the header, and then supplemented them with ones found with tbx_seqnames() that might have sneaked through without a ##contig line. It should also add any new ones it finds in the index to the file header that's going to be output in the same way as the VCF reader does it in add_missing_contig_hrec(). Hopefully the result would then both preserve the global ordering as much as possible, and also output a header with the ##contig lines in an order that matches the data lines that come later.

(Unfortunately even with this, bcftools concat -l still gives a different answer for the two orderings. But I'm not sure exactly what it should do, or if these are even strictly valid inputs.)

In other news, I've confirmed that this fix, with or without my tweak, makes the example in samtools/bcftools#1918 work with bcftools consensus even without having merged samtools/bcftools#1937.

pd3 commented 1 year ago

So the second is incorrectly ordered compared to the header, and also includes chr2 which the first left out.

Just a quick note before analyzing the entire comment: in VCF the order of contigs in the header does not have to match the order of contigs in the body, so this bit is perfectly fine.

daviesrob commented 1 year ago

I did wonder if that might be the case. However, I think there's something to be said for using the ordering in the header if available as it makes vcf and bcf inputs behave in a more similar way. And it may also help the aim of getting bcf_sr_seek() to work better when some chromosomes are absent from some of the input files.

daviesrob commented 1 year ago

I've looked more at bcftools concat -l. I think this check to see if it should add a new reader may be a bit broken because it doesn't check that the chromosome for line matches the one it found when setting args->start_pos[args->ifname]. However if it did come across such an input then it would have already missed all the overlaps on the first chromosome in the new file. I also think this can only happen if you've got at least three input files, so it might be best to give up if it happens as you're no longer properly synchronised.

Other than that, I don't think it's possible for bcftools concat -l to ask bcf_sr_seek() to go anywhere other than deliberately to the start of the file, or to the first chromosome (which it already knows exists). So inconsistencies in its handling of chromosomes that don't appear in any variant records probably aren't going to cause too much trouble, at least in bcftools.

As the bcf_sr_set_regions() change does work, I'll merge this with my comment fix, and sort out other synced reader issues elsewhere.