Closed daviesrob closed 1 year ago
Merged as it works and fixes a problem, but I do wonder if there are more things wrong than this in #1534.
It sounds like we have a way to create an index for a BCF file where chromosomes don't appear in the index. This PR makes us no longer fail with such indices, but it doesn't fix the underlying problem that we still have indices with missing chromosomes. Our code works, but what about other peoples code? Do we know whether Picard would work on such data without fixing our indices to include them?
Is it a hard requirement that indices must contain an entry for every chromosome in the header, irrespective of whether they have any coverage? I looked at the VCF spec but it has a woeful single sentence describing indices! (Basically "same as BAM" which is a bit misleading.) The BAM spec doesn't explicitly state that all references have BAI entries, but experimentally it does appear to be the case. If BCF eqvuialent doesn't do this then perhaps it is still in error.
Having index entries with no data is normal (n_bin
in the entry would be zero; look at the diagram in the samtools spec. for context). The difference in this case is that the tid
value was higher than n_ref
, so it wouldn't appear in the list of indicies. It's possible that Picard might not like that, but you do have to try quite hard to make an index that does this so it's unlikely many exist in practice - and as the way found involved indexing a file with no variant records such files are unlikely to be much use.
https://github.com/samtools/htslib/issues/1534#issuecomment-1351496991 mentions that we might want to fix the index building side. I could try to make a PR for that if you'd like a belt-and-braces approach.
Don't error when making an iterator on a tid not in the index, instead return one that instantly finishes.
Fixes #1534, which was an edge case where an index did not include an entry for a chromosome that was mentioned in the file header but had no data records. Normally these would be present but empty, but it was possible to use the
IDX=
key in a VCF file to make an index where the chromosome simply did not appear. In this case, rather than an error, we want to return the equivalent ofHTS_IDX_NONE
so the iterator produces no data.Another scenario where this is useful is if you build an index, and then try to use it immediately without first saving and reading it back in again. Such an index will have NULL entries in
bidx[]
for any chromosomes with no data. Again we want to return anHTS_IDX_NONE
iterator if one of those chromosomes is queried. (This issue didn't usually occur because most programs are loading in an existing index, andidx_read_core()
makesbidx[]
entries for everything even if there's nothing in the index for the chromosome.)Note that this changes
vcf_loop()
intest_view.c
so that it now treatsbcf_itr_querys()
failures as an error. The new behaviour matchessam_loop()
and is needed to detect the problem being fixed here. All the other tests still work after this change so nothing was relying on the old behaviour of ignoring the errors.