samtools / htslib

C library for high-throughput sequencing data formats
Other
809 stars 446 forks source link

Don't error when making an iterator on a tid not in the index #1545

Closed daviesrob closed 1 year ago

daviesrob commented 1 year ago

Don't error when making an iterator on a tid not in the index, instead return one that instantly finishes.

Fixes #1534, which was an edge case where an index did not include an entry for a chromosome that was mentioned in the file header but had no data records. Normally these would be present but empty, but it was possible to use the IDX= key in a VCF file to make an index where the chromosome simply did not appear. In this case, rather than an error, we want to return the equivalent of HTS_IDX_NONE so the iterator produces no data.

Another scenario where this is useful is if you build an index, and then try to use it immediately without first saving and reading it back in again. Such an index will have NULL entries in bidx[] for any chromosomes with no data. Again we want to return an HTS_IDX_NONE iterator if one of those chromosomes is queried. (This issue didn't usually occur because most programs are loading in an existing index, and idx_read_core() makes bidx[] entries for everything even if there's nothing in the index for the chromosome.)

Note that this changes vcf_loop() in test_view.c so that it now treats bcf_itr_querys() failures as an error. The new behaviour matches sam_loop() and is needed to detect the problem being fixed here. All the other tests still work after this change so nothing was relying on the old behaviour of ignoring the errors.

jkbonfield commented 1 year ago

Merged as it works and fixes a problem, but I do wonder if there are more things wrong than this in #1534.

It sounds like we have a way to create an index for a BCF file where chromosomes don't appear in the index. This PR makes us no longer fail with such indices, but it doesn't fix the underlying problem that we still have indices with missing chromosomes. Our code works, but what about other peoples code? Do we know whether Picard would work on such data without fixing our indices to include them?

Is it a hard requirement that indices must contain an entry for every chromosome in the header, irrespective of whether they have any coverage? I looked at the VCF spec but it has a woeful single sentence describing indices! (Basically "same as BAM" which is a bit misleading.) The BAM spec doesn't explicitly state that all references have BAI entries, but experimentally it does appear to be the case. If BCF eqvuialent doesn't do this then perhaps it is still in error.

daviesrob commented 1 year ago

Having index entries with no data is normal (n_bin in the entry would be zero; look at the diagram in the samtools spec. for context). The difference in this case is that the tid value was higher than n_ref, so it wouldn't appear in the list of indicies. It's possible that Picard might not like that, but you do have to try quite hard to make an index that does this so it's unlikely many exist in practice - and as the way found involved indexing a file with no variant records such files are unlikely to be much use.

https://github.com/samtools/htslib/issues/1534#issuecomment-1351496991 mentions that we might want to fix the index building side. I could try to make a PR for that if you'd like a belt-and-braces approach.