Closed brenochoa closed 1 year ago
This is not easy to support. The function accepts regions in multiple formats (chr
, chr:pos
, chr:beg-end
, chr:beg-
) and with colons in contig names it is impossible to tell if x:1234
refers to the contig 'x' at position 1234 or to a contig named x:1234
.
The only solution I can think of is to move the responsibility for resolving ambiguities like this to the user and require full intervals chr:beg-end
in cases where chr
is of the form STR:NUM
, STR:NUM-
, or STR:NUM-NUM
.
Guess they should have thought better before allowing colons in contig names... Looks kinda necessary after all! 😅
We solved this already though elsewhere, with additional notation such as {chra:b}:10-20, or even just parsing from the other end so the last colon is the one that is used. This doesn't work if you rather foolishly created a contig named "chr10:100-200" as that's ambiguous, but that's why we adding the curly brace notation.
The htslib APIs correct support this, so it's possibly simply an issue of synced reader doing its own parsing rather than using the official region parsing API. (I haven't looked.)
bcf_sr_regions_init
is an unfortunate API that does not have the set of valid contig names available, so the algorithm in Appendix A of the SAM spec is not directly applicable.
However a simplified version of it could be used, and as @jkbonfield noted, the explicit delimiter notation suggested in the appendix could also be supported.
Or perhaps it should be superseded by an API function that is provided with the set of contigs in play in the file(s) to be read.
The synced_bcf_reader currently (as of version 1.17) fails with an error when it is initialised via
bcf_sr_regions_init
using region strings containing contig names with colons, despite the fact that colons are no longer disallowed characters in contig names since VCFv4.3:The underlying
_regions_init_string
function will likely need to be updated to scan for colons from the back instead of the front to accommodate this change of spec.