samtools / htslib

C library for high-throughput sequencing data formats
Other
804 stars 446 forks source link

Synced reader does not support regions with contigs including colons #1620

Closed brenochoa closed 1 year ago

brenochoa commented 1 year ago

The synced_bcf_reader currently (as of version 1.17) fails with an error when it is initialised via bcf_sr_regions_init using region strings containing contig names with colons, despite the fact that colons are no longer disallowed characters in contig names since VCFv4.3:

The VCF specification previously disallowed colons (‘:’) in contig names to avoid confusion when parsing breakends, but this was unnecessary. Even with contig names containing colons, the breakend mate position notation can be unambiguously parsed because the “:pos” part is always present.

The underlying _regions_init_string function will likely need to be updated to scan for colons from the back instead of the front to accommodate this change of spec.

pd3 commented 1 year ago

This is not easy to support. The function accepts regions in multiple formats (chr, chr:pos, chr:beg-end, chr:beg-) and with colons in contig names it is impossible to tell if x:1234 refers to the contig 'x' at position 1234 or to a contig named x:1234.

The only solution I can think of is to move the responsibility for resolving ambiguities like this to the user and require full intervals chr:beg-end in cases where chr is of the form STR:NUM, STR:NUM-, or STR:NUM-NUM.

brenochoa commented 1 year ago

Guess they should have thought better before allowing colons in contig names... Looks kinda necessary after all! 😅

jkbonfield commented 1 year ago

We solved this already though elsewhere, with additional notation such as {chra:b}:10-20, or even just parsing from the other end so the last colon is the one that is used. This doesn't work if you rather foolishly created a contig named "chr10:100-200" as that's ambiguous, but that's why we adding the curly brace notation.

The htslib APIs correct support this, so it's possibly simply an issue of synced reader doing its own parsing rather than using the official region parsing API. (I haven't looked.)

jmarshall commented 1 year ago

bcf_sr_regions_init is an unfortunate API that does not have the set of valid contig names available, so the algorithm in Appendix A of the SAM spec is not directly applicable.

However a simplified version of it could be used, and as @jkbonfield noted, the explicit delimiter notation suggested in the appendix could also be supported.

Or perhaps it should be superseded by an API function that is provided with the set of contigs in play in the file(s) to be read.