samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

Consider 64-bit values in BCF, BAM and CRAM #735

Open jkbonfield opened 11 months ago

jkbonfield commented 11 months ago

I'm being lazy and didn't feel like making 3 issues, but obviously there will be 3 PRs. If people agree that this is something we wish to move forward with, we can spin out format-specific issues for discussion.

For background see https://github.com/samtools/bcftools/issues/1961. That case is perhaps just an abuse: the data is redundant, and fundamentally it's an identifier that doesn't need to be an enumeration with any meaningful ordering (so it should be a string). Still, it raises the thought that maybe we should have 64-bit data elements in place before we hit an issue that requires them right that instant.

The text formats have limits applied only for interoperability with their binary counterparts, and indeed htslib already supports longer values for some of the fields in SAM (with limited writing out as BAM when present).
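
To make the interoperability limit concrete, here is a minimal Python sketch of how an integer auxiliary value maps onto the integer types BAM defines today (c, C, s, S, i, I). The helper name `smallest_bam_int_type` and its preference order are assumptions for illustration only; this is not how htslib or any other implementation necessarily chooses a type.

```python
# Integer aux types currently defined for BAM: (type code, min, max).
# Ordered so the narrowest type that fits is found first.
BAM_INT_TYPES = [
    ("C", 0, 2**8 - 1),
    ("c", -2**7, 2**7 - 1),
    ("S", 0, 2**16 - 1),
    ("s", -2**15, 2**15 - 1),
    ("I", 0, 2**32 - 1),
    ("i", -2**31, 2**31 - 1),
]


def smallest_bam_int_type(value):
    """Return the narrowest existing BAM integer type code that can hold
    `value`, or None if no current type is wide enough.

    Hypothetical helper, written only to illustrate the range limits.
    """
    for code, lo, hi in BAM_INT_TYPES:
        if lo <= value <= hi:
            return code
    return None
```

Under these assumptions, any value outside [-2^31, 2^32) returns None: it can be written in the text format but has no integer representation in the current binary one.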

jmarshall commented 11 months ago

Java does not really have unsigned (see the first paragraph of https://github.com/samtools/hts-specs/pull/460#issuecomment-565134858), so we should probably only consider adding representations for int64_t, not uint64_t as well. (Surely 63 bits of magnitude is enough for anyone! 😄)

So for BAM, for example, that would mean just l (signed “long” int64_t) and maybe d (double).
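
As a rough illustration of the proposal, the Python sketch below packs a signed 64-bit value as a BAM-style auxiliary field (two-character tag, one type byte, little-endian value). The `l` type code is only the suggestion made in this thread and is not part of any published specification; `encode_aux_int64` and `as_signed64` are hypothetical names. The second helper shows why a signed-only language is awkward for uint64_t: the upper half of the unsigned range reads back as negative numbers.

```python
import struct

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def encode_aux_int64(tag, value):
    """Encode `value` as a hypothetical BAM aux field of type 'l'.

    Layout assumed here: 2-byte tag, 1-byte type code, 8-byte
    little-endian signed integer. The 'l' code is a proposal only.
    """
    if len(tag) != 2:
        raise ValueError("aux tags are exactly two characters")
    if not INT64_MIN <= value <= INT64_MAX:
        raise ValueError("value does not fit in int64_t")
    return tag.encode("ascii") + b"l" + struct.pack("<q", value)


def as_signed64(unsigned_value):
    """Reinterpret a uint64_t bit pattern as int64_t, which is what a
    language without unsigned 64-bit integers would see."""
    return struct.unpack("<q", struct.pack("<Q", unsigned_value))[0]
```

For example, `as_signed64(2**64 - 1)` yields -1, so a reader limited to signed 64-bit integers cannot distinguish large unsigned values from negative ones without extra handling.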

jkbonfield commented 11 months ago

Fair comment - I really hope we don't have a need for full 64-bit unsigned values.