samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

Storing definitions for custom tags used in SAM file #710

Open cmdcolin opened 1 year ago

cmdcolin commented 1 year ago

I was wondering if there is a way, or a specification, for SAM headers to describe which custom tags a file uses, for example the lower-case and X/Y/Z-prefixed tags. My angle on this is just showing users at a glance what various fields mean in a genome browser, but I can imagine it being useful in other circumstances.

VCF kind of has this, e.g. "1.4.4 Individual format field format", which allows a file to self-describe the custom fields in its FORMAT column.

It could make it easier for a human to understand a data file at a glance. Possible caveats:

examples of CSQ and ANN

```
##INFO=
##INFO=
```

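
For reference, here is a minimal sketch of how a reader can pick up VCF's self-describing header lines. It assumes the structured meta-line form `##KIND=<key=value,...>`, and uses the usual genotype `##FORMAT` line as sample input. The names `parse_vcf_meta`, `META_RE` and `PAIR_RE` are made up for illustration, and backslash escapes inside quoted values are left untouched.

```python
import re

# Structured VCF meta-lines look like ##FORMAT=<ID=...,Number=...,...>
META_RE = re.compile(r'^##(?P<kind>\w+)=<(?P<body>.*)>$')
# key=value pairs; a value may be double-quoted and then contain commas
PAIR_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')

def parse_vcf_meta(line):
    """Return (kind, fields) for a structured meta-line, else None."""
    m = META_RE.match(line.strip())
    if m is None:
        return None
    fields = {}
    for key, value in PAIR_RE.findall(m.group("body")):
        # Strip the surrounding quotes from quoted values
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return m.group("kind"), fields

example = '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">'
kind, fields = parse_vcf_meta(example)
print(kind, fields["ID"], fields["Description"])
```

A genome browser can build its field tooltips from the resulting dictionary, which is the kind of at-a-glance description SAM currently has no equivalent for.
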
jkbonfield commented 1 year ago

I like this idea, but sadly currently it doesn't exist.

It'd need to be in @CO (comment) header lines to avoid breaking existing parsers that validate the headers, at least until that mythical time we develop SAM 2.0. That's not ideal, but we are where we are.

I guess we could carve out a namespace within @CO for additional commentary. E.g.:

@CO @TAG    ID:X0   TY:i    DS:Number of best hits

You're perfectly at liberty to start doing this already, although it'd obviously need buy-in from the genome browsers. I'm not sure we'd want to add something formal to the specification unless we see active buy-in from multiple implementations.
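
A minimal sketch of how a reader (a genome browser, say) might pick up such lines. This assumes the `@CO @TAG` convention floated above, which is not part of the SAM specification. It also assumes the `ID`/`TY`/`DS` fields are tab-separated like other header fields. The function name `parse_tag_definitions` is hypothetical.

```python
def parse_tag_definitions(header_lines):
    """Collect hypothetical '@CO @TAG' definitions, keyed by tag ID."""
    definitions = {}
    for line in header_lines:
        fields = line.rstrip("\r\n").split("\t")
        # Accept both "@CO<TAB>@TAG" and "@CO @TAG" as the line prefix
        if fields[0] == "@CO" and len(fields) > 1 and fields[1] == "@TAG":
            pairs = fields[2:]
        elif fields[0] == "@CO @TAG":
            pairs = fields[1:]
        else:
            continue  # ordinary header line or free-text comment
        record = {}
        for pair in pairs:
            # Split on the first colon only, so DS text may contain colons
            key, sep, value = pair.partition(":")
            if sep:
                record[key] = value
        if "ID" in record:
            definitions[record["ID"]] = record
    return definitions

header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@CO\t@TAG\tID:X0\tTY:i\tDS:Number of best hits",
    "@CO\tsome unrelated free-text comment",
]
tags = parse_tag_definitions(header)
print(tags["X0"]["DS"])  # prints: Number of best hits
```

Because the definitions live inside comment lines, parsers that know nothing about the convention simply skip them, which is the compatibility property the `@CO` approach is after.
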