samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

Modified base single letter codes update #741

Closed marcus1487 closed 4 months ago

marcus1487 commented 10 months ago

The modified base tags section of the specification lists only a limited number of modified bases. At nanopore we are working on a number of other modified bases and are wondering whether we might be able to expand or clarify the specification of these single letter modified base codes. Namely, we are interested in 4mC in DNA and m6A in RNA.

For 4mC I am not aware of an accepted single letter code, but this is one of the most common bacterial methylation marks, so it would seem to be one of the highest priority modified bases to get a single letter code. If a single letter code can be agreed upon it would be great to add this to the specification. Alternatively we can use the ChEBI code for 4mC (21839).

For m6A in RNA it would be good to add to the specification whether we should use the "a" code reserved for 6mA in DNA in the current specification, or whether a new code should be adopted to avoid confusion with the DNA single letter code. The ChEBI code would be the same, as the chemical structures are identical for DNA 6mA and RNA m6A, so using the ChEBI code does not seem like a good idea given that there is a single letter code ("a") for this ChEBI code. Not sure what is the "correct" thing to do here, but looking for thoughts from the community.

jkbonfield commented 10 months ago

I'm not wanting hts-specs to become the arbiter of such things when there are entire (much better staffed and supported) groups that are handling this already. We also run the risk of inventing a code ourselves and then ChEBI inventing a different code, or reusing our code for another base type. Long term this wouldn't aid anyone. Rather we should just track and mirror the official ChEBI nomenclature instead. If there are short codes documented there that aren't in our spec, then I think it's fine to add them in. If there is something missing that needs adding, it should be raised with ChEBI itself to go via their channels first. You make a compelling case for 4mC so I'd hope they will consider it.

There was discussion at some point about creating local codes, where we could put a code in a header with ChEBI ID and then refer to that code within the data, but it adds complexity, potentially huge when merging files, extending headers is hard to do given the state of a lot of software, and ultimately it saves very little. Rather it may be best to simply use the header comment fields to do the reverse - document the ChEBI codes so people looking at the data can see what it is without having to hunt down the definitions.

marcus1487 commented 10 months ago

For the arbiters of the single letter code I completely understand the reasoning to avoid this, but in the absence of a pointer to the arbiter the table in the SAM tags spec sort of becomes the de facto arbiter. In fact, looking at the 5mC ChEBI page I don't see a specific mention of "m" as the single letter code. The only place I know of that specifies the single letter codes would be @michaelmhoffman's DNA mods database. I'm not sure if Michael would like to claim the role of arbiter for DNA single letter codes or if there is another source for these codes that we could use as the arbiter of single letter codes.

The other issue here is not around whether one can determine the modified base of interest, but with how much ease one can identify the modified base. When using ChEBI codes, most users will interact with these codes in a genome browser and see integer labels for the various modifications of interest. They would then have to refer back to the SAM header or look up the ChEBI code to figure out which modified base this is. So there is certainly some added value to the modified base single letter code for being used for the most common bases. I'm happy to make the case for 4mC where it would carry the most weight, but this does not seem to be ChEBI to me.

The annotation of the modified base codes used, even for single letter codes, makes sense in the SAM header comment lines. We are aiming to include this in output formats at nanopore.

marcus1487 commented 10 months ago

@jkbonfield or others, do you have any thoughts on this topic? It seems that ChEBI is not quite up to the task for this specification. Can you suggest where we might submit a request to have these single letter codes updated?

jkbonfield commented 9 months ago

Sorry for the slow reply. No, unfortunately I don't know who is the best here. Our original table was taken from the Viner et al. paper (https://www.biorxiv.org/content/10.1101/043794v1), which was basically written by experts in the field. None of the SAM maintainers are qualified to be deciding on this sort of thing, and even if we were, we'd just run the risk of forking things and causing multiple nomenclatures to appear.

I, apparently wrongly, assumed that the other short codes would make their way into ChEBI as that's also referred to in the paper, but it may not have happened. I still think ChEBI feels like the natural place to go to, but all I can advise is speaking directly with them or Coby Viner to ask how to get new codes accepted. We're happy to follow the community consensus here, but it does need consensus first.

jkbonfield commented 7 months ago

I think some of this could also be addressed by the genome browsers. For example, when they see a base mod 21839, they could turn it into a tooltip linking to https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:21839. We could obviously add some comments to the SAM headers, but structured comments feel error prone and it doesn't really solve anything, as a genome browser won't be looking there without having an update, in which case pointing to ChEBI instead feels like the more natural fix.
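As a rough illustration of the tooltip idea, a viewer could turn a numeric modification code into a ChEBI lookup link along these lines. This is only a sketch: the function name `chebi_link` is made up for the example, the URL pattern is simply the one quoted above, and it assumes that any all-digit code is a ChEBI ID while single letter codes get no link.

```python
from typing import Optional

# URL pattern copied from the link quoted in this comment.
CHEBI_URL = "https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:{chebi_id}"


def chebi_link(mod_code: str) -> Optional[str]:
    """Return a ChEBI lookup URL for a numeric modification code.

    Hypothetical helper: assumes an all-digit code is a ChEBI ID.
    Non-numeric codes (e.g. single letter codes) return None.
    """
    if mod_code.isascii() and mod_code.isdigit():
        return CHEBI_URL.format(chebi_id=mod_code)
    return None


print(chebi_link("21839"))  # link for the 4mC ChEBI code mentioned above
print(chebi_link("a"))      # single letter code, so no link
```

A browser would still need its own handling for the single letter codes, since they are not ChEBI identifiers.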

I've also pinged @michaelmhoffman regarding whether there is a way to add new codes to the DNA mods database.

jkbonfield commented 7 months ago

GA4GH File Formats isn't willing to be the maintainer for such things and the view from upstream is that it's too premature to add new short codes for these, so for now all I can recommend is adding @CO tags to annotate the SAM file for humans.
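A minimal sketch of what such annotation could look like when generating a header. The wording inside each comment is purely an assumption for illustration, since the specification treats @CO lines as free text and defines no layout for this; the helper name `modification_comments` is likewise hypothetical.

```python
def modification_comments(mods):
    """Build @CO header lines describing the modification codes in a file.

    `mods` maps a modification code (single letter or ChEBI number) to a
    human-readable description. The text layout after "@CO\t" is an
    illustrative convention, not something defined by the SAM spec.
    """
    lines = []
    for code, description in mods.items():
        if "\n" in description or "\r" in description:
            raise ValueError("a header comment must fit on one line")
        lines.append(f"@CO\tbase modification code {code}: {description}")
    return lines


# Example using the codes discussed in this thread.
for line in modification_comments({"21839": "4mC (ChEBI:21839)", "a": "6mA"}):
    print(line)
```

These lines are meant for people reading the file; as noted earlier in the thread, tools would not pick them up without an update.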

jkbonfield commented 4 months ago

Closing this as "not planned", for now at least. We don't have the appropriate skills in GA4GH for maintaining such a database, so we'll just follow the community / upstream portals, i.e. ChEBI or DNA Mods DB. If new things appear in there, please do raise a ticket for us to add them to our specifications.