samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
655 stars 172 forks source link

Represent U in SAM rather than T for reads from RNA #801

Open Psy-Fer opened 3 weeks ago

Psy-Fer commented 3 weeks ago

Before this change https://github.com/samtools/htslib/pull/1854 U was changed to N when read by samtools

Now it will be changed to T

However, I think it would be "better" if we could preserve U in SAM, even when moving SAM->BAM->CRAM->SAM for example.

There is a problem, however, that there is no room in the 4bits BAM uses to represent all 16 IUPAC bases (where T is for T and U).

A solution to this raised by @jmarshall could be to allocate a FLAG bit to indicate an alignment record is RNA, which would then mean the T coming from a BAM, would be written as a U when viewed in SAM.

This would also mean most tools would still work, while building for the future of RNA sequencing methods to represent the base that is actually being measured.

Another solution (though more ad-hoc and less "good") would be to make yet another sam tag, to denote the read is from RNA. This saves using a FLAG bit, but adds more complexity to the solution.

Cheers, James

jkbonfield commented 3 weeks ago

This is possible, but it's something to raise at hts-specs. Htslib is just one tool implementing the widely agreed specification standards, and it won't help the community to fork this. If we add an extra FLAG bit then it needs adding to the specification and then all implementations need to agree and implement it. Hence this wouldn't be a quick fix.

A private space SAM tag wouldn't break the specification at least, but it does still introduce different behaviour in different tools so should really get discussed in the GA4GH meetings too. Can you please raise an issue there regarding how to indicate RNA vs DNA?

An alternative would simply be to add a header field so the user is aware that this is an RNA library and all Ts are Us. This feels like a safer option that doesn't risk breaking downstream processing tools. It could naturally fit into the @RG header with minimal impact beyond that. Software could then decide whether to apply the T->U translation based on that header.

jkbonfield commented 3 weeks ago

Ah I see https://github.com/samtools/hts-specs/issues/800 already, which is related. I should have finished reading all my emails. :)

jkbonfield commented 3 weeks ago

Good point - I forgot I can just transfer things. :)

Edit: now transferred.

jmarshall commented 3 weeks ago

The context for this issue is this thread. On brief consideration, it seems to me that a header field indicator would not work brilliantly when merging a U file and a T file — not that anyone should be merging DNA and RNA BAMs perhaps! Hence the suggestion to indicate this per-read.

Since @Psy-Fer mentioned that hybrid cases (reads containing both U and T) would usefully be representable, I'm perhaps leaning more towards a tagged field rather than a flag bit. This could be akin to the methylation tags, e.g. an array tag identifying the T bases that are actually U bases, along with some succinct special value indicating that all of this read's Ts are actually Us.

But the flag bit idea does have an attractive simplicity…

jkbonfield commented 3 weeks ago

I should also raise CRAM here.

BAM we know only stores data in nibbles and it's simply impossible for it to represent U as U. It must be T and use a side-channel to flag it.

CRAM stores bases as characters, so technically it can store anything there. However, it's also (ideally) referenced based encoding for aligned data. It would be a tragic mistake if for RNA we started storing an edit for each and every U vs T in the genome! Given we need a side-channel mechanism for BAM, I would advocate the same mechanism is applied with CRAM. Ie we present T to the encoder and, if desired, have the decoder translate back on-the-fly.

Note that "if desired" is key. There are many tools out there that cannot accept U. Minimap2 may, but from what I can see bwa doesn't. I imagine this is true of a lot of aligners where they just do some hashing technique and may not have remembered to add U=T into the table. So automatically round tripping and giving the same data back may not always be the solution that is most useful to the user. This also leads us to the side-channel approach, even for SAM.

My personal preference would be for updating the @RG header as I don't personally see the need for a flag given it's a fundamental property of the entire library as far as I can tell. Hypothetically we could have sequencing instruments which can on-the-fly disambiguate U from T. Theoretically ONT could, but I assume you specify it front which DNA / RNA calibration you're using when base calling as it'd considerably improve the accuracy of calls and also remove impossibilities such as calling RNA with a smattering of DNA bases or vice versa.

jkbonfield commented 3 weeks ago

Since @Psy-Fer mentioned that hybrid cases (reads containing both U and T) would usefully be representable, I'm perhaps leaning more towards a tagged field rather than a flag bit. This could be akin to the methylation tags, e.g. an array tag identifying the T bases that are actually U bases, along with some succinct special value indicating that all of this read's Ts are actually Us.

@Psy-Fer can you please give more information on the hybrid case. How does this arise? Is it biologically relevant, or just an artifact of a calling pipeline?

Psy-Fer commented 3 weeks ago

Hybrid cases are not really an issue, only thinking in terms of possibilities in the future.

Currently dRNA sequencing with ONT for example has a DNA adapter with RNA strand. But the basecaller doesn't basecall the adapter, so you never actually get those sequences. This doesn't mean that would always be the case though, and just something ONT has decided. Anyone that works with ONT will note that anything can change, so I was just pointing it out as something to keep in mind after it was raised by David Eccles on mastodon. There are also other companies that are coming out with tech that can sequence RNA directly, so I'm thinking getting this kind of thing sorted now is in everyone's best interest before weird ad-hoc solutions start getting created.

As for a not on the @RG tag, you can have multiple RGs, and when mergin, you will get the RGs from both files, and each read links to its own unique RG (if the names are unique, which they hopefully are). So yes, this could potentially be a nice way of doing it.

jkbonfield commented 3 weeks ago

I'm not sure practically speaking it matters much about library construction here for SAM/BAM et al. The adapters would get trimmed away anyway pre alignment. If we consider an unaligned file then there needs to be aditional meta-data stored already to indicate the number of bases to trim etc, so that can be used too to indicate the adapter is DNA if needed.

Normally in Illumina world we move data from the sequence (eg barcodes) to aux tags prior to storing in unaligned BAM, so I would expect a similar thing to happen here, meaning the SAM field is only still the template that the user wanted sequencing. Pragmatically speaking, that's the way to go to leverage all the existing tools as aligners etc no longer need to be aware of platform specific adapters as SEQ is pre-trimmed for them.

Edit: there are some technologies where we circularise and do a long read across the junction, so SEQ -> adapter -> SEQ, with the adapter aware aligner then recognising this. I guess in theory such data could then go from RNA to DNA to RNA, but pragmatically I doubt it matters much as far as analysis goes given aligners treat both T and U the same anyway. Also, such library techniques are doing this because they don't know where the junction is and it's inferred later on, so they're not going to be writing out a mix of U and T as they don't know where one part finishes and the next starts.

jmarshall commented 3 weeks ago

Yes, header-wise I was imagining something on the @HD line. But an attribute on the @RG lines makes more sense and side-steps the merging problem very nicely. :+1:

michaelmhoffman commented 1 week ago

Keep in mind that epigenetic sequencing assays often do things like convert certain DNA nucleosides to deoxyuridine. The best-known of these is bisulfite sequencing. Now bisulfite sequencing is usually followed by PCR which changes all the deoxyuracils into deoxythymines but with the single-molecule sequencing techniques around these days, I wouldn't want to assume that you're never going to want to distinguish between thymine and uracil in the same read.

jkbonfield commented 1 week ago

That's a good point.

However fundamentally it breaks BAM. BAM simply cannot encode U. So much goes via BAM (including htslib's core in-memory format due to where it came from), that not treating U as T is tantamount to simply treating U as N. Allowing SAM to have greater specificity than BAM was, err, a "courageous decision". Or rather it was likely simply a sloppy translation from Apple Pages format to LaTeX. (I think @jmarshall commented on this during the last meeting.)

So my take on this is we treat U as T in the SEQ column, and if it matters for our particular assay then we add base modification tags to those Ts to describe them as U. That's about the only backwards compatible approach I can think of that captures the information. It's doable right now via the existing base modification tags and a CHEBI code, but maybe we should make an exception here and add a short code for it given it's a fundamentally well understood and accepted letter.

There is a secondary topic of what should decoders do. If they see a record with the PG field indicating it's RNA, then in theory they should decode all the T bases as U again. However that may well cause breakage in other software, so I think it should be permitted behaviour to not do that and let the user decide.

michaelmhoffman commented 1 week ago

The base modification treatment is a good idea. I would also support a U short code.

In short, I don't think you need to make any effort to enable distinguishing U and T in the same read now, but you may want to during the life of these standards. A Lindy effect estimate might be ~15 years from now, or 2039. So I would avoid baking in further assumptions that this won't happen. (Unless doing so provides great benefit!)