samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
285 stars 243 forks source link

Accept U as a valid base type and convert to T for BAM. #1728

Open jkbonfield opened 3 weeks ago

jkbonfield commented 3 weeks ago

Before you submit

Make sure your issue is not already in the htsjdk issue tracker

It was discussed in https://github.com/samtools/htsjdk/issues/1478, now closed, but see below

Description of the issue:

The background to this is https://github.com/samtools/samtools/issues/2131 where a user has a SAM file containing U. Certainly we know ONT write fastqs with U (if RNA) and when they write BAM they translate them to T, but I'm still not entirely sure where this SAM came from. It broke htslib, and would also break htsjdk.

Your environment:

I checked the source, so largely irrelevant, but yeah our system supported Java is too old for the latest picard. :/

Steps to reproduce

Create a SAM file with U in the sequence. SamFormatConverter from SAM to BAM fails with:

Exception in thread "main" java.lang.IllegalStateException: Bad base passed to charToCompressedBaseLow: #(35) in read: r1

Expected behaviour

I'd like it to encode U as T in BAM so the data at least can be round-tripped. The user has the option of doing a T to U substitution on decode if they wish later on. Ideally it'd be tracked in the meta-data somewhere too.

See https://github.com/samtools/hts-specs/issues/800 and https://github.com/samtools/hts-specs/issues/801 for context.

Given U is IUPAC, I feel it was an early accident to disallow it. I contast to #1478 I disagree that this is a base modification. There are still 4 base types, but DNA and RNA differ in the chemical structure for T vs U. We don't need to track which bases are T and which are U as it's simply a property of the material. Furthermore IUPAC doesn't permit this. It has no ambiguity codes for e.g. A/U. The only mention of U in the original IUPAC paper was for V listed as not-T or not-U. All other not-? codes are the original base type +1 char. It's clear they chose V over U due to the T/U issue, but it's also clear the authors basically treat T and U as interchangeable and we should do too.

FWIW, I've already made this change to htslib in a merged PR, but not yet in a release.

lbergelson commented 3 weeks ago

Interesting. I figured everyone just represented it as a T from the start. It's not clear to me why it's important to be able to store it as U in an RNA FASTQ or SAM file. It's already an abstraction, it just seems simpler to call it T and save us all the headache. Is there any use case where you could reasonably encounter both and it wouldn't be clear from the context which it is already? I'm unconvinced about the 'hybrid adapter' read that was mentioned.

The only scenarios I can think of where this might be important seem very exotic.

  1. ONT has a new technology that literally takes whole cells and and reads from RNA + DNA simultaneously without being able to tell which it is up front. That would be neat but I haven't heard of it.
  2. Exotic synthetic oligos, but in that case you probably need to be able to represent more things like heavily modified or synthetic bases. (And could be handled by base modification?)

If we need to disambiguate at read time I like the idea you had of tracking it in the read group header. That makes a lot more sense to me than using a FLAG bit. Sure, we have a few left, but why waste them on something useless?

It should be an easy change to allow reading in a FASTQ with U and converting it on the fly. I think we'll break lots of things if we want to natively represent U as an option in the SAMRecord by default.

lbergelson commented 3 weeks ago

Also, update your java! You're way out of warranty and probably using broken crypto libraries or something. :)

jkbonfield commented 3 weeks ago

Also, update your java! You're way out of warranty and probably using broken crypto libraries or something. :)

Lol you don't have to tell me! Our main farm machines are Ubuntu 22 LTS, which has OpenJDK 11 as the package. I've no idea why it's so ancient. The Ubuntu universe packages go up to OpenJDK 21 (or higher?) I think for Ubuntu 22 so we could (and should) be updating such things. Shrug. It's a pain to do all this yourself all the while, especially once you start getting dependencies etc. Tedium, but it's what a lot of groups do, or just run virtual machines / docker containers to avoid the outdated base OS. This all feels like needless effort.

As for U vs T, I couldn't agree more really. It amazed me we didn't silently support it from day 1 (it's in IUPAC afterall). Having said that, it's also surprising anyone really cares. I thought most people just used T as synonymous and go on with life :). Why make problems for yourself?

I think perhaps with ONT though the base caller does have discrete models for the chemical characteristics of U vs T being different, in order to improve accuracy, and consequentially they're probably outputting U vs T as a useful confirmation to the user that they chose the correct calibration files. I can see that does have a benefit, albeit small.

There is discussion on mixed DNA and RNA on the hts-specs issue so you should probably read it there, but in summary it's possible in theory due to DNA adapters added on to RNA, but whether we actually want to honour that in SAM/BAM is another thing. Generally we move adapters or barcodes over to tags by the time we've got to SAM, but it's not always the case in FASTQ hence they're perhaps being very meticulous and precise in their labelling.

PS. Yes I could totally imagine a basecaller with A, C, G, T, U bases in it and some autodetection based on whether most T/U end up T or U (possibly with a second pass to tighten up the calling with the correct base). It would be a neat application to distinguish the two on the fly.