Add ubam support - GitHub issues

rhpvorderman commented 1 year ago

Nanopore reads can be delivered in uBAM format. While a full-fledged BAM processor is a fool's errand (I have been there...), it is actually quite straightforward to parse just the name, sequence and qualities from a uBAM file.

Qualities are encoded with offset 0. Converting them is a simple copy-and-add-33 manoeuvre.
Sequences are encoded with 4 bits per base, but can be converted back with a simple lookup table.
The name can simply be copied.
BAM is basically a bgzip format (a series of concatenated gzip blocks). Since the indexing feature is not needed, a simple gzip reader can stream the entire file. Python-isal will read this quite fast.
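
The four points above can be sketched in pure Python as follows. This is a minimal illustration, not the code that ended up in dnaio: the names `iter_ubam`, `decode_sequence` and `decode_qualities` are made up for this example, the record layout follows the SAM/BAM specification, and the sketch assumes that qualities are present (BAM marks missing qualities with 0xFF bytes, which `decode_qualities` would reject with a `ValueError`).

```python
import gzip
import struct
from typing import BinaryIO, Iterator, Tuple

# 4-bit code -> base, in the order defined by the SAM/BAM specification.
SEQ_LOOKUP = "=ACMGRSVTWYHKDBN"

# Fixed-size fields that follow block_size in every record (32 bytes):
# refID, pos, l_read_name, mapq, bin, n_cigar_op, flag, l_seq,
# next_refID, next_pos, tlen.
RECORD_HEADER = struct.Struct("<iiBBHHHIiii")


def decode_sequence(packed: bytes, length: int) -> str:
    """Unpack two bases per byte (high nibble first) via the lookup table."""
    bases = []
    for byte in packed:
        bases.append(SEQ_LOOKUP[byte >> 4])
        bases.append(SEQ_LOOKUP[byte & 0x0F])
    return "".join(bases[:length])  # drop the padding nibble for odd lengths


def decode_qualities(raw: bytes) -> str:
    """Shift offset-0 Phred scores to the offset-33 ASCII used by FASTQ."""
    return bytes(q + 33 for q in raw).decode("ascii")


def _read_exact(stream: BinaryIO, n: int) -> bytes:
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("truncated BAM stream")
    return data


def iter_ubam(stream: BinaryIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, qualities) from a decompressed BAM stream."""
    if _read_exact(stream, 4) != b"BAM\x01":
        raise ValueError("not a BAM file")
    (l_text,) = struct.unpack("<i", _read_exact(stream, 4))
    _read_exact(stream, l_text)  # skip the SAM header text
    (n_ref,) = struct.unpack("<i", _read_exact(stream, 4))
    for _ in range(n_ref):  # usually zero for uBAM; skip any references
        (l_name,) = struct.unpack("<i", _read_exact(stream, 4))
        _read_exact(stream, l_name + 4)  # reference name plus its length field
    while True:
        size_bytes = stream.read(4)
        if not size_bytes:
            return  # clean end of file
        if len(size_bytes) != 4:
            raise EOFError("truncated BAM stream")
        (block_size,) = struct.unpack("<i", size_bytes)
        block = _read_exact(stream, block_size)
        fields = RECORD_HEADER.unpack_from(block)
        l_read_name, n_cigar_op, l_seq = fields[2], fields[5], fields[7]
        pos = RECORD_HEADER.size
        # The stored name includes a trailing NUL byte, which is dropped here.
        name = block[pos:pos + l_read_name - 1].decode("ascii")
        pos += l_read_name + 4 * n_cigar_op  # unaligned reads have no CIGAR
        packed_len = (l_seq + 1) // 2
        sequence = decode_sequence(block[pos:pos + packed_len], l_seq)
        pos += packed_len
        qualities = decode_qualities(block[pos:pos + l_seq])
        yield name, sequence, qualities
```

Because Python's `gzip` module reads concatenated gzip members transparently, `iter_ubam(gzip.open("reads.bam", "rb"))` would stream a whole file (`reads.bam` is a placeholder name). python-isal's `isal.igzip.open` should work as a faster drop-in replacement for `gzip.open`.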

This will add uBAM support to cutadapt. uBAM is very annoying anyway, as minimap2 won't accept it. It would be nice if cutadapt could take care of the conversion while also trimming away any nanopore helper sequences.

marcelm commented 1 year ago

uBAM support would be great. I haven’t really worked with data in that format, but I would expect it to be much nicer to work with than FASTQ. Only having to deal with one file for paired-end data sounds great, for example (but that is irrelevant for Nanopore, of course).

It seems that you’re mainly interested in reading uBAMs, is that correct?

I agree this should be quite straightforward to implement. I wrote a (partial) BAM parser in pure Python a couple of years ago and remember it was really not hard.

rhpvorderman commented 1 year ago

Yes. Read-only. The real problems in supporting BAM lie in:

Supporting tags, which come in various data formats and therefore require a lot of switch and branching statements in the code.
CIGAR strings can contain more than 65535 operations, which the 16-bit operation count in a BAM record cannot represent. This is annoying for both reading and writing.

Luckily, focusing on read-only unaligned BAM avoids both problems. Since dnaio is essentially a FASTQ library, I think supporting only unaligned BAM is warranted. In theory we could also add write support by writing a uBAM record (also quite easy), but I don't know of any downstream aligners that accept uBAM, so that would not be very useful.
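
To illustrate the branching that tag support requires, here is a rough sketch of a parser for the auxiliary fields at the end of a BAM record. The function name `parse_tags` and the `NUMERIC_TYPES` table are invented for this example; the type codes themselves come from the SAM/BAM specification.

```python
import struct
from typing import Any, Dict

# struct format and byte width for the fixed-size numeric tag types.
NUMERIC_TYPES = {
    "c": ("<b", 1), "C": ("<B", 1),
    "s": ("<h", 2), "S": ("<H", 2),
    "i": ("<i", 4), "I": ("<I", 4),
    "f": ("<f", 4),
}


def parse_tags(data: bytes) -> Dict[str, Any]:
    """Parse the auxiliary-field block at the end of a BAM record."""
    tags: Dict[str, Any] = {}
    pos = 0
    while pos < len(data):
        tag = data[pos:pos + 2].decode("ascii")
        type_code = chr(data[pos + 2])
        pos += 3
        if type_code == "A":  # single printable character
            value = chr(data[pos])
            pos += 1
        elif type_code in NUMERIC_TYPES:  # fixed-width integer or float
            fmt, width = NUMERIC_TYPES[type_code]
            (value,) = struct.unpack_from(fmt, data, pos)
            pos += width
        elif type_code in ("Z", "H"):  # NUL-terminated string (H: hex digits)
            end = data.index(b"\x00", pos)
            value = data[pos:end].decode("ascii")
            pos = end + 1
        elif type_code == "B":  # typed array: subtype, count, then values
            subtype = chr(data[pos])
            (count,) = struct.unpack_from("<I", data, pos + 1)
            fmt, width = NUMERIC_TYPES[subtype]
            value = list(struct.unpack_from(f"<{count}{fmt[1]}", data, pos + 5))
            pos += 5 + count * width
        else:
            raise ValueError(f"unknown tag type {type_code!r}")
        tags[tag] = value
    return tags
```

Even this compact version needs four branches plus a lookup table, and a C implementation would have to spell out each numeric type separately.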

rhpvorderman commented 1 year ago

I just heard from someone at the LGTC (Leiden Genome Technology Center) that the default nanopore caller that they use, dorado, always outputs in uBAM. So inclusion here does make sense. Minimap2 does not support uBAM, so I am planning to support a flow where I use cutadapt to do the necessary adapter trimming, size and quality selection and convert it into FASTQ in one go.

bede commented 1 year ago

This would be a really valuable addition

bede commented 1 year ago

Ah, looks to have been added but still unreleased, cool!

marcelm commented 1 year ago

@rhpvorderman Would you be fine with me preparing a 1.1.0 release with uBAM support?

rhpvorderman commented 1 year ago

My plan is to add tag support later and to add all tags to the comment part. That will be a behavioral change. I don't think it qualifies as breaking, so I am fine with it.

marcelm commented 1 year ago

Release done.

rhpvorderman commented 1 year ago

I have been thinking about tag support. The idea was to keep information from the sequencer available for tools that post-process the files after cutadapt.

In this particular context, I have been thinking about sequali. But I realised that the information that is normally contained in the metadata is no longer relevant after various quality improvement steps have been applied.

So I think it is better to take the sequali approach and only parse specific tag fields as needed by downstream programs. If a tag carries information that is normally in FASTQ headers and that cutadapt can use, it should be parsed; if not, the tag should be left as is.

Currently I have no specific fields that I can think of that need parsing. At least not with the kind of sequencing data I have to process regularly.

marcelm commented 1 year ago

What exactly do you mean by tag support? It sounds as if you were thinking about pre-defined tags that dnaio would interpret.

Cutadapt only has the --discard-casava filter that discards reads with :Y: in the header. TBH, I’d be fine if this specific filter were just not supported for uBAM input.
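
For context, a filter of this kind only needs the comment part of a CASAVA 1.8-style FASTQ header, which is why it has no obvious counterpart for uBAM input. The following is a rough sketch and not Cutadapt's actual implementation; the function name `is_casava_filtered` is made up here.

```python
def is_casava_filtered(header: str) -> bool:
    """Return True if a CASAVA 1.8-style header marks the read as filtered."""
    # The comment follows the first space and looks like
    # "<read>:<is filtered>:<control number>:<index>".
    _, _, comment = header.partition(" ")
    fields = comment.split(":")
    return len(fields) >= 2 and fields[1] == "Y"
```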

So how would one access the BAM tags? Perhaps we need a BamRecord that derives from SequenceRecord and that has a way to access the tags (maybe a tags attribute that works like dict but parses tags on demand).
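
A rough sketch of that idea, to make it concrete. Everything here is hypothetical: `SequenceRecord` is a plain stand-in so the example runs without dnaio, and `BamRecord`, `raw_tags` and `tag_parser` are names invented for illustration. The tag parser is passed in as a callable so that the example stays independent of any particular parsing code.

```python
from typing import Any, Callable, Dict, Optional


class SequenceRecord:
    """Stand-in for dnaio.SequenceRecord so the sketch runs without dnaio."""

    def __init__(self, name: str, sequence: str, qualities: Optional[str] = None):
        self.name = name
        self.sequence = sequence
        self.qualities = qualities


class BamRecord(SequenceRecord):
    """Hypothetical record that keeps raw tag bytes and parses them lazily."""

    def __init__(
        self,
        name: str,
        sequence: str,
        qualities: Optional[str],
        raw_tags: bytes,
        tag_parser: Callable[[bytes], Dict[str, Any]],
    ):
        super().__init__(name, sequence, qualities)
        self._raw_tags = raw_tags
        self._tag_parser = tag_parser
        self._tags: Optional[Dict[str, Any]] = None

    @property
    def tags(self) -> Dict[str, Any]:
        if self._tags is None:  # parse on first access only
            self._tags = self._tag_parser(self._raw_tags)
        return self._tags
```

Records whose `tags` attribute is never touched pay only for storing the raw bytes, so code that just needs name, sequence and qualities is not slowed down by tag parsing.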

rhpvorderman commented 1 year ago

> So how would one access the BAM tags?

Just copy them in the comment part, samtools fastq style. But only for tags that are worth it, otherwise the "name" string will be much bigger than necessary. Currently I do not see tags for which this is necessary.
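
As a rough sketch of what that could look like: as far as I understand it, `samtools fastq -T` appends the selected tags to the header in SAM text form (`TAG:TYPE:VALUE`), separated by tabs. The helper names below are invented, the tags are assumed to be already parsed into a dict, and array-valued (`B`) tags are left out for brevity.

```python
from typing import Any, Iterable, Mapping


def sam_tag(tag: str, value: Any) -> str:
    """Format one tag as SAM text, e.g. 'NM:i:5' or 'RG:Z:grp1'."""
    if isinstance(value, float):
        type_code = "f"
    elif isinstance(value, int):
        type_code = "i"  # SAM text uses 'i' for every integer width
    else:
        type_code = "Z"
    return f"{tag}:{type_code}:{value}"


def header_with_tags(name: str, tags: Mapping[str, Any], keep: Iterable[str]) -> str:
    """Append only the whitelisted tags, so the header stays short."""
    parts = [name] + [sam_tag(t, tags[t]) for t in keep if t in tags]
    return "\t".join(parts)
```

The `keep` whitelist reflects the point above: only tags that are worth it get copied, and everything else is dropped.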

marcelm / dnaio

Add ubam support #109