marcelm / dnaio

Efficiently read and write sequencing data from Python
https://dnaio.readthedocs.io/
MIT License

Add ubam support #109

Closed rhpvorderman closed 11 months ago

rhpvorderman commented 1 year ago

Nanopore reads can be delivered in uBAM format. While a full-fledged BAM processor is a fool's errand (I have been there...), it is actually quite straightforward to just parse the name, sequence and qualities from a uBAM file.

This will add uBAM support to cutadapt. uBAM is very annoying anyway as minimap2 won't accept it. It will be nice if cutadapt can take care of the conversion while also trimming away any nanopore helper sequences.
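To illustrate why reading only the name, sequence and qualities is straightforward, here is a minimal pure-Python sketch based on the record layout in the SAM/BAM specification. It is not dnaio's implementation; the function names `read_ubam` and `parse_record` are made up for this example, and it relies on BGZF being a series of gzip members so that the standard `gzip` module can decompress it.

```python
import gzip
import struct

# 4-bit nucleotide codes as defined in the SAM/BAM specification.
SEQ_CODES = "=ACMGRSVTWYHKDBN"


def parse_record(block):
    """Extract (name, sequence, qualities) from one BAM record body."""
    # Fixed-size part of the record: 32 bytes following block_size.
    (_ref, _pos, l_name, _mapq, _bin, n_cigar, _flag, l_seq,
     _next_ref, _next_pos, _tlen) = struct.unpack_from("<iiBBHHHIiii", block)
    offset = 32
    # The read name is NUL-terminated; l_name includes the terminator.
    name = block[offset:offset + l_name - 1].decode("ascii")
    # Skip the name and the CIGAR (normally empty for unaligned reads).
    offset += l_name + 4 * n_cigar
    n_packed = (l_seq + 1) // 2
    packed = block[offset:offset + n_packed]
    # Two bases per byte, high nibble first; trim the padding nibble.
    sequence = "".join(
        SEQ_CODES[b >> 4] + SEQ_CODES[b & 0x0F] for b in packed
    )[:l_seq]
    offset += n_packed
    phred = block[offset:offset + l_seq]
    if phred[:1] == b"\xff":
        qualities = None  # 0xFF marks missing qualities
    else:
        # Convert raw Phred scores to the FASTQ +33 ASCII encoding.
        qualities = bytes(q + 33 for q in phred).decode("ascii")
    # Everything after the qualities is tag data, ignored here.
    return name, sequence, qualities


def read_ubam(fileobj):
    """Yield (name, sequence, qualities) tuples from a binary BAM stream."""
    with gzip.open(fileobj, "rb") as f:
        if f.read(4) != b"BAM\x01":
            raise ValueError("not a BAM file")
        (l_text,) = struct.unpack("<i", f.read(4))
        f.read(l_text)  # skip the SAM header text
        (n_ref,) = struct.unpack("<i", f.read(4))
        for _ in range(n_ref):  # a uBAM normally lists no references
            (l_ref_name,) = struct.unpack("<i", f.read(4))
            f.read(l_ref_name + 4)  # name plus int32 reference length
        while True:
            size_bytes = f.read(4)
            if not size_bytes:
                return
            (block_size,) = struct.unpack("<I", size_bytes)
            yield parse_record(f.read(block_size))
```

The sketch does no validation of truncated files and skips all tags, which is exactly the part that makes a general BAM processor more work.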

marcelm commented 1 year ago

uBAM support would be great. I haven’t really worked with data in that format, but I would expect it to be much nicer to work with than FASTQ. Only having to deal with one file for paired-end data sounds great, for example (but irrelevant for Nanopore of course).

It seems that you’re mainly interested in reading uBAMs, is that correct?

I agree this should be quite straightforward to implement. I wrote a (partial) BAM parser in pure Python a couple of years ago and remember it was really not hard.

rhpvorderman commented 1 year ago

Yes. Read-only. The real problem in supporting BAM lies in:

rhpvorderman commented 1 year ago

I just heard from someone at the LGTC (Leiden Genome Technology Center) that the default nanopore caller that they use, dorado, always outputs in uBAM. So inclusion here does make sense. Minimap2 does not support uBAM, so I am planning to support a flow where I use cutadapt to do the necessary adapter trimming, size and quality selection and convert it into FASTQ in one go.
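The "select and convert in one go" flow can be sketched in a few lines. This is only an illustration of the idea, not how cutadapt implements its filters: `mean_error_rate`, `to_fastq` and the threshold values are hypothetical, and the input is assumed to be `(name, sequence, qualities)` tuples with qualities already in FASTQ +33 encoding.

```python
def mean_error_rate(qualities, base=33):
    """Average per-base error probability from an ASCII quality string."""
    if not qualities:
        return 0.0
    probs = [10 ** (-(ord(c) - base) / 10) for c in qualities]
    return sum(probs) / len(probs)


def to_fastq(records, min_length=1000, max_error_rate=0.1):
    """Yield FASTQ-formatted strings for reads that pass both filters.

    The default thresholds are placeholders, not recommended values.
    """
    for name, sequence, qualities in records:
        if len(sequence) < min_length:
            continue  # size selection
        if mean_error_rate(qualities) > max_error_rate:
            continue  # quality selection
        yield f"@{name}\n{sequence}\n+\n{qualities}\n"
```

Because the function is a generator, reads stream straight from the uBAM reader to the FASTQ output without an intermediate file.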

bede commented 1 year ago

This would be a really valuable addition

bede commented 1 year ago

Ah, looks to have been added but still unreleased, cool!

marcelm commented 1 year ago

@rhpvorderman Would you be fine with me preparing a 1.1.0 release with uBAM support?

rhpvorderman commented 1 year ago

My plan is to add tag support later and to add all tags to the comment part. That will be a behavioral change. I don't think it qualifies as breaking, so I am fine with it.

marcelm commented 1 year ago

Release done.

rhpvorderman commented 1 year ago

I have been thinking about tag support. The idea was that information from the sequencer is kept so it is available when post-processing the files after cutadapt.

In this particular context, I have been thinking about sequali. But I realised that the information that is normally contained in the metadata is no longer relevant after various quality improvement steps have been applied.

So I think it is better to take the sequali approach and only parse specific tag fields as needed by upstream programs. If there is any information normally found in FASTQ headers that cutadapt can use, it should be parsed; if not, the tag should be left as is.

Currently I have no specific fields that I can think of that need parsing. At least not with the kind of sequencing data I have to process regularly.

marcelm commented 1 year ago

What exactly do you mean by tag support? It sounds as if you were thinking about pre-defined tags that dnaio would interpret.

Cutadapt only has the --discard-casava filter that discards reads with :Y: in the header. TBH, I’d be fine if this specific filter were just not supported for uBAM input.

So how would one access the BAM tags? Perhaps we need a BamRecord that derives from SequenceRecord and that has a way to access the tags (maybe a tags attribute that works like dict but parses tags on demand).
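A rough sketch of what such an on-demand `tags` attribute could look like follows. Everything here is hypothetical: `LazyTags`, `parse_tags` and this `BamRecord` are illustration names, and `BamRecord` is a plain stand-in class rather than a subclass of dnaio's `SequenceRecord`. The tag decoding follows the binary layout in the SAM/BAM specification.

```python
import struct
from collections.abc import Mapping

# struct formats for the fixed-width BAM tag value types.
_FIXED = {"c": "<b", "C": "<B", "s": "<h", "S": "<H",
          "i": "<i", "I": "<I", "f": "<f"}


def parse_tags(raw):
    """Decode the raw tag bytes of one BAM record into a dict."""
    tags = {}
    pos = 0
    while pos < len(raw):
        key = raw[pos:pos + 2].decode("ascii")
        typ = chr(raw[pos + 2])
        pos += 3
        if typ == "A":  # single printable character
            value = chr(raw[pos])
            pos += 1
        elif typ in _FIXED:  # fixed-width integer or float
            fmt = _FIXED[typ]
            (value,) = struct.unpack_from(fmt, raw, pos)
            pos += struct.calcsize(fmt)
        elif typ in ("Z", "H"):  # NUL-terminated string or hex string
            end = raw.index(b"\0", pos)
            value = raw[pos:end].decode("ascii")
            pos = end + 1
        elif typ == "B":  # typed array: subtype, int32 count, values
            subtype = chr(raw[pos])
            (count,) = struct.unpack_from("<i", raw, pos + 1)
            fmt = f"<{count}{_FIXED[subtype][1]}"
            value = list(struct.unpack_from(fmt, raw, pos + 5))
            pos += 5 + struct.calcsize(fmt)
        else:
            raise ValueError(f"unknown tag type {typ!r}")
        tags[key] = value
    return tags


class LazyTags(Mapping):
    """Read-only dict-like view that parses tag bytes on first access."""

    def __init__(self, raw):
        self._raw = raw
        self._parsed = None

    def _tags(self):
        if self._parsed is None:
            self._parsed = parse_tags(self._raw)
        return self._parsed

    def __getitem__(self, key):
        return self._tags()[key]

    def __iter__(self):
        return iter(self._tags())

    def __len__(self):
        return len(self._tags())


class BamRecord:
    """Stand-in record; a real one would derive from SequenceRecord."""

    def __init__(self, name, sequence, qualities=None, raw_tags=b""):
        self.name = name
        self.sequence = sequence
        self.qualities = qualities
        self.tags = LazyTags(raw_tags)
```

With this design, code that never touches `record.tags` pays only for storing the raw bytes, which keeps the common name/sequence/qualities path cheap.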

rhpvorderman commented 1 year ago

> So how would one access the BAM tags?

Just copy them in the comment part, samtools fastq style. But only for tags that are worth it, otherwise the "name" string will be much bigger than necessary. Currently I do not see tags for which this is necessary.
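As a sketch of that approach: `samtools fastq -T` copies the listed tags to the FASTQ header line, and to my understanding it writes them as tab-separated `TAG:TYPE:VALUE` fields. The helper names below (`sam_tag`, `header_with_tags`) are hypothetical, the type letter is inferred from the Python value rather than taken from the BAM record, and only an explicit allow-list of tags is copied so the header does not grow unnecessarily.

```python
def sam_tag(key, value):
    """Format one tag as SAM text; the type is inferred from the value."""
    if isinstance(value, int):
        return f"{key}:i:{value}"
    if isinstance(value, float):
        return f"{key}:f:{value}"
    if isinstance(value, str):
        return f"{key}:Z:{value}"
    raise TypeError(f"unsupported tag value for {key}: {value!r}")


def header_with_tags(name, tags, keep):
    """Append the tags listed in `keep` (if present) to the read name."""
    fields = [sam_tag(key, tags[key]) for key in keep if key in tags]
    return "\t".join([name, *fields])
```

With an empty `keep` list the name is returned unchanged, which matches the current situation where no tag is considered worth copying.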