Closed rhpvorderman closed 11 months ago
UBAM support would be great. I haven’t really worked with data in that format, but I would expect it to be much nicer to work with than FASTQ. Only having to deal with one file for paired-end data sounds great, for example (but irrelevant for Nanopore of course).
It seems that you’re mainly interested in reading uBAMs, is that correct?
I agree this should be quite straightforward to implement. I wrote a (partial) BAM parser in pure Python a couple of years ago and remember it was really not hard.
Yes, read-only. The real problem in supporting BAM lies in:
I just heard from someone at the LGTC (Leiden Genome Technology Center) that dorado, the default Nanopore basecaller they use, always outputs uBAM. So inclusion here does make sense. Minimap2 does not support uBAM, so I am planning to support a flow where I use cutadapt to do the necessary adapter trimming and size and quality selection, and to convert the result into FASTQ in one go.
This would be a really valuable addition
Ah, looks to have been added but still unreleased, cool!
@rhpvorderman Would you be fine with me preparing a 1.1.0 release with uBAM support?
My plan is to add tag support later and to add all tags to the comment part. That will be a behavioral change. I don't think it qualifies as breaking, so I am fine with it.
Release done.
I have been thinking about tag support. The idea was that information from the sequencer is kept for post-processing steps that run after cutadapt.
In this particular context, I have been thinking about sequali. But I realised that the information that is normally contained in the metadata is no longer relevant after various quality improvement steps have been applied.
So I think it is better to take the sequali approach and only parse specific tag fields as needed by upstream programs. If there is any information normally found in FASTQ headers that cutadapt can use, it should be parsed; if not, the tag should be left as is.
Currently I have no specific fields that I can think of that need parsing. At least not with the kind of sequencing data I have to process regularly.
What exactly do you mean by tag support? It sounds as if you were thinking about pre-defined tags that dnaio would interpret.
Cutadapt only has the --discard-casava filter that discards reads with :Y: in the header. TBH, I’d be fine if this specific filter were just not supported for uBAM input.
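For context, the check behind such a filter can be very small. This is only a sketch, not cutadapt's actual code: it assumes the Illumina/CASAVA header layout in which the comment after the first space looks like `<read>:<is filtered>:<control number>:<index>`, and the function name is made up for illustration.

```python
def is_casava_filtered(header: str) -> bool:
    """Return True if the header marks the read as having failed the filter.

    Assumes a CASAVA-style header such as
    "M01:1:FC:1:1101:1:2 1:Y:0:ACGT", where the comment part after the
    first space carries a Y (filtered) or N (not filtered) flag.
    """
    # Split off the comment; it is empty when the header has no space.
    _, _, comment = header.partition(" ")
    # The flag sits directly after the single-character read number.
    return comment[1:4] == ":Y:"
```

A uBAM record has no such comment unless one is reconstructed from tags, which is why the filter has nothing to look at for that input.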
So how would one access the BAM tags? Perhaps we need a BamRecord that derives from SequenceRecord and that has a way to access the tags (maybe a tags attribute that works like dict but parses tags on demand).
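To make that idea concrete, here is a rough sketch of a record with a dict-like `tags` attribute that only decodes the raw tag bytes on first access. It is not dnaio's API: `SequenceRecord` below is a plain stand-in dataclass, and `BamRecord`, `LazyTags`, `raw_tags` and `parse_tags` are hypothetical names. The tag layout (two-character tag, one type character, then the value) follows the BAM specification.

```python
import struct
from collections.abc import Mapping
from dataclasses import dataclass
from functools import cached_property

# BAM type character -> struct format character for fixed-size values.
_FIXED = {"A": "c", "c": "b", "C": "B", "s": "h", "S": "H",
          "i": "i", "I": "I", "f": "f"}


def parse_tags(raw: bytes) -> dict:
    """Decode the auxiliary (tag) block of one BAM record into a dict."""
    tags = {}
    pos = 0
    while pos < len(raw):
        tag = raw[pos:pos + 2].decode("ascii")
        typ = chr(raw[pos + 2])
        pos += 3
        if typ in _FIXED:
            fmt = "<" + _FIXED[typ]
            (value,) = struct.unpack_from(fmt, raw, pos)
            pos += struct.calcsize(fmt)
            if typ == "A":
                value = value.decode("ascii")
        elif typ in ("Z", "H"):
            # NUL-terminated string (H is a hex string, kept as text here).
            end = raw.index(b"\x00", pos)
            value = raw[pos:end].decode("ascii")
            pos = end + 1
        elif typ == "B":
            # Array: subtype character, element count, then the elements.
            subtype = chr(raw[pos])
            (count,) = struct.unpack_from("<I", raw, pos + 1)
            fmt = f"<{count}{_FIXED[subtype]}"
            value = list(struct.unpack_from(fmt, raw, pos + 5))
            pos += 5 + struct.calcsize(fmt)
        else:
            raise ValueError(f"Unknown tag type {typ!r} for tag {tag}")
        tags[tag] = value
    return tags


class LazyTags(Mapping):
    """Read-only mapping that parses the raw tag bytes on first use."""

    def __init__(self, raw: bytes):
        self._raw = raw
        self._parsed = None

    def _tags(self) -> dict:
        if self._parsed is None:
            self._parsed = parse_tags(self._raw)
        return self._parsed

    def __getitem__(self, key):
        return self._tags()[key]

    def __iter__(self):
        return iter(self._tags())

    def __len__(self):
        return len(self._tags())


@dataclass
class SequenceRecord:
    """Stand-in for the real record class; only the three core fields."""
    name: str
    sequence: str
    qualities: str


@dataclass
class BamRecord(SequenceRecord):
    raw_tags: bytes = b""

    @cached_property
    def tags(self) -> LazyTags:
        return LazyTags(self.raw_tags)
```

With this shape, code that never touches `record.tags` pays nothing beyond keeping the raw bytes around.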
So how would one access the BAM tags?
Just copy them in the comment part, samtools fastq style. But only for tags that are worth it, otherwise the "name" string will be much bigger than necessary. Currently I do not see tags for which this is necessary.
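As an illustration of what that could look like, the sketch below renders a selected subset of already-decoded tags as tab-separated `TAG:TYPE:VALUE` fields, which is my understanding of how samtools fastq writes copied tags. The function name and the `keep` parameter are hypothetical, and only integer, float and string values are handled.

```python
def tags_to_comment(tags: dict, keep: list) -> str:
    """Format selected tags as SAM-style TAG:TYPE:VALUE fields.

    Tags listed in `keep` but absent from `tags` are skipped, so the
    header only grows for tags that are actually wanted and present.
    """
    parts = []
    for tag in keep:
        if tag not in tags:
            continue
        value = tags[tag]
        if isinstance(value, int):
            typ = "i"
        elif isinstance(value, float):
            typ = "f"
        else:
            typ = "Z"
        parts.append(f"{tag}:{typ}:{value}")
    return "\t".join(parts)


def fastq_header(name: str, tags: dict, keep: list) -> str:
    """Build a FASTQ header line, appending the comment only if non-empty."""
    comment = tags_to_comment(tags, keep)
    return f"@{name}\t{comment}" if comment else f"@{name}"
```

Passing an empty `keep` list reproduces the current behaviour of writing just the read name.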
Nanopore reads can be delivered in uBAM format. While a full-fledged BAM processor is a fool's errand (I have been there...) it is actually quite straightforward to just parse the name, sequence and qualities from a uBAM file.
This will add uBAM support to cutadapt. uBAM is very annoying anyway as minimap2 won't accept it. It will be nice if cutadapt can take care of the conversion while also trimming away any Nanopore helper sequences.
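To show how little is needed for the name, sequence and qualities, here is a minimal pure-Python sketch based on the record layout in the BAM specification. It is not the parser that was added to dnaio; the function names are made up, tags are ignored, and error handling for truncated files is left out. It relies on BGZF being a series of ordinary gzip members, so the standard `gzip` module can decompress it.

```python
import gzip
import struct

# 4-bit base codes as defined by the BAM specification.
SEQ_CODES = "=ACMGRSVTWYHKDBN"
# Fixed 32-byte part of a record that follows the block_size field.
FIXED = struct.Struct("<iiBBHHHIiii")


def read_ubam(stream):
    """Yield (name, sequence, qualities) from a decompressed BAM stream."""
    if stream.read(4) != b"BAM\x01":
        raise ValueError("Not a BAM file")
    (l_text,) = struct.unpack("<i", stream.read(4))
    stream.read(l_text)  # skip the SAM header text
    (n_ref,) = struct.unpack("<i", stream.read(4))
    for _ in range(n_ref):  # skip reference names and lengths
        (l_name,) = struct.unpack("<i", stream.read(4))
        stream.read(l_name + 4)
    while True:
        size_bytes = stream.read(4)
        if not size_bytes:
            break
        (block_size,) = struct.unpack("<i", size_bytes)
        block = stream.read(block_size)
        fields = FIXED.unpack_from(block)
        l_read_name, n_cigar, l_seq = fields[2], fields[5], fields[7]
        pos = FIXED.size
        # The stored name includes a trailing NUL byte.
        name = block[pos:pos + l_read_name - 1].decode("ascii")
        pos += l_read_name + 4 * n_cigar  # unaligned reads have no CIGAR
        n_packed = (l_seq + 1) // 2
        sequence = "".join(
            SEQ_CODES[b >> 4] + SEQ_CODES[b & 0x0F]
            for b in block[pos:pos + n_packed]
        )[:l_seq]
        pos += n_packed
        raw_quals = block[pos:pos + l_seq]
        if raw_quals and raw_quals[0] == 0xFF:
            qualities = None  # 0xFF marks missing qualities
        else:
            # BAM stores raw Phred values; FASTQ wants them offset by 33.
            qualities = bytes(q + 33 for q in raw_quals).decode("ascii")
        yield name, sequence, qualities


def read_ubam_path(path):
    """Convenience wrapper: decompress a BGZF-compressed file with gzip."""
    with gzip.open(path, "rb") as stream:
        yield from read_ubam(stream)
```

Everything after the qualities in each block is tag data, which this sketch simply skips.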