marcelm / dnaio

Efficiently read and write sequencing data from Python
https://dnaio.readthedocs.io/
MIT License

Extended BAM support #129

Open rhpvorderman opened 9 months ago

rhpvorderman commented 9 months ago

As discussed in https://github.com/marcelm/cutadapt/issues/757, adding BAM support to dnaio would have several advantages for cutadapt. The alternative is to defer to pysam instead and keep dnaio at its basic level.

For full BAM support, I think the following things need to be added:

Converting a SequenceRecord object to binary BAM bytes is something where I see no technical difficulties. A BAM header is also not very difficult, as it is just a SAM header with a few binary quirks.
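
To make that concrete, here is a rough Python sketch (not dnaio code; the function name is made up) of how a single unmapped, uBAM-style record could be packed according to the BAM specification:

import struct

# 4-bit base codes from the BAM specification ("=ACMGRSVTWYHKDBN")
NIBBLE = {base: code for code, base in enumerate("=ACMGRSVTWYHKDBN")}

def encode_unmapped_bam_record(name: str, sequence: str, qualities: str) -> bytes:
    """Pack one read as an unmapped BAM alignment record (no CIGAR, no tags)."""
    read_name = name.encode("ascii") + b"\x00"
    l_seq = len(sequence)
    nibbles = [NIBBLE.get(base, 15) for base in sequence.upper()]  # unknown bases -> N
    if l_seq % 2:
        nibbles.append(0)  # pad to a whole byte
    packed_seq = bytes(
        (nibbles[i] << 4) | nibbles[i + 1] for i in range(0, l_seq, 2)
    )
    quals = bytes(q - 33 for q in qualities.encode("ascii"))  # raw phred, not ASCII+33
    fixed = struct.pack(
        "<iiBBHHHIiii",
        -1,              # refID: unmapped
        -1,              # pos
        len(read_name),  # l_read_name, including the trailing NUL
        0,               # mapq
        4680,            # bin = reg2bin(-1, 0) for unmapped reads, per the spec
        0,               # n_cigar_op
        4,               # flag: segment unmapped
        l_seq,
        -1,              # next_refID
        -1,              # next_pos
        0,               # tlen
    )
    body = fixed + read_name + packed_seq + quals
    return struct.pack("<I", len(body)) + body  # prefixed with block_size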

BGZF is going to be a slight hurdle as it cannot be handled appropriately by xopen, so direct library calls to zlib/libdeflate/isa-l are needed. Since only streamed writing is required, it is actually quite easy to write a RawWriter that creates the BGZF blocks and wrap that in an io.BufferedWriter that has a 65200-byte buffer size. That will automatically take care of the block sizing and will also be pretty fast.
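
As a minimal sketch of that idea (illustrative only, assuming plain zlib and that each chunk handed to write() stays below the BGZF block size limit, which the 65200-byte buffer guarantees for buffered writes; the class name is made up):

import io
import struct
import zlib

# The fixed 28-byte empty BGZF block that marks end-of-file
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)

class RawBgzipWriter(io.RawIOBase):
    """Turn every chunk handed to write() into exactly one BGZF block."""

    def __init__(self, fileobj, level=1):
        self._fileobj = fileobj
        self._level = level

    def writable(self) -> bool:
        return True

    def write(self, data) -> int:
        # Raw deflate (wbits=-15): BGZF supplies its own gzip-style wrapper.
        compressor = zlib.compressobj(self._level, zlib.DEFLATED, -15)
        compressed = compressor.compress(data) + compressor.flush()
        header = struct.pack(
            "<BBBBIBBHBBHH",
            0x1F, 0x8B, 8, 4,      # gzip magic, deflate, FEXTRA flag
            0,                     # MTIME
            0, 0xFF,               # XFL, OS (unknown)
            6,                     # XLEN
            66, 67, 2,             # 'B', 'C' subfield with 2 data bytes
            len(compressed) + 25,  # BSIZE: total block length minus one
        )
        footer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
        self._fileobj.write(header + compressed + footer)
        return len(data)

    def close(self) -> None:
        if not self.closed:
            self._fileobj.write(BGZF_EOF)
            self._fileobj.close()
        super().close()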

Tag support is really tough. There are several options here:

Then there is also paired-end support. It is possible to put some limits on what is supported; name-sorted files come to mind. Enabling full support requires supporting indexes, and that is also a lot of work. In that case pysam is a better option.

rhpvorderman commented 9 months ago

A lot of the problems with tags can be solved with an implementation using the following guidelines.

So an implementation with a tag array and a value array (containing typecode and value) seems efficient as well as easily convertible to and from the BAM format.

Removing entries is simple: just zero the entry, and make sure zeroed entries are not written. Adding entries is also simple: check whether there are zeroed entries (the tag array needs to be scanned anyway to see if the tag already exists) and overwrite one of them; if there are none, make the arrays a little bigger. Replacing tags is also easy: overwrite them in place.
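
Purely as an illustration of that bookkeeping, here is a toy Python model with parallel lists standing in for the C arrays (all names made up):

class TagStore:
    """Toy model of the proposed tag array + value array bookkeeping."""

    FREE = b"\x00\x00"  # a zeroed tag entry marks a reusable slot

    def __init__(self):
        self.tags = []    # two-character tag names
        self.values = []  # (typecode, value) pairs, parallel to self.tags

    def set(self, tag: bytes, typecode: bytes, value) -> None:
        free_slot = None
        for i, existing in enumerate(self.tags):
            if existing == tag:            # replacing: overwrite in place
                self.values[i] = (typecode, value)
                return
            if existing == self.FREE and free_slot is None:
                free_slot = i              # remember a zeroed slot for reuse
        if free_slot is not None:          # adding: reuse a zeroed slot ...
            self.tags[free_slot] = tag
            self.values[free_slot] = (typecode, value)
        else:                              # ... or make the arrays a little bigger
            self.tags.append(tag)
            self.values.append((typecode, value))

    def remove(self, tag: bytes) -> None:
        for i, existing in enumerate(self.tags):
            if existing == tag:
                self.tags[i] = self.FREE   # zeroed entries are not written out
                return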

This is not going to be trivial to implement, but at least it is an implementation that will work well and scale well (albeit using slightly more memory than the htslib implementation, which simply keeps the file format representation in memory).

In this format it is also easy to selectively not support manipulating the tag array for every type code. What I mean is that dnaio should of course be able to correctly parse all tag typecodes as well as write them back properly, but support for adding new tags can be limited. For example, the 'Z' typecode for strings will probably be the most-used one by cutadapt (if not the only one), so that can be enabled. The same goes for reading typecodes. Sure, it is not great that the implementation is not complete, but it allows for spreading the work over multiple releases and adding support as needed.

Sorry, I had to write all this down, otherwise I can't stop thinking about it and start working on my other tasks :-).

rhpvorderman commented 8 months ago

One of the most costly things in terms of performance is memory allocation, so it is a good thing to allocate as rarely as possible. Unfortunately there are some variable-length items in BAM tags, such as arrays and strings, and these cannot be handled without dynamic memory allocation.

However, with a bit of organization this can be remedied. For the value store I originally intended the following layout:

struct TagValue {
  uint8_t type[2];   /* BAM tag type code */
  uint8_t value[8];  /* inline value, or reinterpreted for 'B'/'Z' as described below */
};

Here value would be cast to a pointer to a dynamically allocated piece of memory in the case of 'B' or 'Z' type tags.

It is however much more efficient to allocate only a single piece of memory for all the 'B' and 'Z' tags. The value part of TagValue can then be interpreted as a uint32_t offset and a uint32_t length into that block. This comes with the slight inefficiency of having to resize the block when adding or manipulating values, but that is preferable to allocating lots of small pieces of memory, which can cause severe heap fragmentation.
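
In Python terms, the shared-store idea could be modelled roughly like this (illustrative only; in the real implementation the offset and length would live in the 8 value bytes of TagValue):

# One shared buffer holds all variable-length ('B'/'Z') payloads of a record
array_store = bytearray()

def store_payload(payload: bytes) -> tuple:
    """Append a payload and return the (offset, length) that a TagValue would record."""
    offset = len(array_store)
    array_store.extend(payload)
    return offset, len(payload)

# A 'Z' tag payload is the NUL-terminated string from the BAM record
offset, length = store_payload(b"barcode-ACGT\x00")
assert bytes(array_store[offset:offset + length]) == b"barcode-ACGT\x00"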

Another question is whether to store tags and values separately or together. Alternatively, the following struct could be used:

struct Tag {
  uint8_t tag[2];
  uint8_t type[2];
  uint8_t value[8];
};

That saves allocating one array, but at the cost of the fast check for whether a tag is present (scanning a small, contiguous tag array is cheaper than striding through interleaved tag/value structs). I prefer keeping the fast check.

So, implementation-wise, the following members would need to be added to the SequenceRecord struct:

uint32_t number_of_tags;      /* entries in tags / tag_values */
uint32_t array_store_size;    /* bytes in the shared 'B'/'Z' value store */
uint8_t *tags;                /* tag names, two bytes each */
struct TagValue *tag_values;  /* type codes plus inline or offset/length values */
uint8_t *array_store;         /* single allocation for variable-length values */

That adds 32 bytes (4 + 4 + 3 × 8 on a 64-bit platform) to each sequence record. In my opinion this is acceptable.

Then we need to think about how to get and set tags. I think the following signatures are appropriate:

get_tag(tag: str) -> Union[int, float, str, array.array]: ...
set_tag(tag: str, value: Union[int, float, str, array.array]) -> None: ...
get_tags() -> Dict[str, Union[int, float, str, array.array]]: ...

Yes, this mimics the pysam API; I think they made a pretty good design choice there. Setting multiple tags simultaneously or worrying about value_type seems to me to be beyond the scope of dnaio. No, it does not use numpy arrays: numpy arrays are only efficient for quite large arrays, as they have tremendous per-array overhead, and they would also introduce a massive dependency to the project. For arrays I am quite conflicted on whether to use memoryview (a type that gives the correct values when iterated over!) or array.array. In Cython, using array.array is easier than from native C.
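
A purely hypothetical usage example (set_tag/get_tag do not exist yet; only the SequenceRecord constructor is current dnaio API):

import dnaio

record = dnaio.SequenceRecord("read1", "ACGTACGT", "IIIIIIII")
record.set_tag("RG", "sample1")  # 'Z' (string) tag, the case cutadapt needs most
record.set_tag("NM", 2)          # integer tag
print(record.get_tag("RG"))      # "sample1"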

I guess a good solution for now is to not support get_tags and to simply throw a NotImplementedError when get_tag encounters a 'B' tag type code.

peterjc commented 4 months ago

See https://github.com/biopython/biopython/blob/master/Bio/bgzf.py for a possibly useful BGZF implementation available under the 3-clause BSD license. I think that would let you vendor that file (bundle it as-is within the MIT-licensed dnaio), since adding a dependency on the whole of Biopython might not be tempting.

rhpvorderman commented 4 months ago

Thanks! Since only streaming is supported, some shortcuts can be taken. An io.BufferedWriter with a buffer size of 65200 can be used to write to a RawBgzipWriter that simply takes any writes, compresses them and encloses them in a BGZF header and footer. Reading BGZF is already supported, since it can be handled by normal gzip readers. This would grant the required amount of BGZF support with very little extra code.
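
Assuming a raw BGZF writer like the sketch earlier in this thread, the wiring would be roughly:

import io
import struct

raw = RawBgzipWriter(open("example.bam", "wb"), level=1)  # sketch from above
writer = io.BufferedWriter(raw, buffer_size=65200)
# Minimal BAM header: magic, empty SAM header text, zero reference sequences
writer.write(b"BAM\x01" + struct.pack("<II", 0, 0))
writer.close()  # flushes the buffer, appends the BGZF EOF block, closes the file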

I think dnaio can provide value over pysam by providing limited support for only those BAM features that are used in uBAM. That allows for a much cleaner interface whilst also being a lot faster. Full BAM and BGZF support, including indexing, would require lots of extra code and is not really of value for uBAM files (as they come straight from the sequencer). In use cases where it is needed, I think native htslib bindings such as pysam are a better option.

billytcl commented 3 weeks ago

Hi all, has there been any movement on this front? We would love for cutadapt to be compatible with bam files.

marcelm commented 3 weeks ago

dnaio already supports reading single-end BAM files (#109). This issue is about improving that support in a couple of ways. What would you need most?

billytcl commented 3 weeks ago

Whoops! I thought I was on the cutadapt page. I'm looking to search for specific adapter sequences in my aligned bam files.

marcelm commented 3 weeks ago

Cutadapt delegates reading and writing sequences to dnaio, so discussing this particular aspect is ok here.

That said, do you have a concrete idea of how searching for adapters in already aligned BAM files should work or is this more a general feature suggestion? I expect that any decent read mapper soft-clips adapter sequences, so I would usually not expect this to be so useful. Feel free to open an issue in the Cutadapt repository to discuss this aspect.

rhpvorderman commented 3 weeks ago

@billytcl This would pose technical challenges for dnaio, as paired-end reads are generally not mapped at the same position, which prohibits streaming. That means support for indexed reading would need to be added.

Currently my plan is to integrate as many BAM features as needed to fully support ONT uBAM files as created by dorado, and perhaps single-end libraries for aligned BAM. Integrating more features would simply mean creating an alternative version of pysam, which has two problems:

  1. pysam already exists, so this would be a mere duplication of effort.
  2. Getting to 100% BAM support is at least five times the effort of supporting only a limited set of features.

Trust me, I have tried. I broke off that project because getting CRAM support, support for indexed reads, etc. was so much effort that it would be better just to use htslib than to create a native Python library. But that project already exists: pysam.

Then there is the issue that trimming after alignment is just the wrong order of doing things. The adapters affect the alignment score and the placement of the reads, so you have badly aligned reads with adapters, and just trimming them still leaves badly aligned reads. In your case I recommend using samtools fastq to extract the reads, then cutting the adapters and then aligning them again.

rhpvorderman commented 1 week ago

Note to self: Cutadapt also delegates trimming to dnaio, as SequenceRecord objects can be sliced. I want tags to be sliced properly as well. The dorado implementation can be found here: https://github.com/nanoporetech/dorado/blob/c3a2952356e2506ef1de73b0c9e14784ab9a974a/dorado/demux/Trimmer.cpp#L127