Efficient sequence alignment path data structure

Discussed in https://github.com/scikit-bio/scikit-bio/discussions/1973

@mortonjt @wasade Will appreciate your thoughts!

^{Originally posted by **qiyunzhu** March 17, 2024} Here I am describing the plan to implement a multiple and pairwise sequence **alignment path** data structure in scikit-bio. In summary: - Central idea: [run-length encoding](https://en.wikipedia.org/wiki/Run-length_encoding) (RLE) and [bit array](https://en.wikipedia.org/wiki/Bit_array). - It is detached from the original sequences. - It is highly memory efficient, compared with tabular alignments. - It permits fully vectorized operations, therefore time efficient. - It can be efficiently converted to/from various formats, such as CIGAR (supported by many alignment libraries), tabular (original format in scikit-bio), indices (Biotite), coordinates (Biopython), etc. I guess there might be similiar implementations in the past, although I haven't come across them through limited literature search. Will appreciate it if anyone can inform me of the relevant literature. A Jupyter notebook implementing a prototype of this data structure and relevant functionality is attached. [align_path.zip](https://github.com/scikit-bio/scikit-bio/files/14628752/align_path.zip) Will appreciate any feedback from the team and community! --- For example, we have a multiple sequence alignment (MSA) consisting of three aligned DNA sequences. ``` seq1: CGGTCGTAACGCGTA---CA seq2: CAG--GTAAG-CATACCTCA seq3: CGGTCGTCAC-TGTACACTA ``` ### Suggested solution I suggest the following solution, which is essentially a combination of [**run-length encoding**](https://en.wikipedia.org/wiki/Run-length_encoding) (RLE) and [**bit array**](https://en.wikipedia.org/wiki/Bit_array). Through reasoning, comparison, and benchmarking, I concluded that it is the most suitable for the variety of target applications of scikit-bio. We divide the alignment position-wise into **segments** based on gap status. Each segment represents a consecutive block that has a consistent gap status in all sequences. In this example, the alignment can be divided into seven segments. ``` seq1: CGG TC GTAAC G CGTA --- CA seq2: CAG -- GTAAG - CATA CCT CA seq3: CGG TC GTCAC - TGTA CAC TA ``` Calculate the **lengths** of segments: ``` lens: 3 2 5 1 4 3 2 ``` And the **gap status** of segments (`1` - gap, `0` - character): ``` gap1: 0 0 0 0 0 1 0 gap2: 0 1 0 1 0 0 0 gap3: 0 0 0 1 0 0 0 ``` The gap status can be encoded into bits: ``` gaps: 0 2 0 6 0 1 0 ``` Therefore, we only need two equal-length vectors: lengths, and gap status (bits), to represent an alignment path. ``` [3 2 5 1 4 3 2] [0 2 0 6 0 1 0] ``` ### Data type The length vector can be a simple integer array. The gap status vector is tricky. One can only pack up to 64 bits into a `uint64` integer, which limits the number of sequences in the alignment. Therefore, one needs to use `np.packbits` to convert the 0-1 array into a 2D `uint8` array. When the gap status needs to be accessed, one needs to use `np.unpackbits` to extract 0-1 values from the 2D array. This method ensures efficient storage, but creates some overhead in packing/unpacking. ### Pairwise alignment [Pairwise alignment](https://en.wikipedia.org/wiki/Sequence_alignment#Pairwise_alignment) (i.e., aligning two sequences) is probably the more important use case than aligning three or more sequences. A notable application is sequence similarity search, which is essential in working with high-throughput sequencing data. I noticed that the majority of alignment libraries only support pairwise alignment. The suggested solution is the most efficient with pairwise alignments. It is consistent with [**CIGAR**](https://en.wikipedia.org/wiki/Sequence_alignment#CIGAR_Format), the most commonly used representation of pairwise alignment. According to the [SAM specification](https://samtools.github.io/hts-specs/SAMv1.pdf): ``` Int Code Meaning 0 M (mis)match (no gap) 1 I insertion (gap in query) 2 D deletion (gap in reference) ... ``` - Note: In pairwise alignment, seq1 and seq2 are usually referred to as query and reference (or target, subject). In the aforementioned example, if we only consider the first two sequences: ``` seq1: CGGTCGTAACGCGTA---CA seq2: CAG--GTAAG-CATACCTCA ``` The CIGAR string is: ``` 3M2D5M1D4M3I2M ``` The suggested solution is: ``` [3 2 5 1 4 3 2] [0 2 0 2 0 1 0] ``` Some libraries use a similar data structure. For example, in [pysam](https://pysam.readthedocs.io/), a pairwise alignment is stored as [`cigartuples`](https://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment.cigartuples), which a list of tuples of (length, status) (i.e., the transpose of the suggestion solution). It is beneficial to have a separate `PairAlignPath` class, which is a subclass of `AlignPath`. This subclass has additional methods (such as `from_cigar`, `to_cigar`), and has further optimized parsers (since it doesn't need to pack/unpack bits). This subclass will serve as the bridge with multiple pairwise alignment libraries, such as [Parasail](https://github.com/jeffdaily/parasail) and [pyWFA](https://github.com/kcleal/pywfa). ### Applications The suggested solution may facilitate some downstream applications. #### Alignment score Calculation of alignment scores is a basic operation. Many alignment libraries do not have the capability of doing this for a user-provided alignment. Therefore, having this functionality in scikit-bio will create a platform for evaluating and benchmarking alignment methods. The suggested solution is suitable for alignment score calculation. Most importantly, gap penalties can be conveniently calculated based on the lengths of gap segments, without even knowing the original sequences. Various gap penalty schemes (linear, affine, dual affine, convex...) can be calculated based on this information. Meanwhile, match/mismatch scores can be calculated by converting the alignment path into the indices of characters in the original sequences that are part of the aligned regions (i.e., ignoring unaligned segments). Although instraightforward (and less performant than directly operating on the tabular MSA), this can still be vectorized, therefore its performance is okay. ```python def align_score(path, seqs): gap_segs = path.states != 0 n_gap_opens = gap_segs.sum() n_gap_extends = (gap_segs * path.lengths).sum() idx = path.to_indices(gap="del") # see below seq1, seq2 = (x._bytes[i] for x, i in zip(seqs, idx)) n_matches = (seq1 == seq2).sum() n_mismatches = seq1.size - n_matches ... ``` In my tests, this function outperforms a Python for loop on aligned sequences when the sequences are long. #### Multiple alignment Multiple sequence alignment is computationally difficult (an exact solution is exponential to sequence count). Therefore, modern algorithms rely on various heuristics to accelerate this process. The most commonly used method, [progressive alignment](https://en.wikipedia.org/wiki/Multiple_sequence_alignment#Progressive_alignment_construction), is a greedy algorithm that iteratively merges smaller alignments into larger ones. The smaller alignments will remain unchanged after they were constructed. That is, "once a gap, always a gap". (Although modern algorithms will refine gaps after progressive alignments.) I envision (guess) that progressive alignment may benefit from the suggestion solution, as it encodes gaps in separation from the original sequences, making it convenient to add gaps on top of gaps. However, I have not yet tested this. ### Format conversions #### Tabular MSA scikit-bio's current data structure, `TabularMSA`, stores gapped sequences in an alignment. This is probably the most intuitive representation, despite being highly inefficient. ``` ['CGGTCGTAACGCGTA---CA', 'CAG--GTAAG-CATACCTCA', 'CGGTCGTCAC-TGTACACTA'] ``` The suggested solution can convert from/to a tabular structure using efficient vectorized operations (assuming `msa` is a 2D array of bytes). ```python def from_tabular(): bits = (msa == gap_char) idx = np.where(np.diff(bits, prepend=np.nan))[0] lens, gaps = np.diff(idx, append=msa.shape[1]), bits[idx] ``` ```python def to_tabular(): gaps = np.repeat(bits, lens, axis=1) msa = np.full(bits.shape, gap_char, dtype=np.uint8) msa[gaps == 0] = np.concatenate(seqs) ``` #### Indices [Biotite](https://www.biotite-python.org/)'s [`Alignment`](https://www.biotite-python.org/apidoc/biotite.sequence.align.Alignment.html) object contains **sequences**, **trace**, and **score**. This design also separate original sequences and the alignment path ("trace"). For the trace, the documentation says: > The trace is a (_m_ x _n_) ndarray with alignment length _m_ and sequence count _n_. Each element of the trace is the index in the corresponding sequence. A gap is represented by the value -1. In the aforementioned example, the trace (after transposing) is: ``` [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, -1, -1, -1, 15, 16], [0, 1, 2, -1, -1, 3, 4, 5, 6, 7, -1, 8, 9, 10, 11, 12, 13, 14, 15, 16], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, -1, 10, 11, 12, 13, 14, 15, 16, 17, 18]] ``` This data structure facilitates efficient vectorized operations, despite being less efficient in memory space (it consumes more space than the original sequences). It is convenient to convert between the suggested solution and the Biotite Alignment: ```python def to_indices(): pos = np.repeat(1 - bits, self.lengths, axis=1) idx = np.cumsum(pos, axis=1, dtype=int) - 1 indices = np.where(pos, idx, -1) ``` ```python def from_indices(): bits = indices == -1 ... ``` #### Coordinates [Biopython](https://biopython.org/)'s [`Alignment`](https://biopython.org/DIST/docs/tutorial/Tutorial.html#sec81) object consists of **sequences** and **coordinates**. The coordinates define the positions of segments in the original sequences. Specifically, each coordinate is the index of the first character in a segment, if this sequence contains this segment, or equal to the previous coordinate, if not. In the aforementioned example, the coordinates are: ``` [[0, 3, 5, 10, 11, 15, 15, 17], [0, 3, 3, 8, 8, 12, 15, 17], [0, 3, 5, 10, 10, 14, 17, 19]]) ``` This data structure also employs segments (the same as in the suggested solution). Therefore, it is also memory efficient (but less so than the suggested solution. The coordinates rather then lengths make it more convenient to locate individual segments, which is an advantage over the suggested solution. It is convenient to convert between the suggested solution and the BioPython Alignment: ```python def to_coordinates(): coords = np.append(np.zeros((nrow, 1)), lens * (1 - bits), axis=1).cumsum(axis=1) ``` ```python def from_coordinates(): gaps = (diff := np.diff(coords)) == 0 lens = diff[gaps.argmin(axis=0), np.arange(ncol)] ``` ### Limitation and alternatives The suggested solution is highly efficient when the number of sequences is small. However, its memory efficiency deleteriotes when there are many aligned sequences, because the segments (in each of which all sequences must have a unique gap status) will become highly fragmental. - Note: Even if all segments are 1, the memory consumption of the suggested solution (which is bits) is still at most **1/8** of the tabular MSA (which is bytes). #### Separate RLE In such scenarios, separate run-length encodings for individual sequences is a more memory efficient data structure. For example, the alignment: ``` seq1: CGGTCGTAACGCGTA---CA seq2: CAG--GTAAG-CATACCTCA seq3: CGGTCGTCAC-TGTACACTA ``` Can be represented as three vectors, in each of which numbers represent lengths of chars, gaps, chars, gaps, ...: ``` [15, 3, 2] [3, 2, 5, 1, 9] [10, 1, 9] ``` This data structure can be efficiently generated using the following code: ```python [np.diff(np.where(np.diff(x == gap_char, prepend=np.nan))[0], append=msa.shape[1]) for x in msa] ``` This data structure, despite memory efficient, may not support the efficient segment-based operations as discussed above. Therefore, it is only good for storage. Whether scikit-bio's target applications require storing large alignments in the memory is to be discussed. #### Tree-guided states If all sequences in an alignment can be related using a tree structure, then one can have one bit array at each node to represent the entire alignment path. This idea is aligned with the established algorithms for solving the multiple sequence alignment problem, particularly [progressive alignment](https://en.wikipedia.org/wiki/Multiple_sequence_alignment#Progressive_alignment_construction). These bit arrays are guaranteed to be not fragmental. Such a data structure may simultaneously suffice efficient memory and efficient operations (but with limitations). I haven't prototyped this data structure. Because it is too specific, I doubt if it is the desired data structure for scikit-bio. ### Summary The new alignment path class `AlignPath`, structured as run-length encoding (RLE) plus bit array, will permit time and memory efficiency in applications, and serve as a central hub for conversion between various data structures.

scikit-bio / scikit-bio

Efficient sequence alignment path data structure #1974

Discussed in https://github.com/scikit-bio/scikit-bio/discussions/1973