Repeatedly choosing back-to-back single base indels when a match is expected

rsharris commented 3 years ago

This is going to be a long post. I have included code details to replicate the problem, at the end.

I'm using sg_trace_striped_16 on two nearly identical strings of length ≈51Kbp. When I follow the cigar string to convert the alignment to the kind of pair of strings you'd see in e.g. UCSC MAF format, I notice a strange thing at the right end of the alignment. It has broken what ought to be a long match run by repeatedly inserting and deleting a single base. Below I show what part of that looks like. The three dots to the left represent many thousand bases that seem to be correctly aligned, with a few mismatches or indels where they should be. The three dots to the right represent many thousand bases that exhibit a similar pattern of delete-one-insert-one-match-8-or-so (I also see it at the end of the alignment, but I'm not sure if it exists throughout in between). Note that each deleted/inserted base is a match, not a mismatch, so there seems to be no mathematical reason to delete/insert.

reference: ...CTCATTTACCTTTCA-GTAAAATTG-TTAAATAAG-CAAAATAA-TTTGAGTT-AAAATTAGAAT-AAAAATTGT-CTTTTATTT-GGA...
query:     ...CTCATTTACCTTTCAG-TAAAATTGT-TAAATAAGC-AAAATAATT-TGAGTTAAA-ATTAGAATA-AAAATTGTC-TTTTATTTG-GA...

I've confirmed that the problem is not in the conversion of cigar to text. If I examine the cigar for this alignment, I see

[(2890,'='),(1,'I'),
 (12918,'='),(1,'X'),
 (5026,'='),(1,'X'),
 (2537,'='),(1,'X'),
 (17,'='),(1,'X'),
 (9395,'='),(1,'D'),(1,'I'),(8,'='),(1,'D'),(1,'I'),(8,'='),(1,'D'),(1,'I'),(7,'=') ...]

Moreover, if I remove that first 2890 bp match and single indel from my sequences and run again (i.e. remove 2891 from the reference and 2890 from the query), I get a cigar that agrees with the rest of that original cigar for a while, then diverges. And it, too, appears to eventually devolves into the delete-one-insert-one-match-8-or-so pattern. Mathematically, I would expect this alignment to be identical to the previous, except for the removal of those first two cigar ops. [(12918, '='), (1, 'X'), (5026, '='), (1, 'X'), (2537, '='), (1, 'X'), (17, '='), (1, 'X'), (11628, '='), (2, 'D'), (656, '='), (1, 'D'), (2, '='), (1, 'I'), (8, '='), (1, 'D'), (1, 'I'), (8, '='), (1, 'D'), (1, 'I') ...]

I'm using (or at least I think I am using) a simple scoring model rewarding 1 for a match, penalizing 3 for a mismatch, 4 for a gap open and 1 for a gap extend. Aligning these two sequences with the same (I hope) scoring model in lastz yields an alignment covering the entirety of the strings, with 5 mismatches and 5 gapped bases.

The two input sequences are available at https://psu.box.com/s/lhwm1zaag8qt0l0ypu1ksfh4ueus1z7x

Details:

import parasail
subsMatrix = parasail.matrix_create("ACGT",1,-3)
refDB = parasail.sequences_from_file("example.chm13.fa")
qryDB = parasail.sequences_from_file("example.hg38.fa")
# extract the nt sequence from ref and qry here
ref = refDB[0]
qry = qryDB[0]
a = parasail.sg_trace_striped_16(ref.seq,qry.seq,4,1,subsMatrix)
aCigar = list(map(lambda v:(a.cigar.decode_len(v),a.cigar.decode_op(v).decode("utf-8")),a.cigar.seq))
len(aCigar) # 6669
aCigar[:20]  # this is the first cigar list I show above

a2 = parasail.sg_trace_striped_16(refNucs[2891:], qryNucs[2890:],4,1,subsMatrix)
s2Cigar = list(map(lambda v:(a2.cigar.decode_len(v),a2.cigar.decode_op(v).decode("utf-8")),a2.cigar.seq))
len(a2Cigar) # 5628 I'd really expect it to just be two shorter than 6669
a2Cigar[:20]  # this is the second cigar list I show above

I also tried parasail.sg_trace_striped_32, thinking maybe some internal scoring was overflowing the 16-bit ints. But I got the same strange result. (I don't know if your implementation is using an overflow-protected recurrence like that in the Suzuki 2017 paper).

Not sure whether I am doing something wrong or not. Perhaps there's a limit to the size of strings allowed? That the pattern seems to repeat at 8 makes me wonder if it has to do with striping.

jeffdaily commented 3 years ago

I edited your issue comment, updating your test script to use the sequences_from_file. Yet another lack of proper documentation. Sorry. I'll make a note of it.

Not sure I mentioned this before, but my background is not bio, it's computer science. We might be facing the limit of my understanding, and the dangers of knowing just enough about something to be dangerous.

That said, the way that you decode the cigar in your sample code is new to me. I implemented a decode method in the C library that you can call from the python interface, but it produces a C string, not a list of tuples. Which form is better?

If you want to compare against the C library implementation of decoding the cigar, use result.cigar.decode. Again, poorly documented. It's a property of the Cigar class, though in hindsight it would make more sense had I treated it as a function since it is transforming the C cigar and returning a python string.

Besides being able to reproduce your observations from above, I'm not closer yet to understanding what's going on. Will keep you posted.

jeffdaily commented 3 years ago

I didn't pay close enough attention when I was editing your sample script. Did I swap your sequence order in the alignment? The first sequence should be the query, the second should be the target/reference. Not sure if this makes any impact for our problem. Still looking into it.

rsharris commented 3 years ago

Not sure I mentioned this before, but my background is not bio, it's computer science. We might be facing the limit of my understanding, and the dangers of knowing just enough about something to be dangerous.

My background is CS as well, though much of what I have done for the past 20 years is applying CS to bio problems. I originally was using sequence alignment in the early 1990s for pattern recognition, and didn't become involved with bio until about 2004.

I should note that my current interest is to try to use parasail as a component in a tool I'm working on. I have a higher-level process that identifies sequence pairs that are likely to be homologous. parasail looked like a good way to "render" those pairs to actual alignments. The python interface makes it easy to experiment with that. If the process turns out to be useful then I would probably rewrite in C and use the C interface directly.

I implemented a decode method in the C library that you can call from the python interface, but it produces a C string, not a list of tuples. Which form is better?

It depends on what you want to do with it. I had noticed result.cigar.decode in the readme, and tried it out in a python shell. That C string is what would typically be passed along to a downstream tool, because it's a compact representation yet (somewhat) human-readable (for example, this is what is in SAM/BAM files). To reconstruct a letter-by-letter alignment from the string one would have to parse the cigar string. Not difficult, but since what I needed were the length,operator pairs and I noticed that cigar.seq mentioned in the readme was a list of those encoded somehow, and because I discovered cigar.decode_len and cigar.decode_op by looking at what attribute names were supported for (some object that escapes me), I combined it all to get the form shown above.

I didn't pay close enough attention when I was editing your sample script. Did I swap your sequence order in the alignment? The first sequence should be the query, the second should be the target/reference. Not sure if this makes any impact for our problem. Still looking into it.

I don't think you swapped them — that's the order I was using. Which is which is just a naming convention, because the substitution matrix is symmetric. Where it matters in this example is the interpretation of I and D in the cigar string. I determined, experimentally, that I is extra letters in the first sequence, D is extras in the second.

In case anyone were to try using an asymmetric scoring matrix, it would be good to document which index in the matrix corresponds to which sequence.

As I look at the readme, the intended order is only apparent from the description of the function names, where it uses the terms "s1/query" and "s2/database".

To avoid confusion down the road, I will change the order in my stuff.

I edited your issue comment, updating your test script to use the sequences_from_file.

Thanks. I can see how that works, now. I fiddled with the object from sequences_from_file for maybe a half hour before I gave up. I never hit on the right combination. I gained a new appreciation for how difficult it is to figure out an object's interface just by snooping in a shell. Trying it now, I think I my mind conflated the parasail.bindings_v2.Sequences object with the parasail.bindings_v2.Sequence object.

jeffdaily commented 3 years ago

I have confirmed, again, the wrong behavior using the C library and a small test program that prints the traceback rather than the cigar string. The C code for creating the cigar and the traceback, though related, is different -- and the unusual alignment results are appearing in both cases.

It looks like things are going well, until they aren't. Here's a snippet:

ref        32561 ACTTTGTATAGCCCCACCTTAATTTTTGGTGTTGTTTTAAAATTTCATTTTAATAACATAATATTATAAGATAAGGTAAC   32640
                 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
query      32560 ACTTTGTATAGCCCCACCTTAATTTTTGGTGTTGTTTTAAAATTTCATTTTAATAACATAATATTATAAGATAAGGTAAC   32639

ref        32641 TTGGTACTAATTTCTGTTGTATGATCCATCTTAAGTTGCAGCGCTGGTTACTTTTTTGACTTTCGATGACGAACAGCTAT   32720
                 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
query      32640 TTGGTACTAATTTCTGTTGTATGATCCATCTTAAGTTGCAGCGCTGGTTACTTTTTTGACTTTCGATGACGAACAGCTAT   32719

ref        32721 TTGTACATAAGTTACCATAGCAATGTTAGGTAATTATAATCTGTCCTATTTATCTCATTTACCTTTCAG-TAAAATTGT-   32798
                 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||  ||||||||
query      32720 TTGTACATAAGTTACCATAGCAATGTTAGGTAATTATAATCTGTCCTATTTATCTCATTTACCTTTCA-GTAAAATTG-T   32797

ref        32799 TAAATAAGC-AAAATAATT-TGAGTTAAA-ATTAGAATA-AAAATTGTC-TTTTATTTG-GATTACATG-AATAATCTA-   32870
                 ||||||||  ||||||| | |||||| || ||||||||  ||||||||  ||||||||  ||||||||  ||||||||
query      32798 TAAATAAG-CAAAATAA-TTTGAGTT-AAAATTAGAAT-AAAAATTGT-CTTTTATTT-GGATTACAT-GAATAATCT-A   32869

I'm still digging.

jeffdaily commented 3 years ago

I've got a C test that seems to reproduce what's going on. It might actually be saturating during the computing of the dynamic programming table, but failing to report it. I tried both the prefix "scan" and "striped" algorithms, using both 16- and 32-bit intermediate ints. The 16-bit scan alg reports that the result saturated, but the striped alg does not. When I get what seems to be the correct result with the 32-bit algorithms, the final score is beyond the 16-int boundaries. The implementation uses signed integers, so the max score is 32767, but the reported score was 50985, beyond the signed 16-bit capacity.

It's a bug that the striped_16 algorithm did not report saturation. But I'm also guessing, based on the common refrain of my bad documentation, that you didn't know you should check the result for saturation.

The saturation check on the result in C is parasail_result_is_saturated(result). In Python it should be just the property Result.saturated. But again, that will only help if the algorithm is correctly capturing saturation. I've noted the bug and am looking into it.

rsharris commented 3 years ago

Cool. Thanks for digging into that.

It's a bug that the striped_16 algorithm did not report saturation.

I considered that possibility (at the bottom of my original post). I think I had the same problem with striped_32. But it was a long day — I know I ran striped_32 but I may have confused myself and looked again at the output from striped_16.

When I convinced myself that striped_32 had the same problem, I surmised that you might be using an overflow-safe version of the recurrence, as described in Suzuki, Hajime, and Masahiro Kasahara. "Introducing difference recurrence relations for faster semi-global alignment of long sequences." https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2014-8

But I'm also guessing, based on the common refrain of my bad documentation, that you didn't know you should check the result for saturation.

Right. Nor how to check (which you've shown above). Saturation isn't mentioned in the readme, and overflow is only mentioned in passing, describing a function variant that retries with a wider int if overflow is detected.

jeffdaily commented 3 years ago

OMG. sg striped doesn't even check for saturation. Huge mistake. I definitely need a hotfix for this.

As for the Python interface, do you think I should proactively check for saturation and raise a RuntimeError if the underlying C function saturates? Currently, both C and Python will return a non-NULL/None result and it's left to the caller to check. At least for Python, I could check on behalf of the caller and raise the exception. What do you think?

rsharris commented 3 years ago

If you raise an exception, OverflowError is more appropriate (I found a list at https://docs.python.org/3/library/exceptions.html).

Returning None might also be reasonable (instead of the exception), though the calling function might kinda wonder whether there were conditions other than overflow that might cause it. So I guess an exception is a better solution.

jeffdaily / parasail-python

Repeatedly choosing back-to-back single base indels when a match is expected #59