vgteam / gssw

efficient alignment of strings to partially ordered string graphs
33 stars 17 forks source link

Incorrect values in E matrix #8

Closed jeizenga closed 7 years ago

jeizenga commented 8 years ago

Under some circumstances, one of the matrices in the dynamic programming seems to have incorrect values. It can be reproduced with this invocation:

gssw_example CCCCCCCCCTCCCCCCCCCCTCCCCCCCCCCGACCCCCCCCCCC CCCCCCCCCCACCCCCCCCCCACCCCCCCCCCTCCCA CCCCCACCCCCCCCGTCCCCCCCCCCCA CCCCCCCCCCCCGCCCCCCCCCCGCCCCCCCCC CCCCCCCTCCCCCCCCCCTCCCCCCCCCCGACCCCCCCCCCCCCCCCCCCCCACCCCCCCCCCACCCCCCCCCCTCCCACCCCCCCCCCCCGCCCCCCCCCCGCCCCCCCCC

Look at row 91, row 102, and column 13 in the matrix for node 4 in the output. Row 91 is also largely incorrect in the matrices for nodes 2 and 3.

adamnovak commented 7 years ago

I'm going to attack this by implementing a non-SSW, non-swizzled matrix filler that should always get the right answer, and then comparing its answers to those we get from the more optimized code.

Ideally this will turn into a real unit test.

Then I'm going to bang on the optimized version until it can be induced to match the simple version's results.

adamnovak commented 7 years ago

OK, I've implemented a de-optimized simple software matrix filler, and (unfortunately), within the main H matrix, I don't see any difference between it and and the optimized SSE2 filler.

There are differences in the gap matrices (E and F), but I think those may be caused by some clever optimizations (i.e. the lazy F loop), because I haven't seen anything make it through into the main H matrix that describes the best overall alignment.

Since we're now looking for second-best alignments, maybe we can no longer make those same optimizations, because we need more of the E and F matrices to be filled in? In that case we'll probably have to rip out the lazy F loop thing altogether.

It's also possible that the bug in question is happening at the node boundaries, in which case I will have to implement a de-optimized version of or otherwise check that code as well.

It would help if you could elaborate on what exactly is incorrect about the specified matrix entries in this example. Is it just the E and/or F matrices that have wrong values? Or is there supposed to be an error in the H matrix?

jeizenga commented 7 years ago

Sure. Some of the E values are too low, but too low only by a few points. The incorrect values concentrate in specific rows/columns in the DP matrix (i.e. a long stretch of one row will all be incorrect). In every case, the correct DP from those cells is to open a gap from the H matrix. The fact that the scores are similar to the correct DP suggests to me that they probably did open a gap from the H matrix, but that the H value hadn't yet reached it's final value in the lazy F loop.

I will admit that my understanding of the lazy F loop is a little hazy, but I don't think it can be true that having incorrect E values does not threaten correctness of the algorithm for the Viterbi alignment. It completely invalidates the correctness proof if "some" of the DP values are incorrect. Give me some time with it and I'm pretty sure I could come up with a case where the primary alignment also fails the traceback.

That said, I suspect it will be possible to retain the lazy F loop. What the bug looks like to me is that there's a loop somewhere in the lazy F loop that (correctly) updates H from an F gap, but doesn't update the subsequent E values with the new H value. If I'm right, that's just a coding bug, not a fundamental problem with the algorithm.

adamnovak commented 7 years ago

I don't think that what you're saying can quite happen. The computation always happens in columns; the E matrix depends on the previous column, the F matrix depends on prior rows in the current column, and the H matrix depends on both. The lazy F loop fixes the intra-column dependencies.

If you're opening a gap in the read in the E matrix, you're coming from the previous column's H matrix, which really shouldn't be able to change after you read it. That column's lazy F loop is already finished before this column starts getting computed.

I'll see if I can enforce that the E matrix is always fully computed and matches between the two methods. I'll also go back and read the paper again and see if the lazy F loop is supposed to be leaving any F values uncomputed.