noporpoise / seq-align

Fast, portable C implementations of Needleman-Wunsch and Smith-Waterman sequence alignment
92 stars 40 forks source link

Incorrect results when reading sequences from files #9

Open threadless-screw opened 5 years ago

threadless-screw commented 5 years ago

I ran into some issues with the pairwise alignment of sequences read from different files. Although the exact issue varies from case to case, it seems that the first alignment produces the correct result, while further alignments generally do not. Here's an example:

If a first file "f1" contains the following (identical) sequences:

NEEDLE
NEEDLE
NEEDLE

and a second file "f2" contains the following (identical) sequences:

HAYSTACK NEEDLE HAYSTACK
HAYSTACK NEEDLE HAYSTACK
HAYSTACK NEEDLE HAYSTACK

then running the command "smith_waterman --files f1 f2" returns the output:

== Alignment 0 lengths (6, 24):

hit 0.0 score: 12
  NEEDLE  [pos: 0; len: 6]
  NEEDLE  [pos: 9; len: 6]

==
== Alignment 1 lengths (6, 24):

==
== Alignment 2 lengths (6, 24):

==

The results for Alignment 0 are correct, but the fact that results are absent for Alignments 1 and 2 seems off. Interestingly, the dynamic programming matrices (rendered as output with the --printmatrices option) for the three alignments are the same, as they should be. The problem is therefore likely to be found at a superficial level.

threadless-screw commented 5 years ago

I have unfortunately not gotten to the bottom of the issue. As far as I can tell, the re-use of the sw and result variables [see sw_cmdline.c] when iterating over the multiple sequences in the files is at the heart of it; it seems a previous iteration leaves its traces into the next. Practically, the problem can be overcome by wrapping each align() function call in its own initialization/destruction routines, like so:

    sw = smith_waterman_new();
    result = alignment_create(256);

    align( ....);

    smith_waterman_free(sw);
    alignment_free(result);

Since this somewhat wasteful solution is no more than a patch that leaves the underlying issue unresolved, I have not made a pull request.