samtools / htslib

C library for high-throughput sequencing data formats

probaln_glocal returns suboptimal alignment when bandwidth is too large #1605

Closed jts closed 1 year ago

jts commented 1 year ago

When the bandwidth parameter provided in probaln_par_t to probaln_glocal is greater than or equal to the length of the input sequence, a suboptimal alignment is returned.

Here's a minimal test case (replacing main in probaln.c):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
int main(int argc, char *argv[])
{

    uint8_t seq[] = { 0, 3, 2, 0, 3, 2, 1, 1, 3, 2, 3, 2, 2 };
    int l_seq = sizeof(seq);
    int* state = malloc(l_seq * sizeof(int));
    uint8_t* q = malloc(l_seq);
    probaln_par_t par = { 0.01, 0.1, l_seq /* - 1*/};
    int P = probaln_glocal(seq, l_seq, seq, l_seq, NULL, &par, state, q);
    printf("bw: %d P: %d\nstate: ", par.bw, P);
    for(int i = 0; i < l_seq; ++i) {
        printf("%02d ", state[i] >> 2);
    }
    printf("\ninsrt: ");
    for(int i = 0; i < l_seq; ++i) {
        printf("%02d ", state[i] & 3);
    }
    printf("\n");
    free(state);
    free(q);

    return 0;
}

Here I have passed the same 13bp sequence as both query and reference, so I would expect a trivial alignment to be found. Instead, the alignment has the last base inserted rather than matched, and the returned likelihood is quite low:

bw: 13 P: 29
state: 00 01 02 03 04 05 06 07 08 09 10 11 11 
insrt: 00 00 00 00 00 00 00 00 00 00 00 00 01 

If I change the bandwidth to l_seq - 1, the expected alignment is found:

bw: 12 P: 6
state: 00 01 02 03 04 05 06 07 08 09 10 11 12 
insrt: 00 00 00 00 00 00 00 00 00 00 00 00 00 
jkbonfield commented 1 year ago

I've started looking at this, so it's on our radar.

One problem is this code (and, similarly, the forward-direction code above it):

https://github.com/samtools/htslib/blob/develop/probaln.c#L271-L277

    for (k = 1; k <= l_ref; ++k) {
        int u;
        double *bi = &b[l_query*i_dim];
        set_u(u, bw, l_query, k);
        if (u < 3 || u >= i_dim - 3) continue;
        bi[u+0] = sM / s[l_query] / s[l_query+1]; bi[u+1] = sI / s[l_query] / s[l_query+1];
    }

As we're going from k up to and including l_ref, our dimension for the array needs to be l_ref+1. We've allocated at most to l_ref (or l_query, but same thing here). Hence the protection against overstepping the memory. Unfortunately, that also means we don't fill in those bi cells.

The code is clear as mud though, so it's not clear if this is meant to be protected by starting at index 1 instead of 0, or if we're just limiting i_dim to be 1 too few. Changing i_dim does fix it, but I don't know if it's the correct solution yet. Continuing my descent into madness. :-)

jkbonfield commented 1 year ago

Further investigation shows there are three such if (u < 3 || u >= i_dim - 3) continue statements.

Checking the allocations of the f and b arrays, I see it uses f = calloc((l_query+1)*i_dim, sizeof(double)), so it's already allocated +1 on the query length, meaning k <= l_ref isn't an issue... I think. I'm still unclear on ref vs query length here, but it looks like this is compensated for by if (bw < abs(l_ref - l_query)) bw = abs(l_ref - l_query);.

So given we've allocated +1, the u >= i_dim - 3 check is clearly not needed for memory-protection reasons, and equally u < 3 isn't either. I don't understand what the purpose of these checks is.

What I have discovered by trial and error is u < 0 does change the BAQ output in about 1 in 3700 cases, but u >= i_dim doesn't for the tests I've run so far (10 million reads). Using >= i_dim instead of >= i_dim-3 does cure this issue too, so it's what I'm going to go with.

What I'm currently unsure about though is the u < 3 vs u < 0 difference. It's different for sure, but is it worse or better? That's challenging! I'll probably leave it as-is for the sake of reproducibility, but it'd be interesting to figure out why it exists at all given we only ever access u, u+1 and u+2. I wonder if it's just an accident brought about by thinking of things like set_u(v11, bw, i-1, k-1), but if so it's wrong, as those negative offsets get applied before computing u and not after.