nsbuitrago / parasail-rs

SIMD pairwise sequence alignment
https://crates.io/crates/parasail-rs
BSD 3-Clause "New" or "Revised" License
1 stars 0 forks source link

Badly-aligning sequences find zero alignment, with high score #8

Closed sclamons closed 2 months ago

sclamons commented 2 months ago

Not sure if this is a bug or just me misunderstanding the parasail interface. I'm trying to find alignments between very short sequences (20bp and 32 bp), many of which have no real overlap at all. I'd like to disallow gaps, or at least make them very expensive. Ideally, when I try to align two unrelated sequences, I would like parasail to make a best-guess alignment with tons of mismatches and report the number of matches (probably low, around 8). The number I want to get from this, ultimately, is how many bases of match there are in an optimal, (near-)gap-free alignment.

I've done this operation using the python bindings for parasail. It works as I expected using roughly this code:

profile = parasail.profile_create_16(query, PS_SCORE_MATRIX)
for target in [collection]:
    match = parasail.sw_striped_profile_16(
            profile,
            target,
            open=255,
            extend=255
        )
    do_something_with(match.score)

This produces a poor alignment with a score of 6

    GGACTCCTATCTCGTATGCCGTCTTCTGCTTG
      ||| |         ||
     CACTTCACAGCAAAATAAAC

When I try (what I think is) the same thing with parasail_rs, I get some odd results. Let's take an example with

let target = "CACTTCACAGCAAAATAAAC";
let query = "GGACTCCTATCTCGTATGCCGTCTTCTGCTTG";

First I build an aligner with

let matrix = &Matrix::create(b"ACTG", 1, 0).unwrap();
let profile = Profile::new(query.as_bytes(), false, matrix).unwrap();
Aligner::new()
        .profile(profile)
        .striped()
        .local()
        .gap_open(255)
        .gap_extend(255)
        .build()

Then I align with let result = aligner.align(None, target.as_bytes()).ok();. This produces a result with score 29(!!!), end_query = 26, end_ref = 19, which I think is this alignment:

GGACTCCTATCTCGTATGCCGTCTTCTGCTTG
      |  |           |   |                   
      CACTTCACAGCAAAATAAAC

I say "I think" because I'm not entirely sure I understand the indexing of end_query and end_ref, and because turning on tracebacks (by adding .use_trace() to AlignerBuilder construction) completely changes the alignment to

Target:       1 CACTTCACAGCAAAATAAAC-------------------------------      20

Query:        2 --------------------GACTCCTATCTCGTATGCCGTCTTCTGCTTG      32

Length: 51
Identity:         0/51 ( 0.0%)
Similarity:      51/51 (100.0%)
Gaps:             0/51 ( 0.0%)
Score: 51

as reported by r.print_traceback(query.as_bytes(), target.as_bytes());, which looks like it's trying to do a global alignment, and somehow scoring indels as positive score? This really looks like I'm not building the aligner I think I am, and I don't understand why adding tracebacks would change the alignment, and I really don't understand why these alignments are all coming back with such high scores.

Any insight would be appreciated, whether that means finding bugs or correcting your user!

nsbuitrago commented 2 months ago

I think I know what part of the issue is. The main difference is the python implementation is doing 16 bit operation while the default behavior in parasail-rs is to try 8-bit and if overflow is detected, use the 16-bit solution. I can replicate the python behavior with the libparasail-sys crate using the 16-bit mode. The problem you are experiencing with parasail-rs is replicated when I test the same alignment using sat mode (trying 8bit first then 16bit if needed).

fn alignment_16_bit() {
    let target = "CACTTCACAGCAAAATAAAC";
    let query = "GGACTCCTATCTCGTATGCCGTCTTCTGCTTG";

    unsafe {
        let matrix = parasail_matrix_create(b"ACGT".as_ptr() as *const i8, 1, 0);
        let profile = parasail_profile_create_16(
            query.as_bytes().as_ptr() as *const i8,
            query.as_bytes().len() as i32,
            matrix,
        );
        let result = parasail_sw_striped_profile_16(
            profile,
            target.as_bytes().as_ptr() as *const i8,
            target.as_bytes().len() as i32,
            255,
            255,
        );

        parasail_matrix_free(matrix);
        parasail_profile_free(profile);

        println!("{}", (*result).score)
    }
}

fn alignment_sat() {
    let target = "CACTTCACAGCAAAATAAAC";
    let query = "GGACTCCTATCTCGTATGCCGTCTTCTGCTTG";

    unsafe {
        let matrix = parasail_matrix_create(b"ACGT".as_ptr() as *const i8, 1, 0);
        let profile = parasail_profile_create_sat(
            query.as_bytes().as_ptr() as *const i8,
            query.as_bytes().len() as i32,
            matrix,
        );
        let result = parasail_sw_striped_profile_sat(
            profile,
            target.as_bytes().as_ptr() as *const i8,
            target.as_bytes().len() as i32,
            255,
            255,
        );

        parasail_matrix_free(matrix);
        parasail_profile_free(profile);

        println!("{}", (*result).score)
    }
}

Results in very different scores: alignment_16_bit score: 6 alignment_sat score: 32

I think for now the solution would be to add in the solution_width method on AlignerBuilder to adjust the default behavior. For example,

let matrix = &Matrix::create(b"ACTG", 1, 0).unwrap();
let profile = Profile::new(query.as_bytes(), false, matrix).unwrap();
let aligner = Aligner::new()
        .profile(profile)
        .striped()
        .local()
        .gap_open(255)
        .gap_extend(255)
        .solution_width(16)
        .build();

If possible, could you confirm expected behavior after setting the solution width to 16 bit as shown in the example above and updating parasail-rs to the dev branch? To use the dev branch just update Cargo.toml:

[dependencies]
parasail-rs = { git = "https://github.com/nsbuitrago/parasail-rs.git", branch = "dev" }

Now, I have no idea why use_trace() results in significantly different alignments. I will have to look into this more.

sclamons commented 2 months ago

That was fast! Thanks for getting to this so promptly. I'll try forcing 16 bits and see how it does.

sclamons commented 2 months ago

Looks like I'm getting the expected alignment scores now. Thank you so much!

nsbuitrago commented 2 months ago

Great! You should be able to use v0.7.4 now in your Cargo.toml