jeffdaily / parasail-python

Python bindings for the parasail C library.
Other
87 stars 17 forks source link

Incorrect nw_trace results with Python 3.6 #16

Closed ihaque-freenome closed 6 years ago

ihaque-freenome commented 6 years ago

Using:

seq =     b'TGAGCCACTGTNCNCAACTGCTTCCCATTCTANANAGTGTCTNAGNCAGTTATAGAGGTTATATANCAAGGTATAAAATTTCTGAGTNAAAGNAAANTAAGGAGNNAGAAATA'
ref_seq = b'TGAGCCACTGTGCTCAACTGCTTCCCATTCTAAAAAGTGTCTAAGGCAGTTATAGAGGTTATATAACAAGGTATAAAATTTCTGAGTGAAAGGAAAATAAGGAGAGAGAAATA'

parasail.nw_trace(seq, ref_seq, gap_open=6, gap_extend=1, matrix)

with the following custom matrix:

   A    C    G    T    N
A  1   -4   -4   -4    0
C  -4   1   -4   -4    0
G  -4  -4    1   -4    0
T  -4  -4   -4    1    0
N   0    0   0    0    0

I get the following results under Python 2.7 with parasail 1.1.7:

cigar: 4=1X4=2X2=1X1=2X2=1X3=1X1=2X1=1X3=1X1=1X11=1X1=1X18=1X1=1X7=1X2=1X19=1X21=1X4=1X3=1X8=1X7=
score: 124

Under Python 3.6, using current HEAD (953abfe4ae90bb9c61fed790ad1ac30baec6bb20):

cigar: 11I1X1=11D1X1=2X2=1X3=1X1=2X1=1X3=1X1=1X11=1X1=1X18=1X1=1X7=1X2=1X19=1X21=1X4=1X3=1X8=1X7=
score: 1684037810

The Py2.7 CIGAR string is as expected - there are a number of bases N'd out in the query that are present in the reference, so we see a number of mismatches. The Py3.6 CIGAR string is crazy, with tons of indels and an insane aligner score. Possibly some ctypes pointer errors?

jeffdaily commented 6 years ago

I realize I closed this but perhaps did not address the root issue. I did discover some bugs in the Python3 interface and fixed them in the latest master. I was storing a reference to the input sequences in the Result object to help create the Cigar object, but I wasn't converting the strings to bytes before calling the C function for cigar creation. Please try again and reopen this issue if it still isn't working.

jeffdaily commented 6 years ago

Yeah I closed this one too soon. I'm getting errors with Python3 and custom matrices. Are you creating the matrix from a text file or are you using the parasail.matrix_create() function?

ihaque-freenome commented 6 years ago

It didn't seem possible to create the matrix specified in the original comment (with zero cost for matches/mismatches to N, but constant match/mismatch penalty for the other bases) using matrix_create, so I've been doing it from a file like so:

MATRIX_TEMPLATE = \
"""# Auto-generated substitution matrix
       A        C        G        T    N
A  %(m)d   %(mm)d   %(mm)d   %(mm)d    0
C  %(mm)d  %(m)d   %(mm)d   %(mm)d    0
G  %(mm)d  %(mm)d   %(m)d   %(mm)d    0
T  %(mm)d  %(mm)d   %(mm)d   %(m)d    0
N       0       0        0       0    0
"""

def make_matrix(match, mismatch):
    with tempfile.NamedTemporaryFile() as tf:
        tf.write((MATRIX_TEMPLATE % {'mm': mismatch, 'm': match}).encode('ascii'))
        tf.flush()
        return parasail.Matrix(tf.name)

matrix = make_matrix(1, -4)
jeffdaily commented 6 years ago

Are you able to test this again using the latest parasail-python install coupled with the latest parasail C library? I was able to see garbage using python3 earlier today but now I can't seem reproduce.

ihaque-freenome commented 6 years ago

Yes, I was able to reproduce just now by cloning this repo and building a wheel using python setup.py bdist_wheel (which appears to have downloaded/built the latest parasail C library), and testing my code with that installed binary.

Here's a minimal test program. It's interesting because it shows that the issues with the CIGAR aren't just random insertions/deletions.

import parasail
import tempfile

MATRIX_TEMPLATE = \
"""# Auto-generated substitution matrix
       A        C        G        T    N
A  %(m)d   %(mm)d   %(mm)d   %(mm)d    0
C  %(mm)d  %(m)d   %(mm)d   %(mm)d    0
G  %(mm)d  %(mm)d   %(m)d   %(mm)d    0
T  %(mm)d  %(mm)d   %(mm)d   %(m)d    0
N       0       0        0       0    0
"""

def make_matrix(match, mismatch):
    with tempfile.NamedTemporaryFile() as tf:
        tf.write((MATRIX_TEMPLATE % {'mm': mismatch, 'm': match}).encode('ascii'))
        tf.flush()
        return parasail.Matrix(tf.name)

matrix = make_matrix(1, -4)
seq = b'N'
ref_seq = b'A'

result = parasail.nw_trace(seq, ref_seq, 6, 1, matrix)
assert result.cigar.decode == b'1X', result.cigar.decode
assert result.score == 0, result.score

Under Python 3.6 I see:

(venv) $ python minimal_test.py 
b'1='
1311600239

and under Python 2.7:

(venv27) $ python minimal_test.py 
1X
0
jeffdaily commented 6 years ago

I found the bug and it's in the python3 interface. I assumed that all sequences passed to alignment functions would be of type str. So I was blindly using codecs.latin_1_encode for all sequences. I checked the inputs to the C nw_trace function and saw the inputs were b'N' and b'A' instead of just N and A. I will add code to check the sequence type first and only encode as needed.

The question I have for you is what should be the return type of the parasail-python Cigar.decode function? Your code asserted that it was bytes. Should it be str instead? I could codecs.latin_1_decode the C result from Cigar.decode back to unicode. Your thoughts?

ihaque-freenome commented 6 years ago

Ah, that makes total sense - I can't remember why I changed those to byte strings; I think I might have had an issue elsewhere in the code and globally changed them (this was a port from 2.7 originally). When I run using normal strings now it works fine.

If the function expects str in then it likely makes more sense for it to return str as well. Seems like codecs.ascii would work fine for the alphabet in CIGAR strings, but I don't know that it matters much.