jeffdaily / parasail-python

Python bindings for the parasail C library.
Other
90 stars 17 forks source link

Allowing negative score in semi-global alignment #29

Closed mdcao closed 5 years ago

mdcao commented 5 years ago

Hi there Thanks you for making a great alignment tool. I have been using it extensively, mostly for semi-global alignment.

I recently found a few very rare cases that semi-global alignment returned incorrect alignment, which I think because it did not allow negative alignment score like global alignment. Let me illustrate with an example. Consider two sequences

s1=CAGAACAGGACCACGACACTC
s2=ACACAGAAAAGGAACCCGCACTCCATC

where their alignment is

---CAGAACAGG-ACCACGACACTC---
   ||||| ||| ||| || |||||
ACACAGAAAAGGAACC-CG-CACTCCATC

If we run the semi-global alignment with large penalty, such as match=1, mismatch=gap_opening=gap_extension=-5, the semi_global alignment

match=1
mismatch=-5
gap_opening=-5
gap_extension=-5
user_matrix = parasail.matrix_create('ACGT', match, mismatch)
sg_aln = parasail.sg_trace_scan_sat(s1, s2, -gap_opening, -gap_extension, user_matrix)
print(sg_aln.cigar.decode, sg_aln.score)

# ('26D1=20I', 1)
# query   --------------------------CAGAACAGGACCACGACACTC',
#                                   |
# subject ACACAGAAAAGGAACCCGCACTCCATC--------------------',
#

The global alignment returns more sensible alignment, albeit incorrect towards the ends because it does not trim the two ends of the subject

match=1
mismatch=-5
gap_opening=-5
gap_extension=-5
user_matrix = parasail.matrix_create('ACGT', match, mismatch)
sg_aln = parasail.nw_trace_scan_sat(s1, s2, -gap_opening, -gap_extension, user_matrix)
print(sg_aln.cigar.decode, sg_aln.score)
# ('3D5=1X3=1D3=1I2=1D2=1X1=2D2=', -32)
# query    ---CAGAACAGG-ACCACG-ACAC--TC
#             ||||| ||| ||| || || |  ||  
# subject  ACACAGAAAAGGAACC-CGCACTCCATC

IMHO, it makes sense for the local alignment to not allow negative alignment score as we want only the significant (local) alignment from trimming the ends of both query and subject. However, for semi-global alignment, just like global alignment, we want the best alignment of the whole query and hence we should allow negative alignment score. Of course, if we have sensible parameters, that issue should not be there, but it is not always possible to have the best parameters in hand.

I am wondering if you consider implement this scoreing to make semi-global consistent with global aligmnent?

Thanks Minh

jeffdaily commented 5 years ago

I will need to try reproducing your inputs to verify what’s going on. Global should be consistent with semi-global since they both allow the score to become negative.

I wonder if the error is caused by the integer size being used not being wide enough to hold intermediate solutions. In any case, I’ll try to reproduce when I find time. Thanks for the report.

On Dec 2, 2018, at 9:27 AM, Minh Duc Cao notifications@github.com<mailto:notifications@github.com> wrote:

Hi there Thanks you for making a great alignment tool. I have been using it extensively, mostly for semi-global alignment.

I recently found a few very rare cases that semi-global alignment returned incorrect alignment, which I think because it did not allow negative alignment score like global alignment. Let me illustrate with an example. Consider two sequences

s1=CAGAACAGGACCACGACACTC s2=ACACAGAAAAGGAACCCGCACTCCATC

where their alignment is

---CAGAACAGG-ACCACGACACTC--- ||||| ||| ||| || ||||| ACACAGAAAAGGAACC-CG-CACTCCATC

If we run the semi-global alignment with large penalty, such as match=1, mismatch=gap_opening=gap_extension=-5, the semi_global alignment

match=1 mismatch=-5 gap_opening=-5 gap_extension=-5 user_matrix = parasail.matrix_create('ACGT', match, mismatch) sg_aln = parasail.sg_trace_scan_sat(s1, s2, -gap_opening, -gap_extension, user_matrix) print(sg_aln.cigar.decode, sg_aln.score)

('26D1=20I', 1)

query --------------------------CAGAACAGGACCACGACACTC',

|

subject ACACAGAAAAGGAACCCGCACTCCATC--------------------',

#

The global alignment returns more sensible alignment, albeit incorrect towards the ends because it does not trim the two ends of the subject

match=1 mismatch=-5 gap_opening=-5 gap_extension=-5 user_matrix = parasail.matrix_create('ACGT', match, mismatch) sg_aln = parasail.nw_trace_scan_sat(s1, s2, -gap_opening, -gap_extension, user_matrix) print(sg_aln.cigar.decode, sg_aln.score)

('3D5=1X3=1D3=1I2=1D2=1X1=2D2=', -32)

query CAGAACAGG-ACCACG-ACAC--TC

||||| ||| ||| || || | ||

subject CAGAAAAGGAACC-CGCACTCCATC

IMHO, it makes sense for the local alignment to not allow negative alignment score as we want only the significant (local) alignment from trimming the ends of both query and subject. However, for semi-global alignment, just like global alignment, we want the best alignment of the whole query and hence we should allow negative alignment score. Of course, if we have sensible parameters, that issue should not be an issue, but it is not always possible to have the best parameters in hand.

I am wondering if you consider implement this scoreing to make semi-global consistent with global aligmnent?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/jeffdaily/parasail-python/issues/29, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AA3MOKpSLMC2i-ssgJ918aY4HI6KFZXHks5u1A2VgaJpZM4Y9gS8.

mdcao commented 5 years ago

Hi @jeffdaily , wonder if you had a chance to look at the issue?

jeffdaily commented 5 years ago

I was able to reproduce your results given your inputs. Now I need to understand if it is doing the right thing for global and semi-global.

Have you tried your inputs with a different alignment tool to verify whether parasail's results are consistent with other tools?

mdcao commented 5 years ago

Thanks. The closest I could find is seqanpy

import seqanpy
s1='CAGAACAGGACCACGACACTC'
s2='ACACAGAAAAGGAACCCGCACTCCATC'
aln=seqanpy.align_overlap(s2,s1,score_match=1, score_mismatch=-5, score_gapext=-5, score_gapopen=-5)
aln
#gives (-2, 'ACACAGAAAAGGAACC-CG-CACTCCATC', '---CAGAACAGG-ACCACGACACTC----')

In the case that the alignment value is positive, or for glocal alignment, both parasail and seqanpy give the same results

jeffdaily commented 5 years ago

One difference between semi-global and global is the boundary conditions. In semi-global, the left-most column and upper row (not shown) are always initialized to 0 so as not to penalize the beginning of sequences. In global, the initial conditions are -gap_open - gap_extend*i, where i is the index of the sequence.

The other difference is in global, the score is always the last value in the table. In semi-global, the score is the maximum of the last row and last column (in this case, it is the first row's last value "1").

Based on your last comment, the seqanpy might be computing the same semi-global score table but it is selecting a different final score (green) versus parasail (red).

image

I'm not sure what the criteria would be for seqanpy's "overlap" selecting not the max in both the last row and column. Perhaps it is selecting the max value for the shorter of the two sequences? Or perhaps it will always only take the max of the last row since it assumes the query (s1) will always be smaller than the target (s2)? Do you have any insight on this?

jeffdaily commented 5 years ago

What I'm after is knowing what the canonical version of semi-global should be. I thought it was what I had provided. I am aware of an established parasail user base that depends on the current parasail semi-global behavior. That said, I am not opposed to yet one more implementation so long as I don't duplicate too much code.

Parasail doesn't have any notion of options that can tweak algorithm behavior. There are three primary algorithms, parasail sw, sg, and nw. Since what you are requesting is a slight tweak of sg, I'm wondering how best to indicate this to the code. An explicit function argument? Something like

 extern parasail_result_t* parasail_nw_trace_scan_sat(
         const char * const restrict s1, const int s1Len,
         const char * const restrict s2, const int s2Len,
         const int open, const int gap,
         const parasail_matrix_t* matrix,
         const int flags); /* <-- the new parameter */

But that would be a backwards-incompatible change and I would rather avoid that.

How would you feel about a setting that could be turned on/off such as a new function parasail_set_flags(int flags)? It could be called more than once during program execution in case you wanted to switch back and forth with the behavior.

I'm struggling a bit whether to say semi-global is giving the wrong answer. If you changed your match and mismatch values to 5 and -4 as in the DNAfull matrix, you will get the alignment you expect.

mdcao commented 5 years ago

The option to turn on/off parasail_set_flags(int flags) sounds great to me. Really appreciate your time for this.

I do not find an official definition of semi-global, but this lecture note gives a comprehensive discussion about various modes of alignment: http://www.cs.cmu.edu/~durand/03-711/2015/Lectures/PW_sequence_alignment_2015.pdf

You will also notice from the lecture that the max function is only applied to local alignment to enforce the alignment score to not be negative, but global and semi-global should be treated equally --- just different penalty to gaps at the ends.

As mentioned originally, this happens very rarely, and in such case as of statistically skewed sequences. We generally do not know the right params until we do alignment and see the results. In the given example and the given params, the alignment

-2,
---CAGAACAGG-ACCACGACACTC----
ACACAGAAAAGGAACC-CG-CACTCCATC

looks better than

1, 
--------------------------CAGAACAGGACCACGACACTC
ACACAGAAAAGGAACCCGCACTCCATC--------------------

(IMHO) With thanks Minh

mdcao commented 5 years ago

Just saw your earlier comment. I think it is ideal to be able to specify which ends allowing gap penalties, that is to select the best score on the last column or the last row. The seqan (C++ version) gave that option (where one can specify how the penalty in the ends should be handled), the python wrapper in seqanpy only implements a few of them.

mdcao commented 5 years ago

And just to give you an idea of the practical problem I am trying to do with semi-global here. Suppose I have a sequence read and I want to know from which genome it comes from. So I want the alignment to have gap penalty-free at only the two ends of the read, not at the end of the reference.

Again, thank so much for the great tool. Minh

jeffdaily commented 5 years ago

Thank you for that PDF. It clearly outlines 8 cases for semi-global alignment. Indeed, this means there is a deficiency in the current parasail semi-global implementation. Currently, parasail semi-global does not penalize gaps in both sequences and at both ends, all at the same time. Though this seems to work okay for some cases, it does not guarantee to return the desired alignment. And the PDF you shared indicates this is not a correct implementation of semi-global alignment to allow gaps in both sequences at the same ends simultaneously.

I think the use case will be as I suggested earlier, a function for setting flags to control semi-global behavior. I don't see a need for flags to control global or local alignment behavior since global and local alignment seems to be very well defined. So I will create a function for tailoring semi-global behavior only. Ideally, this would be an extra argument to the semi-global functions. However, I have so much infrastructure in the code that assumes all alignment routines have the same function signature.

Would you agree that a user would set these semi-global flags once and only once for their application, or can you see a use case where you might want to allow gaps in the query during a portion of your application but then change it to disallow gaps later on?

mdcao commented 5 years ago

This is how seqan letting the user to choose the semi-global mode they want using AlignConfig class: https://seqan.readthedocs.io/en/master/Tutorial/Algorithms/Alignment/PairwiseSequenceAlignment.html#overlap-alignments

As for your question, I think an user may want to back and ford as they may have different problems at different points.

Thanks Minh

jeffdaily commented 5 years ago

The seqan docs show that the alignment options are passed to the global alignment routine. That is probably the correct design, but ... I don't have a lot of time to get this implemented nor do I want to change interfaces, etc.

I think I'll go with some functions to set the semi-global flags system-wide such that the flags can be updated as needed. I'll start a new feature branch for this work. I'm not making any promises w.r.t. how fast I can get this done. I'm doing this work in off hours as a hobby now.

jeffdaily commented 5 years ago

I've thought about this for a few days now and I'm almost inclined to create additional functions. I can avoid copy-and-paste code duplication by implementing an internal semi-global routine that takes additional flags. Then, additional user-visible routines would call the internal routine with appropriate settings. Yes, this would create a lot more functions. 8x more semi-global functions to cover all cases. But it keeps with my original design for parasail.

Now, what should these new routines be called? I will keep parasail_sg for backwards compatibility reasons. We need 8 new names to cover the following cases. A first attempt at naming is suggested, but I would very much appreciate your own suggestions for names.

Gaps are penalty-free at Suggestion 1 Suggestion 2 Suggestion 3
beginning of s1/query parasail_sg_s1b parasail_sg_qb (same as 2)
beginning of s2/database parasail_sg_s2b parasail_sg_db (same as 2)
end of s1/query parasail_sg_s1e parasail_sg_qe (same as 2)
end of s2/database parasail_sg_s2e parasail_sg_de (same as 2)
beginning and end of s1/query parasail_sg_s1be parasail_sg_qbe parasail_sg_qx
beginning and end of s2/database parasail_sg_s2be parasail_sg_dbe parasail_sg_dx
beginning of s1/query and end of s2/database parasail_sg_s1b_s2e parasail_sg_qb_de parasail_sg_qb_de
beginning of s2/database and end of s1/query parasail_sg_s1e_s2b parasail_sg_qe_db parasail_sg_qe_db

Or even more succinctly, the following idea where I can represent all new combinations using two letters.

The use of 'global' and 'local' seems to follow some conventions in other packages where they name some routines as 'glocal' or 'global-local'. Here's a new table of suggestions.

Gaps are penalty-free Suggestion
beginning of s1/query, none of s2 parasail_sg_bg
beginning of s2/database, none of s1 parasail_sg_gb
end of s1/query, none of s2 parasail_sg_eg
end of s2/database, none of s1 parasail_sg_ge
beginning and end of s1/query, none of s2 parasail_sg_lg
beginning and end of s2/database, none of s1 parasail_sg_gl
beginning of s1/query and end of s2/database parasail_sg_be
beginning of s2/database and end of s1/query parasail_sg_eb

So given the above, we wouldn't have a 'parasail_sg_gg' because that would be the same as just 'parasail_nw', global alignment.

Thank you for your input.

mdcao commented 5 years ago

That is so wonderful. I personally like suggestion 2 (or 3) as it is systematic and easy to follow. The last suggestion can give more succinct naming but it involves global and local which could create more confusions. Look forward to these functions with thanks. Minh

jeffdaily commented 5 years ago

I have started my implementation. Of course it's more work than I anticipated, but it is still feasible. Branch: https://github.com/jeffdaily/parasail/tree/feature/sg I've only currently created empty implementations of the new semi-global routines. I went with suggestion 3 for naming.

jeffdaily commented 5 years ago

Here's a sample of new outputs using your original s1 (query) and s2 (database/target).

// do not penalize query begin

optimal_alignment_score: -32
Target:          1 ACACAGAAAAGGAACC-CGCACTCCATC      27
                      |||||*||| ||| || ||*|  ||
Query:           1 ---CAGAACAGG-ACCACG-ACAC--TC      21

// do not penalize query end

optimal_alignment_score: -32
Target:          1 ACACAGAAAAGGAACC-CGCACTCCATC      27
                      |||||*||| ||| || ||*|  ||
Query:           1 ---CAGAACAGG-ACCACG-ACAC--TC      21

// do not penalize query beginning or end

optimal_alignment_score: -32
Target:          1 ACACAGAAAAGGAACC-CGCACTCCATC      27
                      |||||*||| ||| || ||*|  ||
Query:           1 ---CAGAACAGG-ACCACG-ACAC--TC      21

// do not penalize database begin

optimal_alignment_score: -17
Target:          1 ACACAGAAAAGGAACC-CGCACTCCATC      27
                      |||||*||| ||| || ||*|  ||
Query:           1 ---CAGAACAGG-ACCACG-ACAC--TC      21

// do not penalize database end

optimal_alignment_score: -17
Target:          1 ACACAGAAAAGGAACC-CG-CACTCCATC      27
                      |||||*||| ||| || |||||
Query:           1 ---CAGAACAGG-ACCACGACACTC----      21

// do not penalize database beginning or end

optimal_alignment_score: -2
Target:          1 ACACAGAAAAGGAACC-CG-CACTCCATC      27
                      |||||*||| ||| || |||||
Query:           1 ---CAGAACAGG-ACCACGACACTC----      21

// do not penalize query beginning or database end

optimal_alignment_score: -2
Target:          1 -----------------ACACAGAAAAGGAACCCGCACTCCATC      27
                                    ||*|
Query:           1 CAGAACAGGACCACGACACTC-----------------------      21

// do not penalize query end or database begin

optimal_alignment_score: 1
Target:          1 ACACAGAAAAGGAACCCGCACTCCATC--------------------      27
                                             |
Query:           1 --------------------------CAGAACAGGACCACGACACTC      21
jeffdaily commented 5 years ago

Would you please review the changes to the README.md?

https://github.com/jeffdaily/parasail/blob/feature/sg/README.md

I want to make sure it is clear now that there are many semi-global variations, and how to use them.

mdcao commented 5 years ago

Look perfectly clear to me. Really comprehensive.

In bioinformatics, the term semi-global alignment might mean different things to different people. You may want to put a pointer to the PDF lecture we looked at before, so the readers will have a clear idea of what you mean by semi-global alignments and their variations.

Thank again, and look forward to testing those functions. -M

jeffdaily commented 5 years ago

I have merged the new functions into the develop branch. I will work on a C library release as well as updating the python wrappers.

jeffdaily commented 5 years ago

C library released today. Updated python bindings will also release today.