jeffdaily / parasail

Pairwise Sequence Alignment Library
Other
241 stars 34 forks source link

Needleman-Wunsch with affine gap penalties #26

Closed iagorios closed 7 years ago

iagorios commented 7 years ago

Hi Jeff, do you plan do make available a function that implements NW-Gotoh?

jeffdaily commented 7 years ago

I apologize if I have misunderstood your question, but all functions in parasail use affine gap penalties. Recently, for the functions that return statistics, those functions were corrected to use affine penalties (they weren't previously, which was a bug). Are you experiencing results of using any of the "parasail_nw" et al routines that are incorrect w.r.t. an affine penalty in another program?

iagorios commented 7 years ago

I was getting different results from a algorithm that I'm working on. However, I've figured out that probably the author is using a blastn like approach, and maybe Parasail is using megablast. Am I correct?

jeffdaily commented 7 years ago

When I was developing parasail, I always would start by writing a non-vectorized, i.e., serial version of the routine. For example, when implementing Needleman-Wunsch, I started with "parasail_nw" and wrote that first. I would verify the implementation against EMBOSS using their online alignment tools (http://www.ebi.ac.uk/Tools/psa/). Once I was satisfied that it was getting the correct answer on approximately 10 sequences, I would then move on to write the vectorized versions. Do you have a known, well-vetted tool that I could automatically verify against parasail, perhaps as part of its test suite?

I've always used affine gap penalties for parasail. Note that when any of the algorithms open a gap, only the gap open penalty alone is applied. That could lead to differences with other programs that might apply open+extend when opening a new gap.

I'm not familiar with megablast, but from what I briefly read it is optimized for similar sequences. Parasail does not optimize for particular sequence characteristics.

Are you seeing these differences with the C library or with the Java bindings? I recall you had requested the Java bindings so it would be worthwhile to verify those bindings against the C library answers, just in case.

iagorios commented 7 years ago

I see. To validate our experiments we have used NCBI Blast (https://blast.ncbi.nlm.nih.gov/Blast.cgi).

Well, that probably is the reason behind the differences on my results and those ones obtained by Parasail, once we are applying both open and extended. How complex do you think is it if I decide to modify Parasail to apply open+extend?

And yes, it is optimized for some sequences.

I've got these differences with the java bindings. But again, I don't think Parasail is performing wrong calculations, it is probably a difference between our implementations.