question on edlib vs vsearch

jianshu93 commented 1 year ago

Hello vsearch team,

For all-versus-all searches (--maxaccepts 0 and --maxrejects 0), I am comparing vsearch with edlib, an edit distance library in which insertions, deletions and substitutions all have the same cost (1). I used edlib's NW mode (Needleman-Wunsch, query and target fully aligned). However, the best hits found for each query (e.g., the top 20) can differ between the two tools, and I am wondering which one is closer to the truth. The query is ~1500 bp and the targets are 1000 1500bp 16S ribosomal genes.

Thanks,

Jianshu

torognes commented 1 year ago

When performing all-vs-all searches with --maxaccepts 0 and --maxrejects 0, vsearch should perform a full global alignment (Needleman-Wunsch) between all sequences. However, when comparing with other alignment implementations, it is important to make sure that the scores and gap penalties are exactly the same. In particular, vsearch by default uses very low terminal gap penalties, which makes the alignment look more like a semi-global alignment. Using different gap penalties at the ends than in the middle of the alignment is somewhat unusual and not supported by many other implementations. I am not sure how edlib handles this. This may be the cause of the discrepancies you observe.
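To make the effect of terminal gap penalties concrete, here is a minimal sketch in plain Python. It is not vsearch's or edlib's code: the function name `global_score`, the toy sequences, and the score values are all illustrative assumptions, and it uses linear gap costs, whereas vsearch's real scoring uses affine penalties (see the --gapopen and --gapext options). It only shows how cheaper end gaps can change which target ranks first.

```python
def global_score(query, target, match, mismatch, gap, end_gap):
    """Best global alignment score (higher is better) with linear gap costs.

    `gap` is the score per internal gap position and `end_gap` the score per
    gap position before the first or after the last residue of a sequence.
    All names and values here are hypothetical, chosen for illustration.
    """
    n, m = len(query), len(target)
    # Row 0: target residues placed before any query residue (leading end gaps).
    prev = [j * end_gap for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0: query residues placed before any target residue.
        curr = [i * end_gap] + [0] * m
        for j in range(1, m + 1):
            same = query[i - 1] == target[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            # Query residue against a gap; terminal if the target is used up.
            up = prev[j] + (end_gap if j == m else gap)
            # Target residue against a gap; terminal if the query is used up.
            left = curr[j - 1] + (end_gap if i == n else gap)
            curr[j] = max(diag, up, left)
        prev = curr
    return prev[m]


query = "GATTACAGC"
target_a = "GATTACAGCTT"  # identical to the query, plus a 2-base overhang
target_b = "GATCACAGC"    # same length as the query, one substitution

# Unit-cost scheme: the score is the negated edit distance.
unit = dict(match=0, mismatch=-1, gap=-1, end_gap=-1)
# Illustrative scheme with cheap terminal gaps (not vsearch's actual defaults).
cheap_ends = dict(match=2, mismatch=-4, gap=-4, end_gap=-1)

for name, scheme in (("unit cost", unit), ("cheap end gaps", cheap_ends)):
    score_a = global_score(query, target_a, **scheme)
    score_b = global_score(query, target_b, **scheme)
    best = "target_a" if score_a > score_b else "target_b"
    print(f"{name}: a={score_a} b={score_b} best={best}")
```

Under the unit-cost scheme the overhang costs more than the single substitution, so target_b ranks first. With cheap end gaps the overhang is nearly free and target_a ranks first. Differences of this kind in a top-20 list are therefore expected whenever the two tools do not use the same scoring.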

Another potential cause of differences is the handling of nucleotide symbols other than A, C, G, and T.
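As a small illustration of why this matters, the sketch below counts mismatches between two equal-length, uppercase toy sequences under two hypothetical comparison rules: literal character equality, and compatibility of IUPAC ambiguity codes. The helper names are made up for this example, and neither rule is claimed to be what vsearch or edlib does by default. The point is only that the choice changes the distance.

```python
# Standard IUPAC nucleotide codes and the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def literal_equal(a, b):
    """Treat every symbol as distinct: N matches only N."""
    return a == b


def iupac_compatible(a, b):
    """Treat two symbols as equal if they share at least one possible base."""
    return bool(set(IUPAC[a]) & set(IUPAC[b]))


def count_mismatches(x, y, same):
    """Count mismatching positions between two equal-length sequences."""
    return sum(not same(a, b) for a, b in zip(x, y))


seq_1 = "ACGTNACR"
seq_2 = "ACGTAACG"

print(count_mismatches(seq_1, seq_2, literal_equal))     # N/A and R/G differ
print(count_mismatches(seq_1, seq_2, iupac_compatible))  # both pairs compatible
```

If one tool scores such positions as mismatches and the other does not, sequences containing ambiguous bases will receive different distances even when the gap penalties agree.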

frederic-mahe commented 1 year ago

I am going to close that issue. Feel free to re-open if need be.

torognes / vsearch

question on edlib vs vsearch #499