infoscout / weighted-levenshtein

Weighted Levenshtein library
MIT License

Self-substitution costs #21

Open veghp opened 4 years ago

veghp commented 4 years ago

Thank you for this great package that helps me in comparing short sequences (https://github.com/Edinburgh-Genome-Foundry/Examples/tree/master/SeqDistance).

I'm wondering if it would be possible to add a feature: self-substitution costs. Currently the diagonal of the substitution matrix seems to be ignored.

To expand on this a bit, we use some characters to encode multiple characters (e.g. S = C or G), that is, to encode uncertainty. In this case the chance that two Ss encode the same letter is 50%, so the penalty score should be 0.5.
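To make the request concrete, here is a minimal pure-Python sketch of a Levenshtein distance that does consult the substitution cost for identical characters. It is not this package's implementation; the function name `weighted_lev` and the dict-based `sub_cost` table are assumptions made for illustration only.

```python
def weighted_lev(s, t, sub_cost=None, ins_cost=1.0, del_cost=1.0):
    """Levenshtein distance that honours self-substitution costs.

    sub_cost maps (char_in_s, char_in_t) -> cost. Pairs that are not
    listed fall back to 0.0 for identical characters and 1.0 otherwise.
    Hypothetical helper, not part of weighted-levenshtein.
    """
    sub_cost = sub_cost or {}
    # Row 0: turning the empty prefix of s into t[:j] takes j insertions.
    prev = [j * ins_cost for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        # Column 0: turning s[:i] into the empty string takes i deletions.
        curr = [i * del_cost]
        for j, b in enumerate(t, start=1):
            default = 0.0 if a == b else 1.0
            cost = sub_cost.get((a, b), default)  # looked up even when a == b
            curr.append(min(prev[j] + del_cost,       # delete a
                            curr[j - 1] + ins_cost,   # insert b
                            prev[j - 1] + cost))      # substitute or match
        prev = curr
    return prev[-1]


# S encodes "C or G", so two S characters match with probability 0.5.
ambiguity = {("S", "S"): 0.5}
print(weighted_lev("ASG", "ASG", ambiguity))
```

With an empty cost table the function reduces to the ordinary unit-cost Levenshtein distance.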

veghp commented 4 years ago

A current workaround is to replace all characters (ATCG...) in one of the strings with characters from a second set (#@;&...) and define penalties between the two sets of characters (alphabets). The cost is that the number of allowed characters is halved.
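A rough sketch of that workaround, kept in pure Python so it runs on its own. The shadow symbols, the `costs` table, and `lev_off_diagonal_only` (a stand-in that ignores the diagonal, as the library appears to) are all assumptions for illustration; with the real package the same costs would go into its substitution matrix instead.

```python
ALPHABET = "ATCGS"
SHADOW = "#@;&%"  # hypothetical stand-in symbols, one per letter above
TO_SHADOW = str.maketrans(ALPHABET, SHADOW)

# Costs between a real letter and the shadow of a letter: 0 when the
# shadow stands for the same letter, 1 otherwise.
costs = {(a, sb): (0.0 if a == b else 1.0)
         for a in ALPHABET
         for b, sb in zip(ALPHABET, SHADOW)}
costs[("S", "%")] = 0.5  # S vs. S: same base only half of the time


def lev_off_diagonal_only(s, t, sub_cost):
    """Stand-in distance: identical characters always cost 0."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            # The table is only consulted when the characters differ.
            cost = 0.0 if a == b else sub_cost.get((a, b), 1.0)
            curr.append(min(prev[j] + 1.0,        # delete a
                            curr[j - 1] + 1.0,    # insert b
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def distance_with_self_costs(s, t):
    # Moving t into the shadow alphabet makes every aligned pair differ,
    # so the S/S penalty is reached through an off-diagonal entry.
    return lev_off_diagonal_only(s, t.translate(TO_SHADOW), costs)


print(distance_with_self_costs("ASG", "ASG"))
```

Only one of the two strings is translated, which is why each real letter needs its own shadow symbol and half of the available characters are used up.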