universal-automata / liblevenshtein

Various utilities regarding Levenshtein transducers.
https://github.com/universal-automata/liblevenshtein
MIT License
67 stars 13 forks

Potential edge-case bug using "|" as delimiter in distance metrics #6

Closed: dylon closed this issue 10 years ago

dylon commented 11 years ago

I was looking at the distance metrics and realized there may be an edge case: using | as the delimiter between the two strings being compared, when building their key, may produce incorrect values. Take v1 = "|" and u1 = ""; their key is key1 = "||". Then take v2 = "" and u2 = "|"; their key is key2 = "||" as well. The two distinct pairs collide on the same key.
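To make the collision concrete, here is a minimal sketch in Python. It assumes the key is built by plain concatenation, v + "|" + u; the function name `make_key` is hypothetical and is not taken from the library.

```python
def make_key(v: str, u: str, delimiter: str = "|") -> str:
    """Hypothetical key builder: joins the two terms with a delimiter.

    This mirrors the scheme described above (v + "|" + u); it is an
    assumption for illustration, not the library's actual code.
    """
    return v + delimiter + u


# First pair: v1 = "|", u1 = ""
key1 = make_key("|", "")

# Second pair: v2 = "", u2 = "|"
key2 = make_key("", "|")

# Both pairs produce the same key even though the pairs differ.
print(key1 == key2)  # True: both keys are "||"
```

Anything stored or looked up under such a key cannot tell the pair ("|", "") apart from the pair ("", "|").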

Affected lines:

Look into this and determine whether it affects the correctness of the metrics. If it does, consider escaping any | characters already present in v and u (or whatever character I end up choosing as the delimiter). Alternatively, consider using the null character as the delimiter and flagging it as reserved, since escaping v and u may hurt performance when comparing terms in very large corpora.
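The two options above could look roughly like the following Python sketch. The names `escaped_key` and `reserved_key` and the choice of backslash as the escape character are assumptions for illustration; this is not a description of what v1.1.1 actually changed.

```python
def escaped_key(v: str, u: str, delimiter: str = "|", escape: str = "\\") -> str:
    """Option 1 (hypothetical): escape the delimiter inside each term.

    The escape character itself is escaped first, so that an escaped
    delimiter can never be confused with the real separator.
    """
    def esc(term: str) -> str:
        term = term.replace(escape, escape + escape)
        return term.replace(delimiter, escape + delimiter)

    return esc(v) + delimiter + esc(u)


def reserved_key(v: str, u: str) -> str:
    """Option 2 (hypothetical): use the null character as a reserved delimiter.

    No rewriting of the terms is needed, but terms that contain the
    reserved character have to be rejected.
    """
    if "\0" in v or "\0" in u:
        raise ValueError("terms must not contain the reserved null character")
    return v + "\0" + u


# The two pairs from the report no longer share a key under either scheme.
print(escaped_key("|", "") != escaped_key("", "|"))    # True
print(reserved_key("|", "") != reserved_key("", "|"))  # True
```

The trade-off matches the concern raised above: the escaping variant accepts any input but rewrites both terms on every call, while the reserved-character variant only scans the terms and concatenates them, at the cost of forbidding one character in the input.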

dylon commented 10 years ago

Fixed in v1.1.1