saberscientist / compare-string

----
Other
1 stars 0 forks source link

Anagrams #7

Open saberscientist opened 3 years ago

saberscientist commented 3 years ago

Algorithms such as Sorenson-Dice Coefficient and Jaccard's Index are not meant to be used for textual input - they were originally designed to compare the similarity between two distinct sets.

However, the statistic was made to work with strings by splitting the strings into an array of "tokens":

const foo = "hello world";
const tokenizer = (input) => //some function here
console.log(tokenizer(foo));
// output :: ["He", "el", "ll", "lo", "o ", " w", "wo", "or", "rl", "ld"];

and then comparing the two resulting arrays using the aforementioned algorithm.

The flaw

However, this poses a critical issue: Anagrams. Neither Dice's Coefficient nor Jaccard's Index put into play the order of the tokens in the set, which means that comparing anagrams to each other would most likely yield a near-perfect result, even if the intended word is completely different to the anagram.

This has the potential to significantly decrease the reliability and accuracy of this package (specifically the matchArray method), as anagrams may take priority over legitimate close matches

For example

const { compareString } = require('compare-string');
const { compareTwoStrings }  = require('string-similarity');

console.log(compareTwoStrings('hello world', 'world hello'));
// output :: 0.8888888888888
console.log(compareString('hello world', 'world hello'));
// output :: 0.766666666666

Proposed solution One proposed solution may be to put Levenshtein Distance (or any other edit-distance based algos) into play. This can counteract the flaw with anagrams, as edit distance also considers the order of the letters.

saberscientist commented 2 years ago

8 will address this.