Algorithms such as Sorenson-Dice Coefficient and Jaccard's Index are not meant to be used for textual input - they were originally designed to compare the similarity between two distinct sets.
However, the statistic was made to work with strings by splitting the strings into an array of "tokens":
const foo = "hello world";
const tokenizer = (input) => //some function here
console.log(tokenizer(foo));
// output :: ["He", "el", "ll", "lo", "o ", " w", "wo", "or", "rl", "ld"];
and then comparing the two resulting arrays using the aforementioned algorithm.
The flaw
However, this poses a critical issue: Anagrams.
Neither Dice's Coefficient nor Jaccard's Index put into play the order of the tokens in the set, which means that comparing anagrams to each other would most likely yield a near-perfect result, even if the intended word is completely different to the anagram.
This has the potential to significantly decrease the reliability and accuracy of this package (specifically the matchArray method), as anagrams may take priority over legitimate close matches
Proposed solution
One proposed solution may be to put Levenshtein Distance (or any other edit-distance based algos) into play. This can counteract the flaw with anagrams, as edit distance also considers the order of the letters.
Algorithms such as Sorenson-Dice Coefficient and Jaccard's Index are not meant to be used for textual input - they were originally designed to compare the similarity between two distinct sets.
However, the statistic was made to work with strings by splitting the strings into an array of "tokens":
and then comparing the two resulting arrays using the aforementioned algorithm.
The flaw
However, this poses a critical issue: Anagrams. Neither Dice's Coefficient nor Jaccard's Index put into play the order of the tokens in the set, which means that comparing anagrams to each other would most likely yield a near-perfect result, even if the intended word is completely different to the anagram.
This has the potential to significantly decrease the reliability and accuracy of this package (specifically the
matchArray
method), as anagrams may take priority over legitimate close matchesFor example
Proposed solution One proposed solution may be to put Levenshtein Distance (or any other edit-distance based algos) into play. This can counteract the flaw with anagrams, as edit distance also considers the order of the letters.