Open msuchane opened 2 years ago
With version 0.5.0, the tool now pre-selects using character-level trigrams. As a result, the search is about 10 times faster.
Word-level trigrams could produce more accurate results and might even be faster, but no library can currently calculate them.
I'm leaving this open to consider word-level trigrams in the future.
The `slice::windows` method would be quite useful when implementing word trigrams.
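As a rough illustration of that idea, here is a minimal sketch that builds word-level trigrams with `windows(3)` from the standard library. The function name `word_trigrams` and the choice to split on whitespace are assumptions made for this example, not something the tool currently does.

```rust
use std::collections::HashSet;

/// Hypothetical helper: collects the set of word-level trigrams in `text`.
/// Words are split on whitespace (an assumption); a text with fewer than
/// three words yields an empty set because `windows(3)` produces nothing.
fn word_trigrams(text: &str) -> HashSet<[&str; 3]> {
    let words: Vec<&str> = text.split_whitespace().collect();
    words
        .windows(3) // every run of three consecutive words
        .map(|w| [w[0], w[1], w[2]])
        .collect()
}
```

Calling `word_trigrams("the quick brown fox jumps")` yields three trigrams, one per window. Borrowing `&str` slices avoids allocating a new string for every trigram, which might matter on large files.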
Before comparing file content with the Levenshtein or Jaro distance, first compare the two files using word-level trigrams to get a general sense of their similarity. Then apply the distance metric only to files that the trigram comparison finds relatively similar.
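The two-stage comparison could look roughly like the sketch below. It is only an illustration under several assumptions: the trigram score is a Jaccard index over word-trigram sets, the threshold is a caller-supplied parameter, and the Levenshtein distance is a plain hand-written version rather than whatever library the tool actually uses. All function names here are hypothetical.

```rust
use std::collections::HashSet;

/// Hypothetical helper: the set of word-level trigrams in `text`.
fn word_trigrams(text: &str) -> HashSet<[&str; 3]> {
    let words: Vec<&str> = text.split_whitespace().collect();
    words.windows(3).map(|w| [w[0], w[1], w[2]]).collect()
}

/// Jaccard index of the two trigram sets, from 0.0 to 1.0.
/// Assumption: the issue does not say which similarity score to use.
fn trigram_similarity(a: &str, b: &str) -> f64 {
    let (ta, tb) = (word_trigrams(a), word_trigrams(b));
    let union = ta.union(&tb).count();
    if union == 0 {
        return 0.0; // neither text has three words
    }
    ta.intersection(&tb).count() as f64 / union as f64
}

/// Character-level Levenshtein distance, keeping only two rows of the table.
fn levenshtein(a: &str, b: &str) -> usize {
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == *cb { 0 } else { 1 };
            let best = (prev[j] + cost) // substitution or match
                .min(prev[j + 1] + 1) // deletion
                .min(curr[j] + 1); // insertion
            curr.push(best);
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Runs the expensive metric only when the cheap trigram check passes.
/// Returns `None` for pairs that the pre-selection rejects.
fn compare(a: &str, b: &str, threshold: f64) -> Option<usize> {
    if trigram_similarity(a, b) < threshold {
        return None;
    }
    Some(levenshtein(a, b))
}
```

For example, `compare("one two three four five", "one two three four six", 0.4)` passes the pre-selection and returns a distance, while a pair with no shared trigrams returns `None` without ever running the distance computation. A suitable threshold would have to be found by experiment.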
Resources: