msuchane / near-facsimile

Find similar or identical files in Red Hat documentation
Apache License 2.0

Speed up using trigrams #4

Open msuchane opened 2 years ago

msuchane commented 2 years ago

Before comparing file content with the Levenshtein or Jaro distance, first compare the two files using word-level trigrams to get a general sense of their similarity. Then, apply the distance metric only to files that are relatively similar by trigrams.
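
A minimal sketch of the two-stage idea, using only the standard library. The names `compare_with_prefilter` and `levenshtein_similarity`, the normalization by the longer text, and the threshold parameter are assumptions made for illustration, not the tool's actual code. The cheap trigram check is passed in as a closure, so any trigram implementation can be plugged in.

```rust
/// Levenshtein distance over Unicode scalar values (two-row dynamic programming).
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the processed prefix of `a`
    // and the first j characters of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for i in 1..=a.len() {
        curr[0] = i;
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j] + 1).min(curr[j - 1] + 1).min(prev[j - 1] + cost);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Distance scaled to 0.0..=1.0, where 1.0 means identical.
/// Assumption: normalize by the length of the longer text.
fn levenshtein_similarity(a: &str, b: &str) -> f64 {
    let longest = a.chars().count().max(b.chars().count());
    if longest == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / longest as f64
}

/// Hypothetical helper: run the expensive metric only when the cheap
/// similarity reaches `threshold`; otherwise skip the pair with `None`.
fn compare_with_prefilter<F>(a: &str, b: &str, cheap_similarity: F, threshold: f64) -> Option<f64>
where
    F: Fn(&str, &str) -> f64,
{
    if cheap_similarity(a, b) < threshold {
        return None;
    }
    Some(levenshtein_similarity(a, b))
}
```

The saving comes from the early `None`: pairs that share few trigrams never reach the quadratic-time distance computation.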

Resources:

msuchane commented 2 years ago

With version 0.5.0, the tool now pre-selects using character-level trigrams. As a result, the search is about 10 times faster.

Word-level trigrams could produce more accurate results and might even be faster, but no library can currently calculate them.

I'm leaving this open to consider word-level trigrams in the future.

msuchane commented 2 years ago

The slice::windows method would be quite useful when implementing word trigrams.
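
A rough sketch of how that could look, assuming words are split on whitespace and that similarity is measured as the Jaccard index of the two trigram sets. Both choices, and the name `word_trigram_similarity`, are assumptions for illustration only.

```rust
use std::collections::HashSet;

/// Jaccard similarity of the word-level trigram sets of two texts.
/// Returns a value in 0.0..=1.0.
fn word_trigram_similarity(a: &str, b: &str) -> f64 {
    let words_a: Vec<&str> = a.split_whitespace().collect();
    let words_b: Vec<&str> = b.split_whitespace().collect();

    // `windows(3)` yields every run of three consecutive words as a slice.
    // It yields nothing when a text has fewer than three words.
    let set_a: HashSet<&[&str]> = words_a.windows(3).collect();
    let set_b: HashSet<&[&str]> = words_b.windows(3).collect();

    if set_a.is_empty() && set_b.is_empty() {
        // Assumption: texts too short for trigrams are compared word for word.
        return if words_a == words_b { 1.0 } else { 0.0 };
    }

    let shared = set_a.intersection(&set_b).count();
    let total = set_a.len() + set_b.len() - shared;
    shared as f64 / total as f64
}
```

For example, "the quick brown fox jumps" and "the quick brown fox sleeps" each produce three trigrams and share two of them, so the score is 2 / 4 = 0.5.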