heyimalex / bitap

Fuzzy string search algorithm in Rust
9 stars 0 forks source link
bitap fuzzy-search rust

Bitap

Crates.io

Implementation of the bitap algorithm for fuzzy string search in Rust.

If you have a bunch of text to search through and a substring that you're looking for, bitap can efficiently find all places in the source text that are at most k edits away from the search string, where "edits" are in terms of Levenshtein distance or optimal string alignment distance. There's a small upfront cost, but the runtime is O(nk) in practice, which is pretty solid if that's what you need.

Usage

Compile a Pattern and then use it to search.

use bitap::{Pattern,Match};

// Compile the pattern you're searching for.
let pattern = Pattern::new("wxrld")?;

// Run one of the search functions:
// - pattern.lev for levenshtein distance matching
// - pattern.osa for optimal string alignment distance matching
// The second parameter determines the maximum edit distance
// that you want to return matches for.
let max_distance = 1;
let matches = pattern.lev("hello world", max_distance);

// Horray!
assert_eq!(matches.next(), Some(Match{ distance: 1, end: 10 }));

Limitations

Adapters

The Pattern struct handles most common usages of bitap; fuzzy searching for unicode patterns in unicode text. But it's not perfect, and bitap itself is generalizable to a much broader range of problems.

Luckily, the core of bitap is actually representable in a way that doesn't care about whether you're dealing with code points or graphemes or even nucleotides, and I can punt all those concerns to someone who cares!

They key insight is that the main algorithm works on an iterator of pattern masks. Bitap can then be implemented as a iterator adapter that takes in Iterator<Item = usize> and returns an iterator of matches. That's what the top level find, levenshtein and optimal_string_alignment functions are; you write the code that makes the pattern-mask iterator, they find the matches.

Static Variants

There are a couple of static versions of the iterator adapters, levenshtein_static and optimal_string_alignment_static. What's that about?

According to wikipedia:

In his seminal paper, Damerau stated that more than 80% of all human misspellings can be expressed by a single error of one of the four types.

Also, I did a lot of toying with algolia while thinking about fuzzy search, and they only allow up to two typos. So allowing up to two errors is probably a common enough case to optimize for, and we can eek out \~15% more performance by avoiding an allocation!

Match Highlighting

Bitap can unfortunately only tell you what index a match ends on. When edit distance is zero, the beginning of the match is trivially match.end - pattern_length + 1, but with edits it's that +- match.distance. This makes perfectly accurate highlighting a pain, but here are some strategies that have worked for me.

As I think about this more some it may be added into the crate.

Also, I should note that while I haven't figured out how to recover the beginning of the match from the internal bitap state, that doesn't mean it's impossible. Interested to see if anyone can come up with something!