Support specifying alternative strategies for handling unicode

seanpianka commented 3 years ago

Opened based on the discussion in #23 and in this comment.

"Well, I'm dissatisfied with the options available for handling unicode. In the same way we allow alternative scorers, we should allow callers to choose how unicode is handled.

You could imagine a few different strategies: byte-level comparisons (essentially our original implementation sans panics), this new approach which is mostly at the 'char' level (need to double check if there are any bits that are still byte-level like match_indices), or one based on unicode normalization of grapheme clusters."

logannc commented 3 years ago

There are two levels to possible strategies:

Segmentation:

Bytes
Codepoints (USVs) - turns out these are equivalent if we use rust strings as our input
Graphemes - what people think of as characters
Others that we may omit such as sentence and word segmentation but maybe?

Segmentation strategies for Unicode are defined by https://www.unicode.org/reports/tr29/ and implemented by https://docs.rs/unicode-segmentation/1.7.1/unicode_segmentation/trait.UnicodeSegmentation.html (where not already available - e.g., bytes and codepoint segmentation are casting to [u8] and str.chars() respectively).

Normalization:

No Normalization
Unicode Normalization Form C
Unicode Normalization Form KC
Unicode Normalization Form D
Unicode Normalization Form KD
... the previous 4 preceeded by CJK Compatibility Ideograph normalization

Normalization strategies are outlined by https://www.unicode.org/reports/tr15/ which are all implemented by https://docs.rs/unicode-normalization/0.1.17/unicode_normalization/trait.UnicodeNormalization.html.

I propose that both levels of strategy be represented by

Interfaces for input into our library (e.g., Segmentor = Fn(str) -> Iterator<Item=str> and Normalizer = Fn(str) -> Iterator<Item=char>)
Reexport these default implementations from these libraries, probably as a default feature so those (probably large) Unicode libraries are easy to use by default but you can disable it if bloat is your concern.

logannc commented 3 years ago

Implementing both of these means that the core logic functions for computing similarity won't need to care which type they're dealing with. Give them a Vec<T> where T: Eq and it's good to go. Contrast this with the hoops #23 jumps through because the index doesn't point to the whole item. With the right segmentation, you'd just have &str slices which naturally encompassed their length.

logannc commented 3 years ago

Played around with these ideas.

in hindsight, it should just produce Vec because we need slices anyway so the iterator isnt enough but it might be able to be Vec<&str> which will ultimately be pretty good for graphemes.

I also re-implemented some of the building blocks (i.e., get_matching_blocks and find_longest_match) using slices, which works.

It's hard to offhand find good examples of the differences between code point and grapheme segmentation, but I found a few that give different results (unsure how sound those results are).

Haven't done the normalizer, but that part is easier. The normalizer will need to run before segmentation (if you segment into bytes, you can't normalize!).

Some of the trait implementations. In practice, they'll be less gnarly when moved from Iterators -> Vec.

pub trait Segmenter<'a> {
    type IterOut: Iterator<Item = Self::Output>;
    type Output: 'a + Eq;
    fn segment(&self, s: &'a str) -> Self::IterOut;
}

pub struct ByteSegmenter;

impl<'a> Segmenter<'a> for ByteSegmenter {
    type IterOut = std::slice::Iter<'a, u8>;
    type Output = &'a u8;
    fn segment(&self, s: &'a str) -> Self::IterOut {
        s.as_bytes().iter()
    }
}

pub struct DotSegmenter;

impl<'a> Segmenter<'a> for DotSegmenter {
    type IterOut = std::str::Split<'a, &'static str>;
    type Output = &'a str;
    fn segment(&self, s: &'a str) -> Self::IterOut {
        s.split(".").into_iter()
    }
}

pub struct CodePointSegmenter;

impl<'a> Segmenter<'a> for CodePointSegmenter {
    type IterOut = std::str::Chars<'a>;
    type Output = char;
    fn segment(&self, s: &'a str) -> Self::IterOut {
        s.chars()
    }
}

#[cfg(feature = "segmentation")]
pub struct GraphemeSegmenter;

#[cfg(feature = "segmentation")]
use unicode_segmentation::{Graphemes, UnicodeSegmentation};

#[cfg(feature = "segmentation")]
impl<'a> Segmenter<'a> for GraphemeSegmenter {
    type IterOut = Graphemes<'a>;
    type Output = &'a str;
    fn segment(&self, s: &'a str) -> Self::IterOut {
        s.graphemes(true)
    }
}

logannc commented 3 years ago

Oh, here is a good example, comparing "किमपि" to "किमप" (delete 1 off the end of the first string) in order of Bytes (89), Codepoints (89), Graphemes (67).

Iter([224, 164, 149, 224, 164, 191, 224, 164, 174, 224, 164, 170, 224, 164, 191]) ~ Iter([224, 164, 149, 224, 164, 191, 224, 164, 174, 224, 164, 170])
Chars(['क', 'ि', 'म', 'प', 'ि']) ~ Chars(['क', 'ि', 'म', 'प'])
["कि", "म", "पि"] ~ ["कि", "म", "प"]

logannc / fuzzywuzzy-rs

Support specifying alternative strategies for handling unicode #24