Open seanpianka opened 3 years ago
There are two levels to possible strategies:
Segmentation:
Segmentation strategies for Unicode are defined by https://www.unicode.org/reports/tr29/ and implemented by https://docs.rs/unicode-segmentation/1.7.1/unicode_segmentation/trait.UnicodeSegmentation.html (where not already available - e.g., bytes and codepoint segmentation are casting to [u8]
and str.chars()
respectively).
Normalization:
Normalization strategies are outlined by https://www.unicode.org/reports/tr15/ which are all implemented by https://docs.rs/unicode-normalization/0.1.17/unicode_normalization/trait.UnicodeNormalization.html.
I propose that both levels of strategy be represented by
Segmentor = Fn(str) -> Iterator<Item=str>
and Normalizer = Fn(str) -> Iterator<Item=char>
)Implementing both of these means that the core logic functions for computing similarity won't need to care which type they're dealing with. Give them a Vec<T> where T: Eq
and it's good to go. Contrast this with the hoops #23 jumps through because the index doesn't point to the whole item. With the right segmentation, you'd just have &str
slices which naturally encompassed their length.
Played around with these ideas.
in hindsight, it should just produce Vec
I also re-implemented some of the building blocks (i.e., get_matching_blocks and find_longest_match) using slices, which works.
It's hard to offhand find good examples of the differences between code point and grapheme segmentation, but I found a few that give different results (unsure how sound those results are).
Haven't done the normalizer, but that part is easier. The normalizer will need to run before segmentation (if you segment into bytes, you can't normalize!).
Some of the trait implementations. In practice, they'll be less gnarly when moved from Iterators
pub trait Segmenter<'a> {
type IterOut: Iterator<Item = Self::Output>;
type Output: 'a + Eq;
fn segment(&self, s: &'a str) -> Self::IterOut;
}
pub struct ByteSegmenter;
impl<'a> Segmenter<'a> for ByteSegmenter {
type IterOut = std::slice::Iter<'a, u8>;
type Output = &'a u8;
fn segment(&self, s: &'a str) -> Self::IterOut {
s.as_bytes().iter()
}
}
pub struct DotSegmenter;
impl<'a> Segmenter<'a> for DotSegmenter {
type IterOut = std::str::Split<'a, &'static str>;
type Output = &'a str;
fn segment(&self, s: &'a str) -> Self::IterOut {
s.split(".").into_iter()
}
}
pub struct CodePointSegmenter;
impl<'a> Segmenter<'a> for CodePointSegmenter {
type IterOut = std::str::Chars<'a>;
type Output = char;
fn segment(&self, s: &'a str) -> Self::IterOut {
s.chars()
}
}
#[cfg(feature = "segmentation")]
pub struct GraphemeSegmenter;
#[cfg(feature = "segmentation")]
use unicode_segmentation::{Graphemes, UnicodeSegmentation};
#[cfg(feature = "segmentation")]
impl<'a> Segmenter<'a> for GraphemeSegmenter {
type IterOut = Graphemes<'a>;
type Output = &'a str;
fn segment(&self, s: &'a str) -> Self::IterOut {
s.graphemes(true)
}
}
Oh, here is a good example, comparing "किमपि" to "किमप" (delete 1 off the end of the first string) in order of Bytes (89), Codepoints (89), Graphemes (67).
Iter([224, 164, 149, 224, 164, 191, 224, 164, 174, 224, 164, 170, 224, 164, 191]) ~ Iter([224, 164, 149, 224, 164, 191, 224, 164, 174, 224, 164, 170])
Chars(['क', 'ि', 'म', 'प', 'ि']) ~ Chars(['क', 'ि', 'म', 'प'])
["कि", "म", "पि"] ~ ["कि", "म", "प"]
Opened based on the discussion in #23 and in this comment.