helix-editor / nucleo

A fast and convenient fuzzy matcher library for Rust
Mozilla Public License 2.0

Feature request: Better conversions between Utf32String and Utf32Str #50

Open · bugnano opened 2 months ago

bugnano commented 2 months ago

To maximise matching performance, I keep a Vec<Utf32String> that I match against. It's fairly big (hundreds of thousands of records). For matching, I do something like:

    // keep (entry, original index, score, length) for every entry that matches
    let mut scores: Vec<(Utf32String, usize, u32, usize)> = entries
        .iter()
        .enumerate()
        .filter_map(|(i, file_name)| {
            let score = pattern.score(file_name.slice(..), matcher);

            score.map(|score| (file_name.clone(), i, score, file_name.len()))
        })
        .collect();

So that I can sort it like so:

    // best score first, then shorter entries, then original order
    scores.sort_by(|(_file1, i1, score1, len1), (_file2, i2, score2, len2)| {
        score2.cmp(score1).then(len1.cmp(len2)).then(i1.cmp(i2))
    });

And finally, construct a new Vec<Utf32String> with the filtered result:

    let filtered: Vec<Utf32String> = scores
        .iter()
        .map(|(file_name, _i, _score, _len)| file_name.clone())
        .collect();

Notice that I do file_name.clone() twice.

Now, to avoid at least 1 clone, I was thinking of refactoring the scoring vector like so:

    let mut scores: Vec<(Utf32Str<'_>, usize, u32, usize)> = entries
        .iter()
        .enumerate()
        .filter_map(|(i, file_name)| {
            let slice = file_name.slice(..);
            let score = pattern.score(slice, matcher);

            score.map(|score| (slice, i, score, file_name.len()))
        })
        .collect();

So far so good, but then I found no way of converting a Utf32Str to a Utf32String for the final Vec<Utf32String>.

It would be nice to have, first of all, an AsRef<Utf32Str> impl for Utf32String, so that I don't have to call .slice(..). Next, it would be awesome to have a From<Utf32Str> impl for Utf32String, so that I can do something like Utf32String::from(file_name) when building the final Vec<Utf32String>.
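In the meantime, the only workaround I can think of is rebuilding the owned string by hand by round-tripping through String. A sketch (this assumes Utf32String implements From<String>, and it re-runs the segmentation, which is fine for me since I don't care about the lossiness):

    use nucleo_matcher::{Utf32Str, Utf32String};

    // Workaround sketch: collect the chars of the borrowed Utf32Str back
    // into a String and convert again (segmentation runs a second time).
    fn to_utf32_string(s: Utf32Str<'_>) -> Utf32String {
        s.chars().collect::<String>().into()
    }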

alexrutar commented 2 weeks ago

As far as I understand the codebase, I don't think AsRef is sensible to implement. The problem with AsRef is that the impl would have to look something like

impl<'a> AsRef<Utf32Str<'a>> for Utf32String {
    fn as_ref(&self) -> &Utf32Str<'a> {
        // `slice(..)` returns a new `Utf32Str` by value, so this
        // borrows a temporary that is dropped at the end of the call
        &self.slice(..)
    }
}

but this does not work since the Utf32Str is then immediately dropped.
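An owned conversion, on the other hand, would not have this problem, because From returns a value instead of a reference. As a sketch, assuming the two public variants (Ascii and Unicode) of these types stay as they are today, the impl inside the crate could look like:

// Sketch only: an owned conversion can simply copy the data out.
impl From<Utf32Str<'_>> for Utf32String {
    fn from(s: Utf32Str<'_>) -> Self {
        match s {
            // the Ascii variant holds plain ASCII bytes, which are valid UTF-8
            Utf32Str::Ascii(bytes) => {
                Utf32String::Ascii(std::str::from_utf8(bytes).unwrap().into())
            }
            Utf32Str::Unicode(chars) => Utf32String::Unicode(chars.into()),
        }
    }
}

So the From half of the request seems feasible to me; it is only AsRef that does not fit.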

More generally though, it seems to me that you are treating Utf32String and Utf32Str as literal string types, and (despite the suggestive names) this is not really correct in my opinion. This is described in the documentation (though it should certainly be clarified more), but Utf32String internally stores only the first char of each grapheme cluster. This means that the conversion String -> Utf32String is lossy, so it only really makes sense to think of a Utf32String as a sort of "match object".

Here is an example straight from UAX #29:

use nucleo_matcher::Utf32String;
use unicode_segmentation::UnicodeSegmentation;

let s = "क्षि";
let utf32_s = Utf32String::from(s);

// each grapheme cluster, together with all of the `char`s it contains
for g in s.graphemes(true) {
    println!("{g} {:?}", g.chars());
}
// only the first `char` of each cluster survived the conversion
for ch in utf32_s.slice(..).chars() {
    println!("{ch}");
}

In my opinion, the fact that Utf32String implements Display is semantically wrong: at least to me, it is better to think of it as an "internal String-like matching object". Maybe @pascalkuthe has a good reason for this?

Unfortunately, rendering of grapheme clusters is not handled correctly by many terminals anyway, so actually doing things properly in your own code does not guarantee that things will go as expected...

In your code, I would just use your own types and your own Display implementation, and never use Utf32String inside your own code. In other words, having a Vec<Utf32String> in the first place is (in my opinion) quite weird. I think in most cases the only time you should ever be touching a Utf32String is in the column-filling closure passed to the Injector::push method, sketched below.
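As a minimal sketch of that pattern (FileEntry and add_entries are made-up names for illustration; this assumes the current push signature, which hands the closure the item and the mutable column slice):

use nucleo::{Injector, Utf32String};

// Hypothetical item type, purely for illustration.
struct FileEntry {
    name: String,
}

fn add_entries(injector: &Injector<FileEntry>, entries: Vec<FileEntry>) {
    for entry in entries {
        // the only place a Utf32String is ever constructed
        injector.push(entry, |entry, columns: &mut [Utf32String]| {
            columns[0] = entry.name.as_str().into();
        });
    }
}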

I hope this is helpful!

bugnano commented 2 weeks ago

It's somewhat helpful, but I'm not satisfied with it.

The conversion String -> Utf32String is technically lossy, but for my use case I don't really care.

What I was asking is for the relationship Utf32String -> Utf32Str to be consistent with the one between String and str, OsString and OsStr, and so on: for example, if a function expects &str as its parameter, I can pass it a &String, because String implements Deref<Target = str> and AsRef<str>.

Now, if a function expects a Utf32Str, I expect to be able to pass it a reference to a Utf32String, which is not the case today.
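To illustrate the std pattern I mean (print_str and print_os are hypothetical helpers, just for illustration):

    use std::ffi::{OsStr, OsString};

    fn print_str(s: &str) {
        println!("{s}");
    }

    fn print_os(s: impl AsRef<OsStr>) {
        println!("{:?}", s.as_ref());
    }

    fn main() {
        let owned = String::from("hello");
        // &String coerces to &str thanks to Deref<Target = str>
        print_str(&owned);

        let os_owned = OsString::from("world");
        // &OsString is accepted because OsString implements AsRef<OsStr>
        print_os(&os_owned);
    }

This is exactly the kind of ergonomics I'd like for Utf32String and Utf32Str.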