leontoeides / indicium

A simple in-memory search for collections and key-value stores.
https://crates.io/crates/indicium
Apache License 2.0

Basic Unicode support #2

Closed lain-dono closed 5 months ago

lain-dono commented 6 months ago

Strings in Rust are UTF-8 encoded, and UTF-8 is a variable-length encoding: a single character can occupy anywhere from one to four bytes. So

let mut index = indicium::simple::SearchIndex::<usize>::default();
index.search("лол"); // lol in Cyrillic

causes a panic:

thread 'main' panicked at indicium-0.6.1/src/simple/internal/strsim/strsim_context_autocomplete.rs:51:30:
byte index 3 is not a char boundary; it is inside 'о' (bytes 2..4) of `лол`

Here's an example of the internal byte representation. Each Latin letter takes one byte, while each Cyrillic letter takes two:

lol: [108, 111, 108]
лол: [208, 187, 208, 190, 208, 187]
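
The byte layout above is why slicing at byte index 3 fails: that index falls in the middle of the second character. A minimal sketch of the problem and two ways to avoid it, using only the standard library. The `safe_prefix` helper is hypothetical, written for illustration, and is not necessarily how indicium fixed the issue:

```rust
/// Hypothetical helper (not part of indicium): returns the longest prefix
/// of `s` that is at most `max_bytes` bytes long and ends on a char boundary.
fn safe_prefix(s: &str, max_bytes: usize) -> &str {
    if max_bytes >= s.len() {
        return s;
    }
    let mut end = max_bytes;
    // Index 0 is always a char boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    let word = "лол";

    // Six bytes, but only three characters.
    assert_eq!(word.as_bytes(), [208u8, 187, 208, 190, 208, 187]);
    assert_eq!(word.len(), 6);
    assert_eq!(word.chars().count(), 3);

    // Byte index 3 is inside 'о' (bytes 2..4), so `&word[..3]` would panic.
    assert!(!word.is_char_boundary(3));

    // `str::get` returns `None` instead of panicking on a bad boundary.
    assert!(word.get(..3).is_none());

    // Option 1: back off to the nearest char boundary.
    assert_eq!(safe_prefix(word, 3), "л");

    // Option 2: count characters instead of bytes.
    let first_two: String = word.chars().take(2).collect();
    assert_eq!(first_two, "ло");
}
```

Which option fits depends on whether the limit is meant in bytes or in characters. Backing off to a boundary keeps a byte budget, while `chars().take(n)` keeps a character count.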

leontoeides commented 5 months ago

Thank you for reporting this, @lain-dono. This has been corrected: I've added a test and published a new version.