logannc / fuzzywuzzy-rs

port of https://github.com/seatgeek/fuzzywuzzy
GNU General Public License v2.0

Fix unicode panics #23

Closed · iwahbe closed 3 years ago

iwahbe commented 3 years ago

Slicing into arbitrary Rust strings can panic: byte-indexed slices assume every in-bounds index falls on a character boundary, which is not true for multi-byte UTF-8 characters. I provide an O(n) slice function into arbitrary Rust strings; it does not panic and otherwise behaves like Rust slices. I also change `len()` to `chars().count()`, since `len()` counts bytes and likewise causes problems with Unicode characters.
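
Roughly the shape of helper being described, as a sketch (the name `char_slice` and the clamping behavior are illustrative, not necessarily what the PR implements):

```rust
/// Slice `s` by character index rather than byte index.
/// Unlike `&s[start..end]`, this never panics on multi-byte UTF-8 input;
/// indices past the end of the string are clamped to the end.
/// O(n), because it must walk the string to find character boundaries.
fn char_slice(s: &str, start: usize, end: usize) -> &str {
    // Byte offset of the `char_idx`-th character (or the end of the string).
    let byte_at = |char_idx: usize| {
        s.char_indices()
            .map(|(byte_idx, _)| byte_idx)
            .chain(std::iter::once(s.len()))
            .nth(char_idx)
            .unwrap_or(s.len())
    };
    let begin = byte_at(start);
    let finish = byte_at(end.max(start));
    &s[begin..finish]
}

fn main() {
    let s = "héllo wörld";
    // `&s[0..2]` would panic: byte 2 falls inside the two-byte 'é'.
    assert_eq!(char_slice(s, 0, 2), "hé");
    // Out-of-range indices are clamped instead of panicking.
    assert_eq!(char_slice(s, 6, 100), "wörld");
}
```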

iwahbe commented 3 years ago

The `deny(warnings)` attribute was removed to make development easier; I forgot that I had removed it. Sorry about the spelling mistakes. All tests except `slice_at_the_end` and `empty` fail with plain byte-indexed slices: they either panic or return the wrong slice.

iwahbe commented 3 years ago

I made most of the suggested changes (I ran out of energy for the scalar split test). The build now fails because I added a new test. There are two TODOs I would like you to look at. One of them is about flipping (it doesn't seem to matter). The other is on line 143: you use a std method, `match_indices`, which does not do exactly what we need for Unicode strings. I don't understand the algorithm well enough to fix it right now; if you could take a look at that, it would be excellent.
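
For reference, `str::match_indices` reports byte offsets, so on non-ASCII input those offsets do not line up with char positions; a small illustration (assuming that is the mismatch in question):

```rust
fn main() {
    // `match_indices` yields *byte* offsets of each match.
    let offsets: Vec<usize> = "héllo".match_indices('l').map(|(i, _)| i).collect();
    assert_eq!(offsets, vec![3, 4]); // byte offsets, because 'é' is two bytes

    // The char positions of those same matches are 2 and 3.
    let char_positions: Vec<usize> = "héllo"
        .chars()
        .enumerate()
        .filter(|&(_, c)| c == 'l')
        .map(|(i, _)| i)
        .collect();
    assert_eq!(char_positions, vec![2, 3]);
}
```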

logannc commented 3 years ago

@seanpianka will you take a look? I'm not sure this should be our final approach to unicode support, but it is better than what we currently have.

seanpianka commented 3 years ago

> @seanpianka will you take a look? I'm not sure this should be our final approach to unicode support, but it is better than what we currently have.

Besides the one question I've left, I don't have much to add here (other than "great work!" 😄).

As for unicode support for the v1.0 milestone, @logannc, what work is left to do? I suppose we're still missing unicode-specific tests for the `process` module? Or is there something more fundamental we're missing?

logannc commented 3 years ago

> As for unicode support for the v1.0 milestone, @logannc, what work is left to do? I suppose we're still missing unicode-specific tests for the `process` module? Or is there something more fundamental we're missing?

Well, I'm dissatisfied with the options available for handling unicode. In the same way we allow alternative scorers, we should allow callers to choose how unicode is handled.

You could imagine a few different strategies: byte-level comparisons (essentially our original implementation, sans panics), this new approach, which is mostly at the `char` level (we need to double-check whether any bits are still byte-level, like `match_indices`), or one based on Unicode normalization of grapheme clusters.

I'm unsure what the resulting API should look like. Possibly a trait that exposes the necessary methods, with a `ByteLevelStrategy`, `CharLevelStrategy`, and `GraphemeLevelStrategy` that all implement it (or you choose the strategy on init, or something similar). Maybe something else entirely, I'm not sure, but I'm leaning towards something in this direction. With that in mind, I would want a refactor before 1.0 - maybe initially with a fuzzywuzzy-compatible strategy, introducing the others in later releases.
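
A rough sketch of what such a trait could look like, purely hypothetical (none of these names exist in fuzzywuzzy-rs today, and a grapheme-level implementation would presumably build on the `unicode-segmentation` crate):

```rust
// Hypothetical sketch only; none of these types exist in fuzzywuzzy-rs.
trait UnicodeStrategy {
    /// The unit of comparison: a byte, a `char`, or a grapheme cluster.
    type Unit: PartialEq;
    /// Split a string into the units that scorers compare.
    fn units(&self, s: &str) -> Vec<Self::Unit>;
}

struct ByteLevelStrategy;
struct CharLevelStrategy;
// A GraphemeLevelStrategy would likely build on the `unicode-segmentation` crate.

impl UnicodeStrategy for ByteLevelStrategy {
    type Unit = u8;
    fn units(&self, s: &str) -> Vec<u8> {
        s.bytes().collect()
    }
}

impl UnicodeStrategy for CharLevelStrategy {
    type Unit = char;
    fn units(&self, s: &str) -> Vec<char> {
        s.chars().collect()
    }
}

fn main() {
    let s = "héllo";
    assert_eq!(ByteLevelStrategy.units(s).len(), 6); // 'é' occupies two bytes
    assert_eq!(CharLevelStrategy.units(s).len(), 5); // but is a single char
}
```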