Closed iwahbe closed 3 years ago
The `deny(warnings)` was removed to make development easier; I forgot I removed it. Sorry about the spelling mistakes. All tests but `slice_at_the_end` and `empty` fail with simple slices. They either panic or return the wrong slice.
I made most of the suggested changes (I ran out of energy for the scalar split test). The build now fails because I added a new test. There are two `TODO`s I would like you to look at. One of them is about flipping (it doesn't seem to matter). The other is on line 143. You use a `std` method, `match_indices`, which does not do exactly what we need it to do for Unicode strings. I don't understand the algorithm well enough to fix it right now. If you could take a look at that it would be excellent.
@seanpianka will you take a look? I'm not sure this should be our final approach to unicode support, but it is better than what we currently have.
Besides the one question I've left, I don't have much to add here (besides "great work!" 😄).
As far as the unicode support for the v1.0 milestone, @logannc, what work is left to do? I suppose we're still missing Unicode-specific tests for the `process` module? Or is there something more fundamental we're still missing?
Well, I'm dissatisfied with the options available for handling unicode. In the same way we allow alternative scorers, we should allow callers to choose how unicode is handled.
You could imagine a few different strategies: byte-level comparisons (essentially our original implementation sans panics), this new approach, which is mostly at the `char` level (I need to double check whether any bits are still byte-level, like `match_indices`), or one based on Unicode normalization of grapheme clusters.
I'm unsure of what the resulting API should look like. Possibly a trait that exposes many methods, with a `ByteLevelStrategy`, `CharLevelStrategy`, and a `GraphemeLevelStrategy` that all implement the trait (or you choose the strategy on init, or something similar). Maybe something else entirely, I'm not sure, but I'm leaning in this direction. With that in mind, I would want a refactor before 1.0, maybe initially with a Fuzzywuzzy-compatible strategy, introducing the others in later releases.
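To make the idea concrete, here is a rough sketch of what such a trait might look like. All names are hypothetical and only one method is shown; a grapheme-level impl would pull in something like the `unicode-segmentation` crate, so it's omitted here:

```rust
/// Hypothetical trait: each strategy defines what one "unit" of text is.
trait UnicodeStrategy {
    /// Length of `s` in this strategy's units.
    fn len(&self, s: &str) -> usize;
    // ...slicing, matching, etc. would live here too.
}

/// Counts raw bytes (the original, panic-prone behavior).
struct ByteLevelStrategy;
impl UnicodeStrategy for ByteLevelStrategy {
    fn len(&self, s: &str) -> usize {
        s.len()
    }
}

/// Counts Unicode scalar values (this PR's approach).
struct CharLevelStrategy;
impl UnicodeStrategy for CharLevelStrategy {
    fn len(&self, s: &str) -> usize {
        s.chars().count()
    }
}

fn main() {
    let s = "héllo";
    println!("bytes: {}", ByteLevelStrategy.len(s)); // 6
    println!("chars: {}", CharLevelStrategy.len(s)); // 5
}
```

The scorers could then take `impl UnicodeStrategy` (or a type parameter) the same way they currently take alternative scorers.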
Slicing into arbitrary Rust strings can panic: it assumes that all in-bounds slices are valid, which is untrue when a boundary falls inside a multi-byte character. I provide an O(n) slice function over arbitrary Rust strings. It does not panic, and otherwise behaves like Rust slices. I also change `len()` to `chars().count()`, as `len()` likewise causes problems with Unicode characters.
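A minimal sketch of such a panic-free, char-indexed slice (the function name and `Option` signature are illustrative, not necessarily what the PR uses):

```rust
/// Slice `s` by *char* indices `[start, end)`, returning `None` instead of
/// panicking when the range is reversed or out of bounds. O(n) overall.
fn char_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }
    // Byte offset of the nth char boundary; `s.len()` is the final boundary.
    let boundary = |n: usize| {
        s.char_indices()
            .map(|(i, _)| i)
            .chain(std::iter::once(s.len()))
            .nth(n)
    };
    Some(&s[boundary(start)?..boundary(end)?])
}

fn main() {
    let s = "héllo";
    assert_eq!(char_slice(s, 1, 3), Some("él")); // chars 1..3
    assert_eq!(char_slice(s, 0, 5), Some(s));    // full string
    assert_eq!(char_slice(s, 2, 6), None);       // out of range: no panic
}
```

By contrast, a naive byte slice like `&s[0..2]` on that string panics, because byte 2 sits in the middle of `é`.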