unicode-rs / unicode-segmentation

Grapheme Cluster and Word boundaries according to UAX#29 rules
https://unicode-rs.github.io/unicode-segmentation

Add unicode_word_indices #91

Closed basile-henry closed 3 years ago

basile-henry commented 3 years ago

This PR adds a new iterator, UnicodeWordIndices, and the corresponding function unicode_word_indices. It is similar to UnicodeWords, but also yields the byte offset of each word.

The motivation for this PR was https://github.com/jonathandturner/reedline/pull/5, in which I used split_word_bound_indices and then filtered the result with logic that is internal to unicode_words. I believe that PR would have been trivial with unicode_word_indices. Hopefully it is useful to others as well.
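
For illustration, the "split on boundaries, then filter" workaround described above looks roughly like the following standard-library-only sketch. Both function names are hypothetical, and the splitter is a deliberately simplified stand-in that only breaks where alphanumeric and non-alphanumeric characters meet; it does not implement the UAX#29 rules and is not the crate's code. The filter step keeps only segments containing at least one alphanumeric character, which is an assumption about the kind of logic unicode_words applies internally.

```rust
/// Hypothetical, simplified stand-in for a word-boundary splitter.
/// Returns (byte offset, segment) pairs, breaking wherever the text
/// switches between alphanumeric and non-alphanumeric characters.
/// NOTE: this is NOT UAX#29 segmentation; it only illustrates the shape.
fn split_bound_indices(s: &str) -> Vec<(usize, &str)> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut prev: Option<bool> = None;
    for (i, c) in s.char_indices() {
        let is_word = c.is_alphanumeric();
        if let Some(p) = prev {
            if p != is_word {
                // Character class changed: close the current segment.
                out.push((start, &s[start..i]));
                start = i;
            }
        }
        prev = Some(is_word);
    }
    if start < s.len() {
        // Emit the trailing segment, if any.
        out.push((start, &s[start..]));
    }
    out
}

/// Hypothetical helper: keep only the segments that look like words
/// (at least one alphanumeric character), preserving their byte offsets.
fn word_indices(s: &str) -> Vec<(usize, &str)> {
    split_bound_indices(s)
        .into_iter()
        .filter(|(_, seg)| seg.chars().any(|c| c.is_alphanumeric()))
        .collect()
}
```

Calling word_indices on a string gives each word together with the byte offset at which it starts, so a caller can slice or move a cursor in the original text without re-scanning it. An iterator like unicode_word_indices would make both the hand-written split and the filter unnecessary.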

Should I add more tests for unicode_word_indices? Or are the existing tests for unicode_words and the doc test for unicode_word_indices sufficient?

Manishearth commented 3 years ago

Retriggering GHA