open-i18n / rust-unic

UNIC: Unicode and Internationalization Crates for Rust
https://crates.io/crates/unic

Add a `WordIndices` struct #264

Open bbqsrc opened 5 years ago

bbqsrc commented 5 years ago

We have Words, WordBounds and WordBoundIndices but not WordIndices, and for a tokeniser for a spellchecker I'm working on, this would be extremely nice. :smile:

behnam commented 5 years ago

Thanks for filing this, @bbqsrc.

We haven't spent much time on the string-level API yet, which is why it isn't extensive. No objections to adding WordIndices: as always, PRs are welcome!


Also, IMHO we should try to come up with better naming for these as a higher-level API. A WordIterator in this case may actually emit white-space-only or punctuation tokens, which are not words, per se.

Any ideas/suggestions are welcome! :)

projektir commented 5 years ago

There also seems to be some disagreement between the doc on the Words iterator and what it actually does. The doc says that the Words iterator should return only alphanumeric substrings, but Words actually returns all the substrings, and the alphanumeric part is accomplished by a filter that happens to be applied in all the tests and examples.

It would perhaps be beneficial for performance reasons to have a separate iterator that filters for alphanumeric characters from the beginning? To summarize the interfaces:

These use the current WordBounds iterator:

These would require a new iterator (that I'm interested in contributing):

Words would also drop its filter argument. Words is already an iterator and it seems trivial for users to add a .filter() on top.
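To make the "filter on top" idea concrete, here is a rough, std-only sketch of how a WordIndices-style iterator could be layered over anything that yields (byte offset, segment) pairs. Everything named here is hypothetical: `naive_bound_indices` is only a stand-in for a real boundary iterator such as WordBoundIndices (it splits on alphanumeric/non-alphanumeric transitions and does not implement the Unicode word-boundary rules), and `word_indices` is an illustrative name, not an existing UNIC API.

```rust
/// Hypothetical stand-in for a word-boundary iterator. It splits wherever
/// the "is alphanumeric" class of adjacent characters changes. This is an
/// assumption for illustration only and is NOT UAX #29 segmentation.
fn naive_bound_indices(s: &str) -> Vec<(usize, &str)> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut prev: Option<bool> = None;
    for (i, c) in s.char_indices() {
        let class = c.is_alphanumeric();
        if let Some(p) = prev {
            if p != class {
                // Class changed: close the segment that ends just before `i`.
                out.push((start, &s[start..i]));
                start = i;
            }
        }
        prev = Some(class);
    }
    // Emit the trailing segment, if any.
    if start < s.len() {
        out.push((start, &s[start..]));
    }
    out
}

/// Hypothetical `WordIndices`-style adapter: keeps only the segments that
/// contain at least one alphanumeric character, preserving byte offsets.
fn word_indices<'a, I>(bounds: I) -> impl Iterator<Item = (usize, &'a str)>
where
    I: Iterator<Item = (usize, &'a str)>,
{
    bounds.filter(|(_, seg)| seg.chars().any(char::is_alphanumeric))
}
```

Calling `word_indices(naive_bound_indices(text).into_iter())` yields only the word-like segments together with their starting byte offsets, which is the shape a tokeniser would want. The same adapter would apply unchanged to any boundary iterator with the same item type.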