unicode-rs / unicode-segmentation

Grapheme Cluster and Word boundaries according to UAX#29 rules
https://unicode-rs.github.io/unicode-segmentation
Other
558 stars 57 forks source link

Consider an alternative method for UnicodeSegmentation::graphemes #22

Open nox opened 7 years ago

nox commented 7 years ago

The boolean argument is quite useless when reading the code, because foo.graphemes(true) doesn't state what that true thing is for.

tapeinosyne commented 7 years ago

It would be reasonable to make this a breaking change while the number of direct dependants is acceptably low. (Presently, there are 22 publicly listed on crates.io). I feel that Boolean blindness would be unflattering in a stable API.

mbrubeck commented 7 years ago

What should the new API look like? A few options:

  1. Replace the boolean with an enum, so the API becomes graphemes(Extended) or graphemes(Legacy).

  2. Deprecate graphemes, and introduce new methods extended_graphemes() and legacy_graphemes().

  3. Change graphemes() to always return extended grapheme clusters, and add a new method legacy_graphemes().

I don't like (1) because the calls become very long, it requires importing more names, and like the current API it doesn't offer much guidance. I don't like (2) for most of the same reasons.

I like (3) because it makes the common case shorter, guiding people to the method that is most often correct. The downside is that it does not allow a deprecation period, but I don't think this is a big problem in this case.

tapeinosyne commented 7 years ago
  1. Change graphemes() to always return extended grapheme clusters, and add a new method legacy_graphemes().

This would be my choice as well. I don't personally mind importing enums, but extended grapheme clustering is clearly recommended[1] by the relevant Unicode annex and it seems sensible to offer it as an implicit default.

[1] Per the Unicode Standard Annex #29:

the extended grapheme cluster boundaries are recommended for general processing, while the legacy grapheme cluster boundaries are maintained primarily for backwards compatibility with earlier versions of this specification

nox commented 7 years ago

+1 for 3/

tapeinosyne commented 7 years ago

(Note that, as of #23, this issue also applies to the newly added cursor API.)

raindev commented 6 years ago

An alternative option to introduce a more ergonomic API without breaking backwards compatibility is to use functions not grouped into a trait but accessible from the module directly. UnicodeSegmentation trait can then be deprecated and removed eventually. The trait is implemented for str so it won't be possible to call the methods directly on a string (i.e. s.graphemes() vs graphemes(s)) but that's pretty minor annoyance. A small additional benefit is that import will be less stuttering: use unicode_segmentation::graphemes instead of use unicode_segmentation::UnicodeSegmentation.

raindev commented 6 years ago

I'll be happy to submit a patch too. Or maybe @xfix is interested in updating #31. Would be good to move it forward.

upsuper commented 5 years ago

I would also support option 3. It would be a breaking change but I guess that's fine. We can probably release a 1.3.1 which translates the new API into the old, so that we can move forward gradually.

timClicks commented 3 years ago

Another +1 for option 3 here.