Open ogregoire opened 3 years ago
@ogregoire Out of curiosity, do you use code points or grapheme clusters more?
For context, my understanding is that a grapheme cluster can represent any user-perceived character, whereas code points can only represent some. For example, this blog post explains that the "flag emoji “🇺🇸” is [...] made of two code points, 🇺 + 🇸". (Swift's Characters are extended grapheme clusters, apparently.)
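To make the distinction concrete, here is a minimal Java sketch that counts UTF-16 chars, code points, and boundaries reported by the JDK's own java.text.BreakIterator. The class name TextUnits and its method names are made up for illustration. How the JDK iterator groups a flag emoji depends on the JDK version, so the sketch makes no claim about that result.

```java
import java.text.BreakIterator;

// Hypothetical helper, for illustration only.
final class TextUnits {

  // Number of Unicode code points in the string.
  static int codePoints(String s) {
    return s.codePointCount(0, s.length());
  }

  // Number of "character" segments reported by java.text.BreakIterator.
  // Whether these match extended grapheme clusters depends on the JDK version.
  static int characterBreaks(String s) {
    BreakIterator it = BreakIterator.getCharacterInstance();
    it.setText(s);
    int count = 0;
    while (it.next() != BreakIterator.DONE) {
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    String flag = "\uD83C\uDDFA\uD83C\uDDF8"; // U+1F1FA U+1F1F8, the US flag
    System.out.println("chars:       " + flag.length());
    System.out.println("code points: " + codePoints(flag));
    System.out.println("char breaks: " + characterBreaks(flag));
  }
}
```

The flag occupies four chars and two code points, even though a reader perceives it as a single character.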
The reason I say all this is because I'm wondering if introducing an API that works with grapheme clusters rather than code points would encourage people to do the "right thing" when they're processing characters.
What do you think?
@jbduncan I had to look up what a grapheme cluster is, but actually, that's something I'm kind of reimplementing. I'm mostly parsing Unicode to tokenize sentences into a sequence of words, ideograms, emojis, etc., but I guess my implementations would be much easier with grapheme clusters.
Cool! You can use ICU4J's BreakIterator to get the words and grapheme clusters in your strings.
(I don't really understand why those methods have different versions that either do or don't accept a locale. By comparison, Rust's unicode-segmentation crate, which does the same thing as BreakIterator, doesn't accept the Rust equivalent of a locale.)
However, BreakIterator is a bit hard to use, so you may have a better time if you either store the parsed characters/words in List<String>s after using it, or wrap usages of BreakIterator in a custom lazy Iterable<String>.
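A lazy Iterable<String> wrapper along those lines might look like the sketch below. To stay self-contained it uses the JDK's java.text.BreakIterator rather than ICU4J's class, on the assumption that the first()/next()/DONE protocol is the same for this purpose. The class name Segments and the Supplier parameter are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

// Hypothetical wrapper, for illustration only.
final class Segments {

  // Returns a lazy view of the text's segments. Each call to iterator()
  // obtains a fresh BreakIterator from the factory, so the view can be
  // iterated more than once.
  static Iterable<String> of(String text, Supplier<BreakIterator> factory) {
    return () -> {
      BreakIterator it = factory.get();
      it.setText(text);
      return new Iterator<String>() {
        private int start = it.first();
        private int end = it.next();

        @Override
        public boolean hasNext() {
          return end != BreakIterator.DONE;
        }

        @Override
        public String next() {
          if (!hasNext()) {
            throw new NoSuchElementException();
          }
          String segment = text.substring(start, end);
          start = end;
          end = it.next(); // advance only when the caller asks for more
          return segment;
        }
      };
    };
  }

  public static void main(String[] args) {
    for (String word : of("hello world", BreakIterator::getWordInstance)) {
      System.out.println("[" + word + "]");
    }
  }
}
```

Note that a word iterator also reports the whitespace between words as its own segment, so callers that only want words need to filter those out.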
@jbduncan Thanks! This is indeed useful. I'll try to figure out how to get a BreakIterator for a Reader without putting everything into memory. Making an Iterator around BreakIterator seems easy in comparison.
Should this ticket be closed or is there anything salvageable for Guava?
As a general rule, we have tried to stay away from internationalization:
That said, we do occasionally try to provide a baseline level of not being gratuitously incompatible with internationalization, like how Strings.commonPrefix won't break in the middle of a surrogate pair. But that's hardly true internationalization, and I wonder sometimes if it's actually worse for us to have "partial" Unicode handling than to have none at all.
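For readers unfamiliar with the surrogate-pair problem, the sketch below shows one way a common-prefix routine can avoid cutting a pair in half. It is an independent illustration of the idea, not Guava's actual implementation, and the class name SafePrefix is made up.

```java
// Hypothetical helper, for illustration only; not Guava's code.
final class SafePrefix {

  // Longest common prefix of a and b that does not end between the two
  // halves of a surrogate pair.
  static String commonPrefix(CharSequence a, CharSequence b) {
    int max = Math.min(a.length(), b.length());
    int p = 0;
    while (p < max && a.charAt(p) == b.charAt(p)) {
      p++;
    }
    // If cutting at p would separate a high surrogate from its low
    // surrogate in either input, back up by one char.
    if (splitsPair(a, p) || splitsPair(b, p)) {
      p--;
    }
    return a.subSequence(0, p).toString();
  }

  // True if index falls between a high surrogate and a low surrogate.
  private static boolean splitsPair(CharSequence s, int index) {
    return index > 0
        && index < s.length()
        && Character.isHighSurrogate(s.charAt(index - 1))
        && Character.isLowSurrogate(s.charAt(index));
  }
}
```

For example, U+1F600 and U+1F601 share the high surrogate \uD83D, so a naive char-by-char prefix of "a\uD83D\uDE00" and "a\uD83D\uDE01" would end in a dangling half character. The version above returns just "a".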
CodepointStream probably falls into that "partial" bucket. Probably it's a little better than commonPrefix, though, since it at least lets users operate on the code points however they want, rather than directly splitting them up in a potentially incorrect way. I don't think it will be a priority for us, but it doesn't immediately strike me as so obviously out of scope that I feel obligated to close the issue entirely :)
(FWIW we do have an internal API that uses BreakIterator to present an Iterable<String> view of an input text, broken by characters, lines, sentences, or words. I suspect that there are many cases in which such convenience wrappers around ICU4J could be helpful. (In fact, I can think of some wrappers for date-time formatting that we have, too.) That much is out of scope for Guava, but ideally such a thing would exist somewhere.)
Will pick this one if you don't mind :)
Code points are increasingly what we need to work with, rather than chars. Working with them is hard because the Java API wasn't really designed with them in mind, and we're left with a handful of methods that aren't even in the same place.
Something that would be nice as a starting point is a kind of CodepointStream, and then expand on that. I'm not talking about String::codePoints, but rather about a new kind of Reader.
And maybe later extend this with tool objects like CodepointSource/Sink?
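As a rough idea of what such a Reader-like type could look like, here is a minimal sketch that wraps a java.io.Reader and returns one code point per call. The class name CodePointReader, its read() contract, and the readAll helper are all assumptions for illustration. They are not an existing Guava or JDK API and not necessarily the shape proposed here.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.stream.IntStream;

// Hypothetical type, for illustration only.
final class CodePointReader implements Closeable {
  private final Reader in;
  private int pending = -1; // a char read ahead but not yet returned

  CodePointReader(Reader in) {
    this.in = in;
  }

  // Returns the next code point, or -1 at end of input.
  // An unpaired surrogate is returned as-is rather than rejected.
  int read() throws IOException {
    int first = (pending != -1) ? pending : in.read();
    pending = -1;
    if (first == -1 || !Character.isHighSurrogate((char) first)) {
      return first;
    }
    int second = in.read();
    if (second != -1 && Character.isLowSurrogate((char) second)) {
      return Character.toCodePoint((char) first, (char) second);
    }
    pending = second; // not a low surrogate: keep it for the next call
    return first;
  }

  @Override
  public void close() throws IOException {
    in.close();
  }

  // Convenience for small inputs: drains the reader into an array.
  static int[] readAll(Reader in) {
    try (CodePointReader reader = new CodePointReader(in)) {
      IntStream.Builder out = IntStream.builder();
      for (int cp = reader.read(); cp != -1; cp = reader.read()) {
        out.add(cp);
      }
      return out.build().toArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because read() pulls at most two chars at a time from the underlying Reader, the input never has to be held in memory as a whole String.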