Handle codepoints

ogregoire commented 3 years ago

Codepoints are increasingly what we have to work with, rather than chars. Working with them is hard because the Java API wasn't really designed around them, and we're left with a handful of methods that aren't even in the same place.
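To illustrate the "handful of methods, not even in the same place" point, here is a minimal sketch that walks a String by code point using only the JDK. The class name ScatteredCodePointApi and its helper methods are hypothetical, chosen for illustration; the JDK calls themselves (String.codePointAt, Character.charCount, StringBuilder.appendCodePoint) are real.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class ScatteredCodePointApi {
  /** Collects the code points of s by hand. */
  static List<Integer> codePointsOf(String s) {
    List<Integer> result = new ArrayList<>();
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);     // this method lives on String
      result.add(cp);
      i += Character.charCount(cp);  // this one lives on Character
    }
    return result;
  }

  /** Rebuilds a String from a list of code points. */
  static String fromCodePoints(List<Integer> codePoints) {
    StringBuilder sb = new StringBuilder();
    for (int cp : codePoints) {
      sb.appendCodePoint(cp);        // and this one lives on StringBuilder
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
    System.out.println(s.length());      // 4 chars
    System.out.println(codePointsOf(s)); // 3 code points
  }
}
```

Three classes are involved for a simple round trip, which is the friction being described.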

A nice starting point would be a kind of CodepointStream, which could then be expanded on. I'm not talking about String::codePoints, but rather about a new kind of Reader:

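import static java.util.Objects.requireNonNull;

import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;

abstract class CodepointStream implements Closeable {
  /** Returns the next code point, or -1 at end of stream. */
  abstract int read() throws IOException;
  /** Fills buffer with code points; returns how many were read, or -1 at end of stream. */
  abstract int read(int[] buffer) throws IOException;
}
class ReaderCodepointStream extends CodepointStream {
  private final Reader delegate;
  ReaderCodepointStream(Reader reader) { delegate = requireNonNull(reader); }
  @Override int read() throws IOException {
    int high = delegate.read();
    if (high == -1 || !Character.isHighSurrogate((char) high)) {
      return high; // End of stream, or a char that does not start a surrogate pair.
    }
    int low = delegate.read();
    if (low == -1 || !Character.isLowSurrogate((char) low)) {
      throw new IOException("Invalid surrogate pair");
    }
    return Character.toCodePoint((char) high, (char) low);
  }
  @Override int read(int[] buffer) throws IOException {
    // Implement as efficiently as possible, merging characters when a high/low pair is encountered.
    // Naive placeholder: delegates to read() one code point at a time.
    int count = 0;
    int cp;
    while (count < buffer.length && (cp = read()) != -1) {
      buffer[count++] = cp;
    }
    return count == 0 && buffer.length > 0 ? -1 : count;
  }
  @Override public void close() throws IOException { delegate.close(); }
}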

And maybe later extend this with tool objects like CodepointSource/Sink?

jbduncan commented 3 years ago

@ogregoire Out of curiosity, do you use code points or grapheme clusters more?

For context, my understanding is that a grapheme cluster can represent any user-perceived character, whereas code points can only represent some. For example, this blog post explains that the "flag emoji “🇺🇸” is [...] made of two code points, 🇺 + 🇸". (Swift's Characters are extended grapheme clusters, apparently.)
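The flag example can be checked with plain JDK calls. This is a minimal sketch; the class name FlagCodePoints and its helper are hypothetical. It only counts chars and code points, and does not attempt grapheme cluster segmentation.

```java
import java.util.stream.Collectors;

// Hypothetical demo class, for illustration only.
final class FlagCodePoints {
  // The US flag emoji: REGIONAL INDICATOR SYMBOL LETTER U followed by LETTER S.
  static final String US_FLAG = "\uD83C\uDDFA\uD83C\uDDF8";

  /** Formats each code point of s in U+XXXX notation, separated by spaces. */
  static String describe(String s) {
    return s.codePoints()
        .mapToObj(cp -> String.format("U+%04X", cp))
        .collect(Collectors.joining(" "));
  }

  public static void main(String[] args) {
    System.out.println(US_FLAG.length());                            // 4 chars
    System.out.println(US_FLAG.codePointCount(0, US_FLAG.length())); // 2 code points
    System.out.println(describe(US_FLAG));                           // U+1F1FA U+1F1F8
  }
}
```

One user-perceived character, two code points, four chars: iterating by code point still splits the flag in half.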

The reason I say all this is because I'm wondering if introducing an API that works with grapheme clusters rather than code points would encourage people to do the "right thing" when they're processing characters.

What do you think?

ogregoire commented 3 years ago

@jbduncan I had to look up what a grapheme cluster is, but actually, that's something I'm kind of reimplementing. I'm mostly parsing Unicode to tokenize sentences into a sequence of words, ideograms, emojis, etc., but I guess my implementations would be much easier with grapheme clusters.

jbduncan commented 3 years ago

Cool! You can use ICU4J's BreakIterator to get the words and grapheme clusters in your strings.

(I don't really understand why BreakIterator's factory methods come in versions that either do or don't accept a locale. By comparison, Rust's unicode-segmentation crate, which does the same thing as BreakIterator, doesn't accept the Rust equivalent of a locale.)

However, BreakIterator is a bit hard to use, so you may have a better time if you either store the parsed characters/words in List<String>s after using it, or wrap usages of BreakIterator in a custom lazy Iterable<String>.
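A minimal sketch of the lazy Iterable<String> idea follows. As an assumption to keep it dependency-free, it uses the JDK's java.text.BreakIterator rather than ICU4J's com.ibm.icu.text.BreakIterator; the two expose similar methods (getCharacterInstance, getWordInstance, setText, first, next, DONE), but their segmentation results can differ, especially for emoji. The class name BreakIterables is hypothetical.

```java
import java.text.BreakIterator;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical wrapper, for illustration only.
final class BreakIterables {
  /** Lazily iterates over the user-perceived characters of text. */
  static Iterable<String> characters(String text) {
    return () -> segments(BreakIterator.getCharacterInstance(), text);
  }

  /** Lazily iterates over word-boundary segments (whitespace and punctuation included). */
  static Iterable<String> words(String text) {
    return () -> segments(BreakIterator.getWordInstance(), text);
  }

  private static Iterator<String> segments(BreakIterator breaker, String text) {
    breaker.setText(text);
    return new Iterator<String>() {
      private int start = breaker.first();
      private int end = breaker.next(); // one boundary of lookahead

      @Override public boolean hasNext() {
        return end != BreakIterator.DONE;
      }

      @Override public String next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        String segment = text.substring(start, end);
        start = end;
        end = breaker.next();
        return segment;
      }
    };
  }

  public static void main(String[] args) {
    for (String word : words("Hello world")) {
      System.out.println("[" + word + "]");
    }
  }
}
```

Each call to iterator() creates a fresh BreakIterator, so the Iterable can be traversed more than once. Only one boundary is computed ahead of the caller, but the whole input String is still held in memory.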

ogregoire commented 3 years ago

@jbduncan Thanks! This is indeed useful. I'll try to figure out how to get a BreakIterator for a Reader without putting everything into memory. Making an Iterator around BreakIterator seems easy in comparison. Should this ticket be closed or is there anything salvageable for Guava?

cpovirk commented 3 years ago

As a general rule, we have tried to stay away from internationalization:

That's not because it's unimportant but rather because it's too important to be handled by non-experts such as us.
Any real internationalization requires an ICU4J dependency. And, yes, I know that we're already not dependency-free (and I know this has caused people pain), but we are trying to be deliberate about growing our scope in ways that would require runtime dependencies.

That said, we do occasionally try to provide a baseline level of not being gratuitously incompatible with internationalization, like how Strings.commonPrefix won't break in the middle of a surrogate pair. But that's hardly true internationalization, and I wonder sometimes if it's actually worse for us to have "partial" Unicode handling than to have none at all.
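To make the surrogate-pair point concrete, here is a minimal sketch of a common-prefix helper that backs off by one char when the prefix would end between a high and a low surrogate. This is an illustration of the idea only, not Guava's actual implementation; the class SurrogateAwarePrefix and its methods are hypothetical.

```java
// Hypothetical helper, for illustration only; not Guava's code.
final class SurrogateAwarePrefix {
  /** Char-by-char common prefix; may end on a dangling high surrogate. */
  static String naiveCommonPrefix(String a, String b) {
    return a.substring(0, matchingChars(a, b));
  }

  /** Common prefix that never ends in the middle of a surrogate pair. */
  static String commonPrefix(String a, String b) {
    int p = matchingChars(a, b);
    if (splitsPair(a, p) || splitsPair(b, p)) {
      p--; // drop the orphaned high surrogate
    }
    return a.substring(0, p);
  }

  private static int matchingChars(String a, String b) {
    int max = Math.min(a.length(), b.length());
    int p = 0;
    while (p < max && a.charAt(p) == b.charAt(p)) {
      p++;
    }
    return p;
  }

  /** True if cutting s at index p would separate a high surrogate from its low surrogate. */
  private static boolean splitsPair(String s, int p) {
    return p > 0
        && p < s.length()
        && Character.isHighSurrogate(s.charAt(p - 1))
        && Character.isLowSurrogate(s.charAt(p));
  }

  public static void main(String[] args) {
    String a = "x\uD83D\uDE00"; // 'x' + U+1F600
    String b = "x\uD83D\uDE01"; // 'x' + U+1F601
    System.out.println(naiveCommonPrefix(a, b).length()); // 2: 'x' plus a lone high surrogate
    System.out.println(commonPrefix(a, b).length());      // 1: just 'x'
  }
}
```

This avoids producing malformed UTF-16, but it remains "partial" handling in the sense above: it knows nothing about grapheme clusters, normalization, or locale.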

CodepointStream probably falls into that "partial" bucket. Probably it's a little better than commonPrefix, though, since it at least lets users operate on the code points however they want, rather than directly splitting them up in a potentially incorrect way. I don't think it will be a priority for us, but it doesn't immediately strike me as so obviously out of scope that I feel obligated to close the issue entirely :)

(FWIW we do have an internal API that uses BreakIterator to present an Iterable<String> view of an input text, broken by characters, lines, sentences, or words. I suspect that there are many cases in which such convenience wrappers around ICU4J could be helpful. (In fact, I can think of some wrappers for date-time formatting that we have, too.) That much is out of scope for Guava, but ideally such a thing would exist somewhere.)

falconetpt commented 3 years ago

Will pick this one if you don't mind :)

google / guava

Handle codepoints #5411