google / guava

Google core libraries for Java
Apache License 2.0

Splitter.fixedLength support for codepoints #3488

Open schmohlio opened 5 years ago

schmohlio commented 5 years ago

Currently Splitter.fixedLength splits strings based on chars. It would be nice if the splitter had a configurable encoding, such as UTF-8, for code points.

lowasser commented 5 years ago

Note that Splitter receives Strings, which are always UTF-16, not byte arrays with variable encodings. The option to split on a fixed number of code points instead of chars makes sense, though.
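To make the char versus code point distinction concrete, here is a minimal sketch that uses only the JDK. The class name UnitCounts and its helper methods are hypothetical, chosen for illustration; the emoji is written with \u escapes so the result does not depend on the source file encoding.

```java
final class UnitCounts {
    /** Number of UTF-16 code units (Java chars) in the string. */
    static int utf16Units(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a' is one char; U+1F603 is stored as the surrogate pair D83D DE03.
        String s = "a\uD83D\uDE03";
        System.out.println(utf16Units(s)); // 3
        System.out.println(codePoints(s)); // 2
    }
}
```

A splitter that counts chars sees three units here, while one that counts code points sees two.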

cpovirk commented 5 years ago

Thanks, I've idly wondered if we had more APIs that, like CharMatcher, use the char rather than the code point as a unit.

That said: Even code points, as I understand it, are not necessarily the right unit. Do you have any details about a use case that you can share? That would help us prioritize this. (I should say, though, that fixedLength is already a fairly niche utility, so we might not get around to this regardless.)

jbduncan commented 5 years ago

That said: Even code points, as I understand it, are not necessarily the right unit.

I understand from that article that the correct (or as correct as possible) unit is extended grapheme clusters, which I also understand is what Swift uses for its Character data type.

schmohlio commented 5 years ago

True, the encoding probably doesn't matter, since UTF-16 can already represent every code point.

But I think there is a use case for splitting on codepoints:

Splitter.fixedLength(1).split("😃😃") currently returns ['?', '?', '?', '?']

I believe a fluent solution by code point would be useful:

Splitter.fixedLength(1).codePoints().split("😃😃") returning ["😃","😃"]
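For comparison, the requested behavior can be approximated today without Guava. The following is only a sketch under stated assumptions: CodePointSplitter and its fixedLength method are hypothetical names, not part of Guava or the JDK, and the edge cases (for example, an empty input yields an empty list here) are not meant to mirror Splitter.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointSplitter {
    /** Splits input into pieces of at most {@code length} code points each. */
    static List<String> fixedLength(String input, int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < input.length()) {
            int end = start;
            // Advance one code point at a time; a surrogate pair counts once.
            for (int i = 0; i < length && end < input.length(); i++) {
                end += Character.charCount(input.codePointAt(end));
            }
            pieces.add(input.substring(start, end));
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        // Two copies of U+1F603, written as escaped surrogate pairs.
        List<String> pieces = fixedLength("\uD83D\uDE03\uD83D\uDE03", 1);
        System.out.println(pieces.size()); // 2
    }
}
```

Each piece keeps its surrogate pair intact, so no lone surrogates appear in the output.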

jbduncan commented 5 years ago

I believe a fluent solution by code point would be useful:

I'd strongly argue for it to be based on extended grapheme clusters rather than code points, if at all technically feasible. But other than that, I agree. :)
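As a rough illustration of what a grapheme-based variant could look like, the sketch below uses the JDK's java.text.BreakIterator character instance. GraphemeSplitter and its fixedLength method are hypothetical names. How closely BreakIterator follows extended grapheme clusters depends on the JDK version, so treat this as an approximation rather than a conforming implementation.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

final class GraphemeSplitter {
    /** Splits input into pieces of at most {@code length} user-perceived characters. */
    static List<String> fixedLength(String input, int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        BreakIterator boundaries = BreakIterator.getCharacterInstance();
        boundaries.setText(input);
        List<String> pieces = new ArrayList<>();
        int start = boundaries.first();
        while (start < input.length()) {
            int end = start;
            // Step over up to `length` character boundaries reported by the JDK.
            for (int i = 0; i < length && end < input.length(); i++) {
                end = boundaries.next();
            }
            pieces.add(input.substring(start, end));
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        // "e" followed by U+0301 (combining acute accent), then "x".
        List<String> pieces = fixedLength("e\u0301x", 1);
        System.out.println(pieces.size()); // 2
    }
}
```

Here the base letter and its combining accent stay together, whereas splitting by code points would separate them.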

schmohlio commented 5 years ago

Could the fluent API allow you to specify char, code point, or grapheme?

As an aside, I would love to help. I wasn't able to find a contribution README, though.

schmohlio commented 5 years ago

As an aside, I would prioritize code points because they are a first-class citizen in Java, e.g. String::codePointAt(int).

lowasser commented 5 years ago

Are there any contexts in which splitting on code points would be preferable to splitting on grapheme clusters?