Open schmohlio opened 5 years ago
Note that Splitter receives Strings, which are always UTF-16, not byte arrays with variable encodings. The option to split on a fixed number of code points instead of chars makes sense, though.
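To make the char versus code point distinction concrete, here is a minimal sketch using only standard `java.lang` methods. The class name `CharVsCodePoint` and its helper methods are made up for illustration and are not part of Guava.

```java
// Illustrative only: CharVsCodePoint is a hypothetical helper, not a Guava class.
final class CharVsCodePoint {
    /** Number of UTF-16 code units (Java chars) in the string. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String smiley = "\uD83D\uDE03"; // U+1F603, stored as a surrogate pair
        System.out.println(charCount(smiley));      // 2
        System.out.println(codePointCount(smiley)); // 1
    }
}
```

A char-based splitter with length 1 cuts between the two halves of the surrogate pair, which is why each piece is an unpaired surrogate.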
Thanks. I've idly wondered whether we have more APIs that, like CharMatcher, use the char rather than the code point as a unit.
That said: Even code points, as I understand it, are not necessarily the right unit. Do you have any details about a use case that you can share? That would help us prioritize this. (I should say, though, that fixedLength is already a fairly niche utility, so we might not get around to this regardless.)
That said: Even code points, as I understand it, are not necessarily the right unit.
I understand from that article that the correct (or as correct as possible) unit is the extended grapheme cluster, which I also understand is what Swift uses for its Character data type.
True, encoding probably doesn't matter, since UTF-16 can already represent every code point.
But I think there is a use case for splitting on codepoints:
Splitter.fixedLength(1).split("😃😃")
currently returns ['?', '?', '?', '?']
I believe a fluent solution by code point would be useful:
Splitter.fixedLength(1).codePoints().split("😃😃")
returning ["😃","😃"]
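Until something like the proposed `codePoints()` option exists, the same behavior can be approximated with plain JDK calls. The following is only a sketch: `CodePointSplitter` and `splitByCodePoints` are hypothetical names, not Guava API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical stand-in for the proposed code point mode.
final class CodePointSplitter {
    /** Splits input into pieces of at most `length` code points each. */
    static List<String> splitByCodePoints(String input, int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (start < input.length()) {
            int end = start;
            int count = 0;
            // Advance one whole code point at a time (1 or 2 chars each).
            while (end < input.length() && count < length) {
                end += Character.charCount(input.codePointAt(end));
                count++;
            }
            parts.add(input.substring(start, end));
            start = end;
        }
        return parts;
    }
}
```

With this helper, `splitByCodePoints("😃😃", 1)` yields two strings, each holding one complete surrogate pair. An unpaired surrogate in the input is treated as a code point of its own.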
I believe a fluent solution by code point would be useful:
I'd strongly argue for it to be based on extended grapheme clusters rather than code points, if at all technically feasible. But other than that, I agree. :)
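For comparison, a grapheme-based variant can be sketched with the JDK's `java.text.BreakIterator.getCharacterInstance()`. The names `GraphemeSplitter` and `splitByGraphemes` are hypothetical. How closely the iterator follows extended grapheme cluster rules depends on the JDK version: older releases may still break inside emoji ZWJ sequences or flag pairs, so treat this as an approximation rather than a conforming implementation.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical grapheme-based fixed-length splitter.
final class GraphemeSplitter {
    /** Splits input into pieces of at most `length` user-perceived characters. */
    static List<String> splitByGraphemes(String input, int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        List<String> parts = new ArrayList<>();
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(input);
        int start = it.first();
        int count = 0;
        // Each boundary returned by next() ends one character as the iterator sees it.
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            count++;
            if (count == length) {
                parts.add(input.substring(start, end));
                start = end;
                count = 0;
            }
        }
        // Emit any trailing piece shorter than `length`.
        if (start < input.length()) {
            parts.add(input.substring(start));
        }
        return parts;
    }
}
```

For example, "e" followed by a combining acute accent (U+0301) is two code points but stays in one piece here, whereas a code point splitter with length 1 would separate the accent from its base letter.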
Could the fluent API allow you to specify char, codepoint, or grapheme?
As an aside, I would love to help. I wasn't able to find a contribution README, though.
As an aside, I would prioritize code points because they are first-class citizens in Java, e.g. String::codePointAt(int).
Are there any contexts in which splitting on code points would be preferable to splitting on grapheme clusters?
Currently Splitter.fixedLength splits strings based on chars. It would be nice if the splitter had a configurable encoding, such as UTF-8, for code points.