Open miracle2k opened 10 years ago
My thoughts exactly. This really requires more of a Unicode library. Or at least the ranges of combining codepoints (though that isn't quite sufficient).
Is length handling unicode correctly?
That depends on your definition of correctly. Most libraries I've used count the number of codepoints, not the number of graphemes, which is what jq's length builtin does. I would say this is correct behavior.
My previous comment is misleading. I meant to say that jq, like the other libraries I've used (including the Objective-C and Java standard libraries), counts codepoints. So "\u0061\u0304" has a length of 2, while "\u0101" has a length of 1, even though both render as a single grapheme (looks like this: ā).
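For anyone who wants to check those counts outside jq, here is a minimal Python sketch. It assumes only that Python's len() on a str counts codepoints, which matches the behavior described above; the variable names are purely illustrative.

```python
# Two spellings of the same visible character:
# base letter "a" + U+0304 COMBINING MACRON, and the precomposed U+0101.
decomposed = "\u0061\u0304"
precomposed = "\u0101"

# len() on a Python str counts codepoints, not graphemes.
decomposed_len = len(decomposed)
precomposed_len = len(precomposed)

# The strings render alike but differ codepoint for codepoint.
same_codepoints = decomposed == precomposed

print(decomposed_len, precomposed_len, same_codepoints)
```

Both strings display as a single ā, yet a plain equality test treats them as different strings.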
Sure. I was trying to make the point that if length doesn't handle graphemes, like most programming languages, it might be ok if reverse doesn't either (also like most programming languages).
@wtlangford @miracle2k Most of the time counting codepoints is what you want, and anyways, it's the next cheapest operation after counting bytes. Counting characters is hard enough, and counting graphemes (if you include support for grapheme clusters) is more expensive still.
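To make the three levels of counting concrete, here is a rough Python illustration (not jq). The "character" count below is only an approximation that skips codepoints with a nonzero canonical combining class; it is an assumption for the example, not a full grapheme cluster algorithm.

```python
import unicodedata

s = "\u0061\u0304"  # "a" followed by U+0304 COMBINING MACRON

# Cheapest: number of bytes in the UTF-8 encoding.
byte_count = len(s.encode("utf-8"))

# Next cheapest: number of codepoints.
codepoint_count = len(s)

# Harder: count only codepoints that start a new character, i.e. skip
# combining marks (approximation based on the canonical combining class).
character_count = sum(1 for ch in s if unicodedata.combining(ch) == 0)

print(byte_count, codepoint_count, character_count)
```

The same string gives three different answers depending on which unit is counted.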
For string reversal you really want to distinguish characters, not codepoints. IMO anyways.
In the interim you can always do this:
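# Keep a handle on the builtin reverse before shadowing it.
def reverse_orig: reverse;
# Strings are reversed codepoint by codepoint; the inner reverse sees an array and falls through to reverse_orig.
def reverse: if type == "string" then explode | reverse | implode else reverse_orig end;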
and now you can reverse either strings or arrays without further ado. (And since we try to preserve object key order, we could even "reverse" objects, but let's not :)
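For comparison, the same codepoint-by-codepoint approach can be sketched in Python (an illustration only, not jq; reverse_codepoints is a made-up name). It also shows where the approach falls short: a combining mark ends up attached to a different base letter.

```python
def reverse_codepoints(s: str) -> str:
    """Reverse a string codepoint by codepoint."""
    codepoints = [ord(ch) for ch in s]             # like explode
    codepoints.reverse()                           # reverse the array
    return "".join(chr(cp) for cp in codepoints)   # like implode

plain = reverse_codepoints("abc")

# "a" + combining macron + "b": after reversal the macron follows "b".
misplaced = reverse_codepoints("a\u0304b")

print(plain, [hex(ord(ch)) for ch in misplaced])
```

For plain ASCII input the result is what you would expect; for the second input the macron migrates from the a to the b.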
This approach lets us off the hook for now.
In the longer term we might have a function that knows the combining codepoint ranges and deals with characters.
In the longer longer term we might want a bit of a Unicode library: for normalization, normalization-insensitive string comparison, grapheme cluster detection, grapheme counting, and so on. I'd rather not think about it for now :)
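As a rough idea of what normalization-insensitive comparison means, here is a small Python sketch using the standard unicodedata module (an illustration, not a proposal for jq's implementation; the function name is hypothetical and NFC is just one possible choice of normalization form).

```python
import unicodedata

def equal_ignoring_normalization(a: str, b: str) -> bool:
    """Compare two strings after bringing both to NFC (composed) form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

decomposed = "\u0061\u0304"   # "a" + combining macron
precomposed = "\u0101"        # precomposed "a with macron"

raw_equal = decomposed == precomposed
normalized_equal = equal_ignoring_normalization(decomposed, precomposed)

print(raw_equal, normalized_equal)
```

The raw comparison sees two different codepoint sequences, while the normalized comparison treats them as the same text.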
Why does anyone ever want to reverse a string? Reversing a list, sure. Programming assignment to implement string reversing, sure. But in an actual program? It's not even a well-defined operation on a general (unicode) string.
If for some reason someone does want to reverse a string codepoint by codepoint, then converting to a list of codepoints, reversing that, and converting back doesn't seem like too much work.
Interestingly, I believe this is how most standard libraries do it anyways. Some of them have ways to make sure you're reversing composed character sequences properly (Objective-C's Foundation gives you substrings that represent each composed character sequence). But most just assume you know what you're getting into when you start reversing strings.
The more I think about this, the more I like just providing conversion to and from a list of codepoints and list reversal. Programmers who reverse strings should acknowledge that they're doing something horrible by converting to a list of codepoints, rather than calling a library function that hides their sins :)
Yeah. Here there be demons.
@stedolan We already have those converters: explode and implode.
I'm thinking that it should be possible to write jq-coded functions to do this correctly by grouping codepoints that make up characters. It would require checking for all combining codepoint ranges, but that's not so bad: it would be a filter on the output of explode. I think a lot of advanced Unicode support, if we want it, could be jq-coded. (For the new import module facility, I've been thinking it'd be nice to have an import library data option, so that large Unicode tables could be stored as JSON instead of in .jq files.) I'd be much happier with that than with a dependency on some C Unicode library...
I like this, though implementation will be difficult because of Unicode's combining character sequences. Something like "\u0061\u0304" (latin small letter a + combining macron) is technically two codepoints, but to properly reverse the string, those two need to stay in that order in the reversal, while "\u0101" (latin small letter a with macron) looks identical and would be simple to reverse. Anyone have any thoughts on this?
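One way to sketch the "keep combining marks with their base" idea, written in Python rather than jq so it can be run as-is. It leans on unicodedata.combining from the standard library instead of hand-maintained ranges, and the function names are hypothetical. It only handles codepoints with a nonzero canonical combining class, so it is not a full grapheme cluster implementation.

```python
import unicodedata
from typing import List

def split_characters(s: str) -> List[str]:
    """Group each base codepoint with the combining marks that follow it."""
    chunks: List[str] = []
    for ch in s:
        if chunks and unicodedata.combining(ch) != 0:
            chunks[-1] += ch      # attach the mark to the preceding base
        else:
            chunks.append(ch)     # start a new character
    return chunks

def reverse_characters(s: str) -> str:
    """Reverse the order of characters, keeping each group intact."""
    return "".join(reversed(split_characters(s)))

groups = split_characters("a\u0304b")
result = reverse_characters("a\u0304b")

print(groups, [hex(ord(ch)) for ch in result])
```

Here the macron stays after the a when the string is reversed, which is the property a naive codepoint reversal loses.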