jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/

Allow reverse to apply to a string #412

Open miracle2k opened 10 years ago

wtlangford commented 10 years ago

I like this, though implementation will be difficult because of Unicode's combining character sequences. Something like "\u0061\u0304" (Latin small letter a + combining macron) is technically two codepoints, and to reverse the string properly those two need to stay in that order, while "\u0101" (Latin small letter a with macron) looks identical and would be simple to reverse. Does anyone have any thoughts on this?
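
To make the problem concrete, here is a small sketch in Python rather than jq (Python strings are also sequences of codepoints, so the behaviour is analogous). The variable names are illustrative only. It shows a naive codepoint-by-codepoint reversal moving the combining macron onto the wrong base letter:

```python
# Two spellings of the same visible character:
decomposed = "\u0061\u0304"   # "a" followed by a combining macron
precomposed = "\u0101"        # single precomposed codepoint

word = "b" + decomposed + "c"   # four codepoints: b, a, U+0304, c
naive = word[::-1]              # reverse codepoint by codepoint

# The macron now follows "c" instead of "a", so it attaches to the wrong letter.
print([hex(ord(ch)) for ch in naive])   # ['0x63', '0x304', '0x61', '0x62']

# The precomposed spelling has no such problem.
safe = ("b" + precomposed + "c")[::-1]
print([hex(ord(ch)) for ch in safe])    # ['0x63', '0x101', '0x62']
```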

nicowilliams commented 10 years ago

My thoughts exactly. This really requires more of a Unicode library. Or at least the ranges of combining codepoints (though that isn't quite sufficient).

miracle2k commented 10 years ago

Is length handling unicode correctly?

wtlangford commented 10 years ago

That depends on your definition of correctly. Most libraries I've used count the number of codepoints, not the number of graphemes, which is what jq's length builtin does. I would say this is correct behavior.

wtlangford commented 10 years ago

My previous comment is misleading. I meant to say that jq, like the other libraries I've used (including the Objective-C and Java standard libraries), counts codepoints. So "\u0061\u0304" has a length of 2, while "\u0101" has a length of 1, even though both render as a single grapheme (looks like this: ā).
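
The same counts can be reproduced outside jq. The following is only a Python illustration (Python's `len` on a `str` also counts codepoints), with the UTF-8 byte counts added for comparison:

```python
decomposed = "\u0061\u0304"   # "a" + combining macron
precomposed = "\u0101"        # precomposed "a with macron"

# Codepoint counts: what a codepoint-based length reports.
print(len(decomposed))    # 2
print(len(precomposed))   # 1

# UTF-8 byte counts differ again from the codepoint counts.
print(len(decomposed.encode("utf-8")))    # 3
print(len(precomposed.encode("utf-8")))   # 2
```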

miracle2k commented 10 years ago

Sure. I was trying to make the point that if length doesn't handle graphemes, like most programming languages, it might be ok if reverse doesn't either (also like most programming languages).

nicowilliams commented 10 years ago

@wtlangford @miracle2k Most of the time counting codepoints is what you want, and anyways, it's the next cheapest operation after counting bytes. Counting characters is hard enough, and counting graphemes (if you include support for grapheme clusters) is more expensive still.

For string reversal you really want to distinguish characters, not codepoints. IMO anyways.

In the interim you can always do this:

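# Keep a handle on the builtin before shadowing it.
def reverse_orig: reverse;
# Strings are reversed codepoint by codepoint; everything else uses the builtin.
def reverse: if type == "string" then explode | reverse | implode else reverse_orig end;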

and now you can reverse either strings or arrays without further ado. (And since we try to preserve object key order, we could even "reverse" objects, but let's not :)
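
For readers less familiar with jq, the same idea can be sketched in Python. The helper names `explode` and `implode` are borrowed from the jq builtins; the Python functions themselves are illustrative stand-ins, not part of any library:

```python
def explode(s):
    """Turn a string into a list of codepoint numbers."""
    return [ord(ch) for ch in s]

def implode(codepoints):
    """Turn a list of codepoint numbers back into a string."""
    return "".join(chr(cp) for cp in codepoints)

def reverse(value):
    """Reverse strings via their codepoints; reverse anything else as a sequence."""
    if isinstance(value, str):
        return implode(explode(value)[::-1])
    return value[::-1]

print(reverse("abc"))        # cba
print(reverse([1, 2, 3]))    # [3, 2, 1]
```

Like the jq definition, this reverses codepoints, so combining sequences still come out in the wrong order.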

This approach lets us off the hook for now.

In the longer term we might have a function that knows the combining codepoint ranges and deals with characters.

In the longer longer term we might want a bit of a Unicode library: for normalization, normalization-insensitive string comparison, grapheme cluster detection, grapheme counting, and so on. I'd rather not think about it for now :)
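
As a rough idea of what normalization-insensitive comparison means, here is a minimal Python sketch using the standard `unicodedata` module. The function name `same_text` is made up for this example, and nothing here reflects how jq would implement it:

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

decomposed = "\u0061\u0304"   # "a" + combining macron
precomposed = "\u0101"        # precomposed "a with macron"

print(decomposed == precomposed)           # False: different codepoints
print(same_text(decomposed, precomposed))  # True: equal after normalization
```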

stedolan commented 10 years ago

Why does anyone ever want to reverse a string? Reversing a list, sure. Programming assignment to implement string reversing, sure. But in an actual program? It's not even a well-defined operation on a general (unicode) string.

If for some reason someone does want to reverse a string codepoint by codepoint, then converting to a list of codepoints, reversing that, and converting back doesn't seem like too much work.

wtlangford commented 10 years ago

Interestingly, I believe this is how most standard libraries do it anyways. Some of them have ways to make sure you're reversing composed character sequences properly (Objective-C's Foundation gives you substrings that represent each composed character sequence). But most just assume you know what you're getting into when you start reversing strings.

stedolan commented 10 years ago

The more I think about this, the more I like just providing conversion to and from a list of codepoints and list reversal. Programmers who reverse strings should acknowledge that they're doing something horrible by converting to a list of codepoints, rather than calling a library function that hides their sins :)

wtlangford commented 10 years ago

Yeah. Here there be demons.

nicowilliams commented 10 years ago

@stedolan We already have those converters: explode and implode.

nicowilliams commented 10 years ago

I'm thinking that it should be possible to write jq-coded functions to do this correctly by grouping codepoints that make up characters. It would require checking for all combining codepoint ranges, but that's not so bad. A filter on explode. I think a lot of advanced Unicode support, if we want it, could be jq-coded. (For the new import module facility, I've been thinking it'd be nice to have an import library data option, so that large Unicode tables could be stored as JSON instead of in .jq files.) I'd be much happier with that than with a dependency on some C Unicode library...
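
A minimal sketch of the grouping idea, written in Python instead of jq so that it can lean on the standard `unicodedata` tables. The function names are hypothetical. It assumes that any codepoint with a non-zero canonical combining class belongs with the preceding base character, which covers simple cases like the macron above but is not full grapheme cluster detection (some marks have class 0, and joiner sequences are not handled):

```python
import unicodedata

def group_characters(s):
    """Split a string into base characters plus their trailing combining marks."""
    groups = []
    for ch in s:
        # Non-zero combining class: attach to the previous group, if there is one.
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups

def reverse_characters(s):
    """Reverse the order of the groups while keeping each group intact."""
    return "".join(reversed(group_characters(s)))

word = "ba\u0304c"                      # b, a + combining macron, c
print(group_characters(word) == ["b", "a\u0304", "c"])   # True
print(reverse_characters(word) == "ca\u0304b")           # True
```

A jq version would presumably follow the same shape: explode, fold codepoints into groups using a table of combining ranges, reverse the groups, and implode.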