jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/

Converting non-decimal numerical strings to numbers; where do you want it? #1632

Open JesseTG opened 6 years ago

JesseTG commented 6 years ago

I'd like to be able to convert numeric strings in bases besides 10 (e.g. "0xdeadbeef" or "0755") to numbers. I don't need to convert literals, just string values. There's no good way to do this in jq right now, so I'm going to write a built-in soon.

But where would you want it? I could either modify the tonumber builtin or write a new one that wraps strtoul. Which would you prefer?

pkoppstein commented 6 years ago

Since jq supports multi-arity functions, you would probably want to add functions with arities that are not already defined, e.g. tonumber(base).

See also https://rosettacode.org/wiki/Non-decimal_radices/Convert#jq

JesseTG commented 6 years ago

Okay, so I'll write a new built-in but make it look like tonumber. Is that okay?

pkoppstein commented 6 years ago

Is that okay?

First, please understand that it is not for me to decide what is added to jq. Second, please note that there is a queue of worthy Pull Requests that have not yet made it into jq, so if you are proposing a new Pull Request, please be aware that it will probably join the queue. Third, I believe it would probably be worth your while to specify more precisely what you propose.

JesseTG commented 6 years ago

First, please understand that it is not for me to decide what is added to jq.

Right, excuse me.

Second, please note that there is a queue of worthy Pull Requests that have not yet made it into jq, so if you are proposing a new Pull Request, please be aware that it will probably join the queue.

That's fine.

Third, I believe it would probably be worth your while to specify more precisely what you propose.

A new filter (or a new variety of tonumber) that behaves like this:

$ echo \"0x40\" | jq tonumber
64
$ echo \"0755\" | jq tonumber
493
$ echo \"40\" | jq 'tonumber(16)'
64
$ echo \"40\" | jq 'tonumber(7)' # base 7
28

My main use case for this would be converting string representations of colors to integers. I technically only need hex -> decimal conversions, but given that this would basically be a wrapper around strtol you'd be getting the other bases for free.
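For illustration only, here is a minimal Python sketch of the semantics proposed above. It is not jq code and not the eventual C builtin; `to_number` is a hypothetical name, and the prefix auto-detection is an assumption modeled on what strtol does with base 0.

```python
def to_number(text, base=None):
    """Hypothetical stand-in for the proposed tonumber / tonumber(base)."""
    s = text.strip().lower()
    sign = 1
    if s[:1] in ("+", "-"):
        sign = -1 if s[0] == "-" else 1
        s = s[1:]
    if base is None:
        # Assumed strtol-style auto-detection: 0x means hex, a leading 0 means octal.
        if s.startswith("0x"):
            base, s = 16, s[2:]
        elif len(s) > 1 and s.startswith("0"):
            base, s = 8, s[1:]
        else:
            base = 10
    elif base == 16 and s.startswith("0x"):
        # With an explicit base of 16, tolerate an optional 0x prefix.
        s = s[2:]
    # int() raises ValueError for digits that are invalid in the given base.
    return sign * int(s, base)


print(to_number("0x40"))    # hex via prefix
print(to_number("0755"))    # octal via leading zero
print(to_number("40", 16))  # explicit base
print(to_number("40", 7))
```

The four calls mirror the four shell examples above.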

For the sake of completeness, I'd also provide the inverse:

$ echo 64 | jq hex
"0x40"
$ echo 493 | jq oct
"0755"
$ echo 64 | jq 'base(16)'
"0x40"
$ echo 28 | jq 'base(7)'
"40"

I would probably implement this as a C built-in that wraps printf, with a set of filters in the standard library that use that builtin.
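As a rough Python sketch of what the inverse would compute (not the printf-based C builtin itself): `to_base` is a hypothetical name, and emitting "0x" for base 16 and "0" for base 8 is an assumption taken from the hex and oct examples above.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base(n, base):
    """Hypothetical stand-in for the proposed base(n) filter; integers only."""
    if not 2 <= base <= 36:
        raise ValueError("base must be between 2 and 36")
    if n == 0:
        return "0"
    sign = "-" if n < 0 else ""
    n = abs(n)
    out = []
    while n:
        # Peel off the least significant digit in the target base.
        n, r = divmod(n, base)
        out.append(DIGITS[r])
    # Assumed prefixes, matching the hex and oct examples in this comment.
    prefix = {16: "0x", 8: "0"}.get(base, "")
    return sign + prefix + "".join(reversed(out))


print(to_base(64, 16))
print(to_base(493, 8))
print(to_base(28, 7))
```

Note that 28 in base 7 is "40", since 4 * 7 + 0 = 28.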

pkoppstein commented 6 years ago

The major problem that I see is that your proposal introduces a backward incompatibility, because currently:

echo '"0755"' | jq tonumber
755

From wikipedia:

In programming languages, octal literals are typically identified with a variety of prefixes, including the digit 0, the letters o or q, the digit–letter combination 0o, or the symbol & or $.

The maintainers are also currently concerned about the absolute number of builtin.jq builtins, for performance reasons. In your initial PR, you may therefore want to keep the number of additional such builtins to a bare minimum.

JesseTG commented 6 years ago

The major problem that I see is that your proposal introduces a backward incompatibility, because currently:

echo '"0755"' | jq tonumber
755

Okay, so I'd leave 0 prefixes alone except on an opt-in basis (maybe if you're explicitly asking for base 8). Man, whoever coined that prefix for octal needs to be smacked.

The maintainers are also currently concerned about the absolute number of builtin.jq builtins, for performance reasons. In your initial PR, you may therefore want to keep the number of additional such builtins to a bare minimum.

That's fine. In fact, it looks like I'm suggesting more than I really am. I'm only suggesting two C builtins (let's call them tonumber/1 and base/1) and three trivial jq builtins that wrap common cases. In fact, here's what the jq builtins would look like:

def hex: base(16);
def oct: base(8);
def bin: base(2);

Maybe add one more of each, depending on how I use sprintf.

wtlangford commented 6 years ago

Man, whoever coined that prefix for octal needs to be smacked.

You are not wrong.

That's fine. In fact, it looks like I'm suggesting more than I really am. I'm only suggesting two C builtins (let's call them tonumber/1 and base/1) and three trivial jq builtins that wrap common cases.

That's... still 5 builtins from the perspective we're worried about. If I were to do this as you've proposed, I'd just implement the tonumber/1 and base/1, and leave it to people to define shortcut builtins if they need them. Linking is currently something like O(n^2), so we like to avoid adding more builtins than are necessary.

Relatedly, I'm not in love with the base/1 name. I think it's a little unclear, but I don't have a recommendation on something better...

On a more general note, I should point out that tonumber/0 takes a string which is interpreted as a JSON-encoded representation of a number, which includes such things as -5e+77 and 5e-77. How would these interact with tonumber(8)?
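To make the ambiguity concrete, here is a small Python sketch (Python only stands in for the two readings; `interpretations` is a hypothetical helper, not part of jq). The same string can be a valid JSON number and a valid digit string in some base, with different values.

```python
import json


def interpretations(text, base):
    """Return (JSON-number reading, base-N integer reading); None where a reading fails."""
    try:
        as_json = json.loads(text)
    except ValueError:
        as_json = None
    try:
        as_int = int(text, base)
    except ValueError:
        as_int = None
    return as_json, as_int


# "5e1" is 50.0 as a JSON number, but 0x5e1 when read as base-16 digits.
print(interpretations("5e1", 16))
# "5e-77" is a valid JSON number, but not a valid base-8 digit string.
print(interpretations("5e-77", 8))
```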

JesseTG commented 6 years ago

Linking is currently something like O(n^2), so we like to avoid adding more builtins than are necessary.

Where n is the number of built-ins written in jq? Yikes. Why is this?

Relatedly, I'm not in love with the base/1 name. I think it's a little unclear, but I don't have a recommendation on something better...

radix/1, maybe?

On a more general note, I should point out that tonumber/0 takes a string which is interpreted as a JSON-encoded representation of a number, which includes such things as -5e+77 and 5e-77. How would these interact with tonumber(8)?

Here are my thoughts.

Special Cases

Typically when you're dealing with numbers in multiple bases, you know in advance which ones you're using. So I think the argument to tonumber would usually be a constant in practice. Given that, I think it's okay to be liberal with special cases. If you really want it, I can add a flag that toggles special cases on a per-call basis.

Representations

It doesn't matter what the representation is, just the value. 1.2e3 and 1200 are both twelve hundred, which is an integer. Any integer between -(2**53) and (2**53) - 1 (inclusive) can be properly represented in a double.
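A quick Python sketch of that check, assuming the number has already been parsed into a double (`as_exact_integer` is a hypothetical name):

```python
def as_exact_integer(x):
    """Return x as an int if it is an integer value a double holds exactly."""
    x = float(x)
    # Above 2**53 in magnitude, consecutive integers are no longer all representable.
    if not x.is_integer() or abs(x) > 2**53:
        raise ValueError("not an exactly representable integer")
    return int(x)


print(as_exact_integer(1.2e3))           # same value as 1200
print(float(2**53) == float(2**53 + 1))  # precision runs out past 2**53
```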

Floats

Non-integers are tricky. The only thing I can really think of is to consider non-integer inputs to base/1 an error, with these exceptions:

Prefixes

araspik commented 4 years ago

This would be very useful! Has there been any work done?

A few suggestions:

  1. Octals can use a 0o prefix, and binary can use 0b. This would match up nicely with the hexadecimal prefix.
  2. Digits after a decimal point can be handled exactly like integer digits, in both directions: 0x1.8 is equal to 1.5. Implementing e<pwr> for all bases raises too many questions to be simple, so it could simply be left unsupported (e.g. What base is <pwr> in? Is it base 10, as some people expect, or the current base? What base does the actual multiplier use, i.e. 10**<pwr> or <base>**<pwr>?)
  3. Note: I (for some reason) have a lot of areas where I need to convert an arbitrary list of hexadecimal (integer) strings to and from decimal base.
  4. radix is not a bad name.
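A minimal Python sketch of suggestion 2, for illustration only (`parse_radix_fraction` is a hypothetical name; exponents are deliberately not supported, and only Python's own optional 0x prefix for base 16 is accepted):

```python
def parse_radix_fraction(text, base):
    """Parse digits like '1.8' in the given base, including a fractional part."""
    sign = -1.0 if text.startswith("-") else 1.0
    whole, _, frac = text.lstrip("+-").partition(".")
    value = float(int(whole, base)) if whole else 0.0
    scale = 1.0
    for ch in frac:
        # Each fractional digit is worth one more negative power of the base.
        scale /= base
        value += int(ch, base) * scale
    return sign * value


print(parse_radix_fraction("0x1.8", 16))
print(parse_radix_fraction("10.4", 8))
```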

BelfordZ commented 4 years ago

Any word on this one? Very excited to be able to convert hex strings to numbers!

Will the implementation strip or ignore a 0x prefix instead of raising an error?

Thanks.