jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/
Other
29.6k stars 1.54k forks source link

Bug using += #1508

Open pkoppstein opened 6 years ago

pkoppstein commented 6 years ago

CORRECT:

jq -c '.VALUE |= .+[length]' <<< '{"DATE":"September","VALUE":[5,17,8,9]}'
{"DATE":"September","VALUE":[5,17,8,9,4]}

INCORRECT:

jq -c '.VALUE += [length]' <<< '{"DATE":"September","VALUE":[5,17,8,9]}'
{"DATE":"September","VALUE":[5,17,8,9,2]}

This behavior can be observed with jq1.4, jq 1.5, and the master version (jq-1.5rc2-245-g7b81a83).

muhmuhten commented 6 years ago

The documentation is inaccurate. The RHS of update-assignment doesn't receive the LHS as input. The actual behaviour is more like [length] as $_ | .VALUE |= .+$_.

It probably makes more sense to update the documentation than break compatibility, since this is arguably the more useful operation: you're more likely to be doing arithmetic on values at the same level than, say, merging a field of an object with the object itself. e.g. jq -nc '{a:3,b:4}|.a+=.b' And the alternatives to achieve this behaviour are all much longer, compared to just two extra characters to achieve the documented behaviour.

nicowilliams commented 6 years ago

Actually, I think this is a bug indeed!

$ jq -c '(.VALUE) += [debug|length]' <<< '{"DATE":"September","VALUE":[5,17,8,9]}'
["DEBUG:",{"DATE":"September","VALUE":[5,17,8,9]}]
$ jq -c '(.VALUE) |= .+[debug|length]' <<< '{"DATE":"September","VALUE":[5,17,8,9]}'
["DEBUG:",[5,17,8,9]]
{"DATE":"September","VALUE":[5,17,8,9,4]}

The problem is that gen_update() doesn't put the val block in the update lambda, but computes it first with the same input (.) as the LHS:

static block gen_update(block object, block val, int optype) {
  block tmp = gen_op_var_fresh(STOREV, "tmp");
  return BLOCK(gen_op_simple(DUP),
               val,
               tmp,
               gen_call("_modify", BLOCK(gen_lambda(object),
                                         gen_lambda(gen_binop(gen_noop(),
                                                              gen_op_bound(LOADV, tmp),
                                                              optype)))));
}

Compare to the |= operator:

Exp "|=" Exp {
  $$ = gen_call("_modify", BLOCK(gen_lambda($1), gen_lambda($3)));
} |

I don't know if we can fix this now. It's always been thus. @stedolan?

emanuele6 commented 12 months ago

I think this should not be considered a bug. I have always used non-|= assignment operators assuming the right hand side is evaluated once using . as input, just like for =. Maybe this is not how it worked in jq 1.4 or earlier, but I don't think this behaviour should be changed now.

nicowilliams commented 12 months ago

Re-thinking this now, lhs += rhs is really taking whatever rhs is and adding it to whatever lhs is, but what should be the input to rhs? Well, "who cares" is a pretty good answer -- it could be anything at all and the += operator should still work as intended, as long as the rhs ignores its input. Typically the rhs will just be a literal, and literal values ignore their input. Thus, whatever that input is today, that is just fine.

I think perhaps it would have been best to pass the current value of the lhs to the rhs so that one could write expressions like .a += if iseven then 0 else 1 end.

On the other hand, we have documented what the input to the rhs:

image

I think most likely we could make the code match the docs and it should more than likely not break any existing jq code because most jq code using += and friends probably use RHS expressions that ignore their input.

OTOH, this has been like this for a long time, and we could just leave it as-is and instead fix the docs to say that the input to the RHS is unspecified, thus letting us wait until whatever release follows 1.7 to change this.

wader commented 12 months ago

For reference https://github.com/jqlang/jq/issues/2407 might similar or dup of this? also https://github.com/jqlang/jq/issues/2410 about //= might be interesting

muhmuhten commented 12 months ago

For what it's worth, I stand by the position that it's more useful to document the current behavior for the arith-assign ops so that e.g. {a:1,b:2}|.a+=.b yields {"a":3,"b":2}, which is an occasionally useful operation that's substantially less intuitive to express otherwise, especially if the lhs is a nontrivial expression. I'm having a lot more trouble coming up with a use for the documented behavior for the arith ops.

For the //= issue, my (not too considered) thought is that the best unifying semantic might be for lhs op= rhs to be the equivalent of . as $_|lhs |= . op ($_|rhs), which if I'm not misthinking something should provide identical results to current behavior for the ops that always evaluate the rhs, while having a more useful behavior for //=.

nicowilliams commented 12 months ago

For reference #2407 might similar or dup of this? also #2410 about //= might be interesting

~Yes, //= in particular is really busted:~

//= would be silly to send the current value to the rhs since that value would be null or the lhs is not even set.

emanuele6 commented 12 months ago

@nicowilliams A common way I use += is .foo += (-1, 0, 1); example in my day 14 solution for advent of code 2022 to generate the possible directions in which the grain of sand can fall (this is the version that draws an animation in the terminal if used as ./foo.jq input | awk -v RS= '{ system("clear") } 1; { system("sleep .1") }')

#!/bin/sh --
# \
exec jq -nRrf "$0" -- "$@"

[
    inputs / " -> " |
    map(. / "," | map(tonumber)) |
    range(0; length - 1) as $i |
    .[$i:$i + 2]
] |

(map(.[][0]) | min) as $minx |
(map(.[][0]) | max) as $maxx |
(map(.[][1]) | max) as $maxy |
[ 500 - $minx, 0 ] as $sandIn |

.[][][0] -= $minx |
reduce .[] as $line ([];
    def xrange: $line | map(.[0]) | range(min; max + 1);
    def yrange: $line | map(.[1]) | range(min; max + 1);
    .[xrange][yrange] = "#"
) |

def gridToString:
    transpose |
    (.[][] | nulls) = "." |
    map(join("")) |
    join("\n");

{ grid: ., $sandIn, units: [] } |
while(.grid[.sandIn[0]][.sandIn[1]] == null;
    .units += [ .sandIn ] |
    def directions($grid):
        select(.[1] <= $maxy) |
        .[1] += 1 |
        .[0] += (0, -1, 1) |
        select(.[0] < 0 or $grid[.[0]][.[1]] == null);
    reduce (.units | keys_unsorted[]) as $k ({ offset: 0, v: . };
        def unit: .v.units[$k + .offset];
        .v.grid as $grid |
        [ unit | first(directions($grid)) ] as $dir |
        if $dir == [] then
            .v.grid[unit[0]][unit[1]] //= "o" |
            del(unit) |
            .offset -= 1
        else
            unit = $dir[0] |
            select($dir[0][0] < 0) |= (
                .v.grid |= [ [] ] + . |
                (.v | .sandIn, .units[])[0] += 1
            )
        end
    ) |
    .v
) |
reduce .units[] as $u (.grid; .[$u[0]][$u[1]] = "%") |
gridToString,
""

Some example inputs (describe the map in which the sand falls):

498,4 -> 498,6 -> 496,6
503,4 -> 502,4 -> 502,9 -> 494,9
480,10 -> 500,10
510,10 -> 520,10  
480,20 -> 490,20
500,20 -> 500,18 -> 520,18

But I have many other more pratical examples in which I use .foo += manyvalues to generate multiple values with a modified .foo. This is way more useful than having it work as .foo |= . + (-1, 0, 1) so as just .foo += -1.

In my scripts, I can also find many instances of me using .foo += [ if something then x else y end ] where something, uses . as input and x, y are either constants or expressions that use . as input, not .foo.

Also, if you need something like .a += if iseven then 0 else 1 end, you should just use (.a | select(iseven | not)) += 1 ;)

nicowilliams commented 12 months ago

Oh, hmm, no, //= is not particularly busted.

nicowilliams commented 12 months ago

@emanuele6 and @muhmuhten have the winning arguments. Let's fix the docs then.

emanuele6 commented 12 months ago

//= would be silly to send the current value to the rhs since that value would be null or the lhs is not even set.`

It could also be false, sadly // checks trurthiness, not non-null :(


Note that all non-|= assignment operators evalute the rhs first, and stop running if rhs is empty, and run multiple times if rhs returns multiple values since rhs is a $ argument for _assign/2.

What used to confuse me a bit is that rhs is evaluated even if lhs doesn't generate any paths, but that kind of make sense to be honest; also I often exploit that behaviour as:

empty = (.bar | map(.baz) | debug)

As a more convenient way to use debug with an expressions instead of using (.bar | map(.baz) | debug | empty), as I mentioned in https://github.com/jqlang/jq/pull/2478#issuecomment-1236186328; so I kind of like that behaviour now.

muhmuhten commented 12 months ago

Oh, yeah, I definitely overlooked |='s (extremely janky, mind—check your expectation of what [range(10)]|.[] |= empty should yield; I feel like that behavior is only predictable by knowing exactly the implementation detail that |= updates by key in forward order) behaviour with non-one rhs output values. Perhaps the documentation should be more along the lines of "lhs op= rhs is equivalent to lhs = lhs op b, except that lhs is evaluated only once(?)`?

Unifying //= to not evaluate in a way that doesn't bork += manyvalues looks something like lhs op= rhs. as $o|reduce path(lhs) as $p (.; setpath($p; getpath($p) op ($o|rhs))), which is probably a new builtin and ... honestly, I'm not sure if that's worth a breaking change.

And yeah, sending empty to the rhs of //= would be spectacularly silly :p

nicowilliams commented 12 months ago

//= would be silly to send the current value to the rhs since that value would be null or the lhs is not even set.`

It could also be false, sadly // checks trurthiness, not non-null :(

I'm as annoyed by that as you.

Note that all non-|= assignment operators evalute the rhs first, and stop running if rhs is empty, and run multiple times if rhs returns multiple values since rhs is a $ argument for _assign/2.

Yes.

What used to confuse me a bit is that rhs is evaluated even if lhs doesn't generate any paths, but that kind of make sense to be honest; also I often exploit that behaviour as:

That's because null | path(.a) matches just so you can set values at paths that don't exist yet.

empty = (.bar | map(.baz) | debug)

I'm failing to make sense of that.

As a more convenient way to use debug with an expressions instead of using (.bar | map(.baz) | debug | empty), as I mentioned in #2478 (comment); so I kind of like that behaviour now.

I often write def debug($msg): ("\($msg): \(.)|debug|empty), .;. That should be a builtin.

emanuele6 commented 12 months ago

If someone is using .[] //= empty to try delete nulls in the input array/object, they should use del(.[] | nulls) or .[] |= values (now that it is fixed). Even if //= worked as .[] |= (. // empty), that would also delete false so it is not good.

emanuele6 commented 12 months ago
empty = (.bar | map(.baz) | debug)

I'm failing to make sense of that.

(.bar | map(.baz) | debug) is always evaluted, and its result is discarded since empty does not generate any path, so as long as you write an expression that always returns one value, you can use empty = (.bar | map(.baz) | debug) | ... to print to debug output only the result of .bar | map(.baz) instead of the entire input. I find that more convenient to write than having to use (.bar | map(.baz) | debug | empty), (...), so I usually use that.

nicowilliams commented 12 months ago

Oh, yeah, I definitely overlooked |='s (extremely janky, mind—check your expectation of what [range(10)]|.[] |= empty should yield; [...]

@itchyny fixed that! It now yields [].

Unifying //= to not evaluate in a way that doesn't bork += manyvalues looks something like lhs op= rhs. as $o|reduce path(lhs) as $p (.; setpath($p; getpath($p) op ($o|rhs))), which is probably a new builtin and ... honestly, I'm not sure if that's worth a breaking change.

I've struggled with this before. Basically I try never to use non-literals as the RHS of op=, and I mostly use |= for everything else.

An interesting change we could make in the future would be to let the RHS see all of the value being "modified", the path where it's modified, and the . of the LHS. To do this we'd have well-known $bindings like $__path__ and $__dot__. We'd have to be very careful to drop the $__dot__ reference before the setpath() in _modify/_assign. The LHS value would be the input to the RHS, which would be a breaking change for op=, but necessary if we want the RHS to see all three things and avoid unnecessary copies.

And yeah, sending empty to the rhs of //= would be spectacularly silly :p

:)

nicowilliams commented 12 months ago
empty = (.bar | map(.baz) | debug)

I'm failing to make sense of that.

(.bar | map(.baz) | debug) is always evaluted, and its result is discarded since empty does not generate any path, so as long as you write an expression that always returns one value, you can use empty = (.bar | map(.baz) | debug) | ... to print to debug output only the result of .bar | map(.baz) instead of the entire input. I find that more convenient to write than having to use (.bar | map(.baz) | debug | empty), (...), so I usually use that.

: ; jq -cn '{bar:[{baz:1,foo:2}]}|(empty = (.bar | map(.baz) | debug))'
["DEBUG:",[1]]
{"bar":[{"baz":1,"foo":2}]}

?

Interesting!

Well, anyways, we should add a nicer version of debug, because I do something very similar (see above).

pkoppstein commented 12 months ago

@nicowilliams wrote:

we should add a nicer version of debug

I've found this debug/1 to be both simple and serviceable:

def debug(msg): (msg|debug) as $debug | .;

The nice thing is that it does not preclude a debug/2 :-)

Any other suggestions based on the existing debug?

nicowilliams commented 12 months ago

I've added wiki pages on the internals of assignments and path expressions.

pkoppstein commented 12 months ago

@emanuele6 wrote:

Maybe this is not how it worked in jq 1.4 or earlier, but I don't think this behaviour should be changed now.

The above remark was not made w.r.t. |= but (after all the bug fixes that have taken place since 1.5), we seem to have reached a stable equilibrium w.r.t. |= too.

Please also consider that there has been a nice alignment between jq, gojq and jaq, in that all three give the same results for:

{a:1,b:2} | .a += .b
{a:1,b:2} | .a += empty
{a:1,b:2} | .a += (1,2)

In short, I think something like the following ought to be in the documentation:

V | (E += F)  is identical to V | . as $V | (E = ($V|E) + ($V|F))

After all, jq is no longer in alpha :-)


p.s. Maybe these too:

V | (E = F)   is identical to V | . as $V | (E = ($V|F))

V | (E |= F)  is identical to V | . as $v | (E = ($V|E|F))
nicowilliams commented 12 months ago

I've found this debug/1 to be both simple and serviceable:

def debug(msg): (msg|debug) as $debug | .;

I believe that will hold on to a reference on the message $debug for as long as whatever is "to the right of" debug/1 is running.

def debug(msg): (msg|debug|empty), .; should not.

But don't quote me on this. I'm a bit rusty.

wader commented 12 months ago

@nicowilliams out of curiosity: i've usually used this def debug(f): . as $c | f | debug | $c; also bad reference wise i guess? i'm still new to how jv references and things are designed

nicowilliams commented 12 months ago

@nicowilliams out of curiosity: i've usually used this def debug(f): . as $c | f | debug | $c; also bad reference wise i guess? i'm still new to how jv references and things are designed

If you have stuff | debug(msg) | other_stuff, the reference to $c will survive until other_stuff runs out of outputs, and also until f (msg) runs out of outputs. That's why I prefer to use (f | debug | empty), ..

nicowilliams commented 12 months ago

I've created #2708 related to this. The idea is that maybe //= and <op>= just have to stay the way they are but we could do something closer to |= with new assignment operators. Maybe that's too ugly -- we can always just use functions considering that mostly that's what actually happens anyways.

ASIDE: Imagine a "pure" jq where there's no syntactic sugars. No if/then/elif-then/else/end, no try/catch, no binary operators, no assignment operators. We could have a cond/3 builtin that takes the place of if, and a try/2 and try/3 builtins that take the place of try/catch, and plain old jq-coded "core" functions like _assign and _modify (but with public names). That might be ugly compared to having special forms like if/try/<assignment-operators>/<binary-operators>, but as with Haskell and Lisp it may prove useful to have such a tiny core.

wader commented 11 months ago

If you have stuff | debug(msg) | other_stuff, the reference to $c will survive until other_stuff runs out of outputs, and also until f (msg) runs out of outputs. That's why I prefer to use (f | debug | empty), ..

Continued curiosity: would some basic static analysis be able to see that $c don't "escape" outside debug making def debug(f): . as $c | f | debug | $c work equally well?

nicowilliams commented 11 months ago

Continued curiosity: would some basic static analysis be able to see that $c don't "escape" outside debug making def debug(f): . as $c | f | debug | $c work equally well?

Yes, that could be done, and it would be amazing. This is why I have a branch with a JSON dump of block representation of programs: to help me reason about static analysis.

nicowilliams commented 11 months ago

(Sorry I edited your comment, @wader, I meant to quote-reply and flubbed up. I've restored your comment.)

wader commented 11 months ago

@nicowilliams no worries!