jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/

Add `while(cond; update)` to construct generic generators #314

Closed nicowilliams closed 10 years ago

nicowilliams commented 10 years ago

jq has generators (e.g., range, .[]), but new ones cannot be defined as defs in jq without resorting to the handful of built-in generator primitives. (Explicit backtracking to resume a suspended generator is accomplished with empty.) This would be OK, but using range and .[] to define generators is problematic: using range(0; pow(2; 50)) to drive repetition will result in that many repetitions -- there's no way to stop range() from producing the full list of requested outputs.

While working on adding I/O primitives to jq I considered while and until primitives, but those only work with non-functional expressions that interact with the outside world (e.g., eof(stdin)); they can't serve to define purely functional jq functions. Therefore I'm abandoning that approach.

Instead I propose a yield EXP (EXP; EXP) expression, where the three expressions correspond to: extractor, init, and update. The init expression would get yield's input and would output a "state" value; the update expression would get the current state value as its input and would output the next state value; and for each state value the extractor expression would be run with that state as its input. The process ends when the update expression backtracks without having produced a new state. If the init expression never outputs a state value then the extractor will not be run at all. The outputs of yield would be the outputs of the extractor expression.
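
To make the intended semantics concrete, here is a rough simulation in today's jq, using plain recursion in place of the proposed syntax (yield_sim is a name invented for illustration):

def yield_sim(extract; init; update):
  def _step: extract, (update | _step);
  init | _step;

For instance, 1 | yield_sim(.; .; if . < 8 then . * 2 else empty end) outputs 1, 2, 4, 8; iteration stops when the update expression backtracks (the empty).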

A yield primitive would allow the use of I/O primitives to interact with the outside world, and it would also allow jq programs to define generators without depending on range() and .[], allowing for early termination (e.g., due to external factors, again, like eof(stdin)).

My current design has a YIELD opcode that takes a variable as an operand. When executed it will save_state() and push a JV_TRUE value onto the stack; on backtrack it will load the given variable and goto backtrack if that variable has a false value stored in it, else it will save_state() again and push a JV_FALSE value onto the stack. I.e., YIELD would be like a conditionally repeating FORK that pushes a boolean on the stack to indicate whether this is the first iteration or not. The idea is that a STOREV of a false value to the same state variable and a conditional jump would immediately follow the YIELD, the jump going to the init or the update expression's bytecode depending on the value at the top of the stack (the one pushed by YIELD); the init and update expressions would each end with a pair of STOREVs saving a) the yield state and b) an indication that they updated the state, followed by c) the extractor expression's bytecode.

nicowilliams commented 10 years ago

I'll implement and rebase my I/O builtins work using yield. This design is much more satisfying than my previous design, and seems much more natural.

nicowilliams commented 10 years ago

I'll actually make it yield(init; update; extract), and I'll make it a bytecoded builtin using a new YIELD opcode that takes a variable name. When first executed, YIELD will store false in that variable, push true on the stack, and continue. On backtrack it will read the variable and backtrack if the variable has a false value; otherwise it will store false in the variable again, push false on the stack, and continue. The bytecoded sequence will follow the YIELD with a conditional that calls the init or the update lambda depending on whether this is the first or a subsequent run, will then STOREV true to the same variable, all followed by a call to the extract lambda.
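
As a sketch, the compiled shape would look roughly like this; YIELD and STOREV are the opcodes discussed above, while the branch mnemonics and labels are purely illustrative, not actual jq opcodes:

YIELD $flag              # 1st run: $flag := false, push true
                         # backtrack: fail if $flag is false, else $flag := false, push false
JUMP_IF_FALSE update     # illustrative branch on the boolean YIELD pushed
  <init lambda>          # first run: compute the initial state from the input
  JUMP have_state
update:
  <update lambda>        # later runs: compute the next state from the current one
have_state:
STOREV $flag (true)      # record that a new state was produced
<extract lambda>         # its outputs are yield's outputs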

nicowilliams commented 10 years ago

And I should note, in case it wasn't obvious, that when YIELD doesn't backtrack, it will always create a backtrack point.

georgir commented 10 years ago

Seems to me that recurse does pretty much what you described, except that it collects all the produced states. But those can be discarded: a final output can be extracted from the result's last element (the final state). EDIT: or from all of the collected states, really.

nicowilliams commented 10 years ago

recurse does no tail-call optimization -- it can't because it can't know when a result from the argument expression puts it in a tail position. TCO is important; having no way to do it means we need an alternative way to construct generators from non-generating expressions.

georgir commented 10 years ago

I'm sorry, but I do not understand... Without having looked at the code, just based on the function's behavior and how I implemented it in my jjq (JavaScript jq) spinoff, I'd say recurse does not actually use recursion but a loop instead: it runs the same filter over and over on each successive result, collecting the results in a queue, not a stack, until the queue is empty.

nicowilliams commented 10 years ago

@georgir The definition of recurse is this:

def recurse(f): ., (f | select(. != null) | recurse(f));
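
For example, this emits 2, 4, 8, 16, 32, stopping at the first null:

2 | recurse(if . < 20 then . * 2 else null end)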
georgir commented 10 years ago

It is so awesome to see it implemented in jq itself... it did not even occur to me, even though it is so trivial. I've updated my js spinoff to use a stack instead of a queue, to match jq's behavior. Pity I can't use jq's implementation literally: it works up to range2(1;14) and gives "too much recursion" for range2(1;15). Puny JavaScript! :p

def range2(from; to): from | recurse(if .+1 < to then .+1 else empty end);

In the meantime, it works fine in jq for quite large values, even though it's an order of magnitude slower than the native range. On my machine range2(1;10000) takes 14 seconds, for example -- not all that bad. Maybe for a more realistic use case with large objects as states its usability will drop faster, but I can't think of a good use case to test with right now.

nicowilliams commented 10 years ago

@georgir There's no way to make range stop producing values, even if you're not interested in any more of them and will emit empty for each remaining one. Without a limit there's no way to scale the use of range up to large numbers of outputs.

nicowilliams commented 10 years ago

@georgir This is why I think we need a generator construction that works by repeated invocation of a state update expression and a value extractor expression -- where those expressions are not [necessarily] themselves generators, or not quite the generators that the user wants.

georgir commented 10 years ago

I just gave the range2 example to show that recurse is still quite useful for defining custom generators -- even without tail-call optimization it manages tens of thousands of stack levels. I'm not suggesting you use it literally, or the native range, to drive your generators.

re: "... by repeated invocation of a state update expression and a value extractor expression ..." if you skip the 'value extract expression' and just emit each processed state, you will in fact replicate recurse in a non-recursive, loop based variant. value extractor expression can always be tacked on by chaining. if you are set on implementing this, you can add different versions that use queue instead of stack or emit the current value after its derived values instead of before, etc to switch from depth-first to breadth-first or to really-really-depth-first, etc. though i guess these alternatives might be implementable in jq itself too

nicowilliams commented 10 years ago

@georgir I completely agree that recursion can be used to define generators now, usefully. But the lack of TCO is a real problem.

(Requiring TCO and providing only recursion for iteration is a fairly common pattern in functional language design. A syntactic transformation of recursion-with-TCO to a procedural-looking loop is always possible without changing the functional nature of the language. Often it's easier to implement such a loop syntax than it is to implement TCO, and either way code is more readable for many people if you provide such loop syntax. That's all this is about.)

nicowilliams commented 10 years ago

I'd like to note that there are two classes of iterator-creating constructs that we need: constructs that build a generator from non-generator expressions (a state update and a value extractor), and constructs that derive a new generator from an existing generator.

limit falls into the latter category, yield into the former.

Both are iterators with functional logic, but lacking tail-call optimization (TCO), recursion alone is not sufficient to express them.

Perhaps we should own up to it and, like Lisps before us, have a looping construct that happens to be functional (and is functionally equivalent to recursion with TCO) but which resembles procedural syntax, something like:

foreach INPUT_EXP as $var (state_init_exp; cond_exp; state_update_exp; extract_exp)

and

for (state_init_exp; cond_exp; state_update_exp; extract_exp)

I like the idea of a "yield" keyword, so maybe the syntax could be:

foreach INPUT_EXP as $var yield EXTRACT_EXP (state_init_exp; cond_exp; state_update_exp)

and

yield EXTRACT_EXP for (state_init_exp; cond_exp; state_update_exp)

I think I like this syntax quite a bit.

A de-sugared limit would then be:

def limit(n; e): foreach e as $in yield $in (n; . > 0; . - 1);

A read builtin that uses a C-coded, non-generator wrapper for the jv_parser would look like:

def read: yield _read for (true; .; eof | not);
georgir commented 10 years ago

Is TCO planned for jq in general? Because recurse is a good candidate, with the current definition at least: recurse tail-calls ",", which tail-calls "|", which tail-calls recurse... The internal implementation can easily be adjusted to make these real tail calls if they are not currently, but the question is whether the VM can support it.

If not, it should be trivial to replace recurse with a loop-based built-in equivalent.

If this is done it may be a good idea to add some sort of total running-time limiter though, as filters like def a: a; a will get promoted from mere errors to real DoS material for applications like jqplay.

georgir commented 10 years ago

Regarding your more general idea of adding generators like read, etc., I'm not sure what to think... The thing that bothers me is that they have side effects, and will cause jq to no longer be purely functional.

For example, how does def f(a): {a1: a, a2: a} work? Does it run a once and cache its results, then make the combinations from that? Or does it run it twice, cache each run's results, and make the combinations from that? Or does it run it 1 + N times if the first run returned N results? With purely functional semantics this does not matter: each of these options gives the same result, and which one is chosen is just an internal implementation detail. It can even be changed between versions to shift the balance between memory consumption and computational load. But if a may have side effects and is not going to return the same outputs for the same inputs, that implementation detail becomes much more important, and no longer transparent to the end user. That's not necessarily a bad thing, but it's something that must be clarified and considered more carefully.
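
For concreteness, under today's purely functional semantics a generator argument yields every combination; (1,2) here is just an illustration:

def f(a): {a1: a, a2: a};
f(1,2)
# emits {"a1":1,"a2":1}, {"a1":1,"a2":2}, {"a1":2,"a2":1}, {"a1":2,"a2":2}
# (one object per combination; the exact order is an implementation detail)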

nicowilliams commented 10 years ago

@georgir Quite clearly recurse(.[]) does not use tail recursion. To do so it would have to maintain a stack of tree nodes in an array, which it clearly does not -- instead it uses the jq stack for that purpose.

The existing iterators/generators in jq have side-effects internally: they have modifiable internal variables that are only indirectly exposed to the jq program. The generalized iterators would be no different. Any iterator that can be mapped to functional recursion is functional. It is easier to implement (and use) such iterators than it is to implement TCO. There is a lot of literature on this subject.

Consider this diff:

diff --git a/execute.c b/execute.c
index 6e0ddc1..a3da68c 100644
--- a/execute.c
+++ b/execute.c
@@ -111,6 +111,7 @@ static struct frame* frame_push(struct jq_state* jq, struct closure callee,
                                 uint16_t* argdef, int nargs) {
   stack_ptr new_frame_idx = stack_push_block(&jq->stk, jq->curr_frame, frame_size(callee.bc));
   struct frame* new_frame = stack_block(&jq->stk, new_frame_idx);
+  fprintf(stderr, "calling jq function with new frame address %d\n", new_frame_idx);
   new_frame->bc = callee.bc;
   new_frame->env = callee.env;
   assert(nargs == new_frame->bc->nclosures);

Now run jq .. on some input like [0,[1,[2,[3]]]] and you'll see that there's no TCO. There can't be for something like .., not without a very complex optimization that turns the non-tail-recursive form into a tail-recursive form that uses an array as a stack.
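
For reference, here is what .. expands that input into (with the patch above you'd also see a frame-address line per recursive call):

$ echo '[0,[1,[2,[3]]]]' | jq -c '..'
[0,[1,[2,[3]]]]
0
[1,[2,[3]]]
1
[2,[3]]
2
[3]
3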

BTW, it's not always easy to determine whether some function call is in "tail position". In C it is: if you see something like return foo(...);, that's a tail position, but return foo(...) + ...; is not!

nicowilliams commented 10 years ago

recurse is defined thusly:

def recurse(f): ., (f | select(. != null) | recurse(f));

That's NOT tail recursive. It looks like it is, but it's not. The reason it's not is backtracking: when recurse is "done" it simply backtracks, and when it does it resumes a "suspended" (Icon terminology; sorry) expression, and this is state that has to be kept somewhere, and that somewhere has to be the stack.

Here's a non-tail recursive factorial in jq:

jq 'def fact: if . <= 1 then . else . * (.-1|fact) end; fact'

What would a tail-recursive form look like? Something like this:

jq 'def fact(r): if . <= 1 then r else . as $dot|.-1|fact(r * $dot) end; def fact: fact(1); fact'
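
To see why it's tail recursive, trace 5 | fact: fact(1) on 5 becomes fact(5) on 4, then fact(20) on 3, fact(60) on 2, and fact(120) on 1, which returns 120. Each recursive call is the last thing its caller does.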

But jq can't tell that fact(r) here is tail recursive without understanding that everything to the left of the recursive call just backtracks when backtracked to. Understanding that would require a fair bit of smarts here. And that's just for the easy case. Making recurse() tail-recursive is not trivial, though it can be done, but I don't see the need: we can expect jq to be able to recurse as deeply into a JSON structure as the structure (that jq parsed) is deep, and we do pay a price in terms of memory footprint by using a non-tail recursive form, but at least that price isn't failure.

Now look at how reduce and range (and .[], and ,) are implemented. They're loops! With state variables. They are procedural loops. Except that they are functional: if jq had TCO then they could be reimplemented in terms of tail recursion. But so what? What would we gain from that? We still need the syntactic sugar that makes these constructs easy to use, because writing tail-recursive iterators is hard for most programmers.

There's no shame in having such syntactic sugar. The purist battle over the LISP loop macro was fought decades ago, and we don't need to re-fight it. If an iterator's semantics are such that a mechanical mapping of it to a pure tail-recursive form is possible, then the iterator is functional, and we can then focus on the syntactic sugar's merits. One of the nice things about jq is how accessible it is; forcing users to write tail-recursive iterators would make jq decidedly not accessible (and besides, we have reduce already, so that Rubicon was crossed long ago).

nicowilliams commented 10 years ago

@georgir I should explain too what TCO means: it means that because there's nothing in the calling function's frame that will be needed when the called function returns, we can and do just reuse the calling function's frame. Thus we save the allocation of a new frame.

What happens in a tail-recursive iterator is that the loop state is passed as arguments to the tail-called function, which then end up being placed... in the same memory locations as they were for the caller. This is exactly the same as having variables which get updated at the top of a procedural loop! The only difference between "procedural" and "pure functional" is whether those variables can be written to by the loop body by any means other than doing a "continue" -- that's it. In a pure functional program you can't change variables, you can only set up new bindings with let or function calls, and TCO is just what you need to make that scale (which is why the Scheme specification requires TCO, whereas LISP, which is less pure, does not).

Almost invariably pure functional languages grow syntactic sugar to save their users the bother of having to think of how to write their iterators in tail-recursive fashion.

I'm not saying that jq wouldn't benefit from having TCO, but that TCO would be much harder for me to implement than first-class functional iterator syntax.

nicowilliams commented 10 years ago

You can also observe all the frames for recursion with jq --debug-trace.

georgir commented 10 years ago

I understand that if f gives multiple results, all of the calls except the last are not tail calls, but the last one can be. And if it only returns one result, it is perfectly tail-call optimizable. That will cover a lot of use cases.

Cases where it returns multiple results can be refactored to return a single more complex result that keeps track of things with its own stack/queue. Much like non-deterministic finite-state automata can be reworked to deterministic ones.


nicowilliams commented 10 years ago

@georgir It's not about whether f produces multiple results. It's about what happens at the call site when f backtracks. If you'd like to contribute TCO, give me a PR. The simplest design for TCO for jq that I can think of involves writing a block analyzer and rewriter. An alternative would be to do the same for the linked and compiled bytecode. A third, much harder alternative would be to rewrite parser.y and compile.c to use an AST and then you can analyze and optimize the AST. A fourth alternative would be to make the execute.c code for CALL_JQ work backwards to see if this is a tail call. Even then I'll still end up implementing sugar for it.

nicowilliams commented 10 years ago

Looking at the disassembly and trace of what I think is a tail-recursive fact (above), I think it is almost correct: if you trace the execution of the tail-recursive form of fact you'll see that the final value is computed by the closure r and that the frames for fact only return, so those frames can be TCOed. However, the closure r has frames that can't be TCOed.

nicowilliams commented 10 years ago

I believe this is tail-recursive:

def _fact: if .[1] <= 1 then . else [.[0] * .[1], .[1] - 1]|_fact end; def fact: [1, .]|_fact|.[0];
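
Tracing 5 | fact, the state array evolves as [1,5] -> [5,4] -> [20,3] -> [60,2] -> [120,1]; at that point .[1] <= 1, [120,1] is returned, and .[0] extracts 120. No pending work remains in any caller.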

In terms of reading the bytecode to determine that it's tail-recursive, what makes it tail-recursive is that the recursive CALL_JQ passes no closures, and that the instruction at the call's return address is a RET -- nothing remains to be done in the caller's frame.

This is an easy pattern to check for, so TCO looks doable. Probably the best place to detect this is in jq_next(), in the handling of CALL_JQ: if nclosures == 0 and the instruction at retaddr is a RET and it's the last instruction for frame_current(jq)->bc (by checking retaddr against frame_current(jq)->bc->codelen), then this is a tail call and we can pop the current frame and push a new one, or otherwise fix up the current frame.

nicowilliams commented 10 years ago

Actually, there's no need to check that the RET is the last instruction of the function.

Also, reusing the current frame is a bit tricky. The process should look like: pop the value from the stack (the called function's input), extract the retaddr from the current frame, frame_pop(), frame_push(), set the new frame's retaddr to be the same as the previous frame's, push the called function's input, and proceed as usual. The pop/push of the called function's input is already done by the existing code.
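
A rough sketch of that sequence in the CALL_JQ handler, assuming the helper names seen elsewhere in execute.c (stack_pop/stack_push, frame_pop/frame_push, frame_current); treat it as illustrative pseudocode, not the committed implementation:

/* in jq_next(), case CALL_JQ, once a tail call has been detected */
jv input = stack_pop(jq);                        /* the called function's input */
uint16_t* caller_retaddr = frame_current(jq)->retaddr;
frame_pop(jq);                                   /* drop the caller's frame     */
struct frame* new_frame = frame_push(jq, callee, argdef, nargs);
new_frame->retaddr = caller_retaddr;             /* return past the caller      */
stack_push(jq, input);                           /* proceed as usual            */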

nicowilliams commented 10 years ago

@georgir Well, I went ahead and implemented (and pushed) TCO. It's not general enough (it doesn't apply to co-/mutual recursion), but it will do for expressing iterators.

nicowilliams commented 10 years ago

@georgir Thanks for pushing me to do it! Turned out to be quite simple. Simpler than I'd thought.

pkoppstein commented 10 years ago

@nicowilliams - Nice job!

nicowilliams commented 10 years ago

Well, so now here's how to define a range/3 that is tail recursive:

def range(init; upto; by):
    def _range:
        if (by > 0 and . < upto) or (by < 0 and . > upto) then ., ((.+by)|_range) else . end;
    if by == 0 then init else init|_range end | select((by > 0 and . < upto) or (by < 0 and . > upto));
. as $dot | range($dot|.[0]; $dot|.[1]; $dot|.[2])
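
For example, feeding it [0,10,3] emits 0, 3, 6, 9, and [5,0,-1] emits 5, 4, 3, 2, 1.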

And here's a for/2 that's almost just like the proposed yield:

def for(cond; update):
    def _for:
        if cond then ., (update | _for) else . end;
    _for;

There's no need for an init argument: . is it. But a for/3 can easily be defined:

def for(init; cond; update):
    def _for:
        if cond then ., (update | _for) else . end;
    init | _for;
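
For example, for(0; . < 5; . + 2) emits 0, 2, 4, and then 6: the else branch also emits the first state that fails the condition.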

With TCO this works efficiently.

We can now produce generic generators. The pattern is to use the comma operator, recursion, and sub-functions (yes! jq has those!).

nicowilliams commented 10 years ago

@pkoppstein @georgir @stedolan Comments as to what the new builtin generator should be called? I'm partial to for(cond; update). I've decided I don't like yield after all.

pkoppstein commented 10 years ago

@nicowilliams - brilliant!

Regarding the naming - it seems to me that if you're going to use the name "for" then it should be the three-argument version. For for/2, have you thought about while(cond; update)?

An alternative that emphasizes "I am a generator" would be "generate", e.g.

def generate(update; while):
  def _for:
    if while then ., (update | _for) else empty end;
  _for;

Example: 0 | generate(. + 1; . < 5)

0 1 2 3 4

nicowilliams commented 10 years ago

@pkoppstein Yeah, I have. If you look at my I/O branches you might find a repeat builtin. Perhaps I could have a bit of syntactic sugar to turn a generate EXP while EXP into generate(EXP; EXP), or some variant. If I were to do that then I'd be back to liking "yield": yield EXP while EXP.

pkoppstein commented 10 years ago

Let me retract generate/2 in favor of generate/3:

def generate(candidate; condition; nextCandidate):
  def _for:
    if condition then ., (nextCandidate | _for) else empty end;
  candidate | _for;
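
Example: generate(0; . < 5; . + 2) emits 0, 2, 4 -- unlike for/3, the first candidate that fails the condition is not emitted.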

This of course is identical to for/3 except for the "empty". What's nice about generate/3 is that (1) every generated value satisfies "condition"; (2) it fits in with repeat(m;n) in that both ignore their input; and (3) the functor is "generate".

(for/3 is currently my second favorite. "yield" doesn't work for me at all as a functor name, and I can't think of any helpful syntactic construct using "yield" as a keyword.)

p.s. If for some reason you really want a generator that takes its initial (candidate) value from its input, then it deserves to be regarded as a "factory", ergo noun, ergo "generator".

nicowilliams commented 10 years ago

@stedolan voiced his opinion as to naming this builtin.