purpleidea / mgmt

Next generation distributed, event-driven, parallel config management!
https://purpleidea.com/tags/mgmtconfig/
GNU General Public License v3.0
3.47k stars 308 forks source link

optimize the graph shape to not create sub-graphs when possible #724

Open gelisam opened 7 months ago

gelisam commented 7 months ago

Working on #717

gelisam commented 7 months ago

Here is a corner case.

$apply1 = func($f, $x) {
  $f($x)
}
$apply2 = func($f, $x) {
  $f($f($x))
}
$apply = if ($now % 2 == 0) then $apply1 else $apply2
$id = func($y) { $y }
test $apply($id, "foo")

Every second, $apply will switch from one constant expression to another. Because $apply changes, the $apply($id, "foo") call needs a dynamic sub-graph. With the current implementation, each of those sub-graphs will in turn use CallFuncs for the $f($x) calls, thus resulting in sub-sub-graphs. Since $id is statically-known, it should theoretically be possible to for each instantiation of the sub-graph to use the optimization, that is, to skip the CallFuncs and avoid the need for sub-sub-graphs. Unfortunately, since Value() uses the scope and does not have Graph()'s sctx argument, it is not possible to determine what $f refers to, and thus Value() must return nil.

Is it worth to add the sctx argument in order to address this case, or is this too small of a corner case?

purpleidea commented 7 months ago

Is it worth to add the sctx argument in order to address this case, or is this too small of a corner case?

I do not know, but let's please definitely add this as a test =D

If for weird examples like this we have a dynamic graph, it is okay. What's the sctx argument??

gelisam commented 7 months ago

Another corner case:

$f1 = func($x) { $x }
$f2 = func($x) { $x + $x }
$id = func($y) { $y }
$f = if $id(true) then $f1 else $f2
test $f("foo")

No higher-order functions this time. In order for $id(true) to be statically-known, we need to:

  1. speculatively obtain a full.FuncValue for $id
  2. speculatively obtain a Value for true
  3. call the full.FuncValue on the input Value in order to obtain an output Value

The hard step is (3), because the type

func (obj *full.FuncValue) Call(txn Txn, args []Func) (Func, error)

is not what we need. We need

func (obj *full.FuncValue) SpeculativeCall(args []Value) (Value, error)

is it worth implementing SpeculativeCall in order to support this use case?

purpleidea commented 7 months ago

is it worth implementing SpeculativeCall in order to support this use case?

This case (or something that looks like it) seems more plausible, but still in the 20%.

I think our goal for step one should be to initially make 80% of common mcl code use the simple graph shape, as long as we can extend our optimizations out as a step two without re-writing all our plumbing for step one.

Let's add this as a test btw =D

gelisam commented 7 months ago

What's the sctx argument??

SetScope's second argument:

func (obj *Expr) SetScope(scope *Scope, sctx map[string]Expr) error

in the mcl-functions branch I named it context, I don't know why you renamed it to sctx. In

$f = func($x) {
  func($y) {
    ...
  }
}

it maps $x and $y to different ExprParams.

gelisam commented 7 months ago

Yet another corner case. Lambdas are always constant, but in

$add = func($x) {
  func($y) {$x + $y}
}
test $add("foo")("bar")

the nested FuncExpr.Value() will not be able to produce a full.FuncValue representing func($y) {$x + $y}, because once again .Value() does not receive an sctx.

Actually, this might not be a corner case at all... The FuncExpr doesn't know whether it is a nested lambda, so it doesn't know whether the variables in its body are supposed to refer to top-level definitions or to parameters from surrounding lambdas. Thus, it is never safe for FuncExpr.Value() to return a FuncValue! If FuncExpr.Value() tried to return a FuncValue in the case in which all the variables of its body are in obj.scope, then

$x = "not this x"
$add = func($x) {
  func($y) {$x + $y}
}
test $add("foo")("bar")

would FuncExpr.Value() would incorrectly generate a FuncValue corresponding to func($y) {$x + $y} where $x refers to "not this x" instead of referring to the outer lambda's parameter.

gelisam commented 7 months ago

It thus looks like Expr.Value() should take an sctx parameter. Does it still make sense to use the same Expr.Value() for obtaining the Value to give to a resource, or would it make more sense to split Expr.Value() into Expr.ValueForResource() and Expr.SpeculativeValue(sctx)?

purpleidea commented 7 months ago

Does it still make sense to use the same Expr.Value() for obtaining the Value to give to a resource, or would it make more sense to split Expr.Value() into Expr.ValueForResource() and Expr.SpeculativeValue(sctx)?

Hmmm...

Is it still possible to do a step one where we don't do any sctx stuff, just to merge it and have some easy optimizations done, and grow our confidence in the optimization for those easy things?

The advantage of doing this in two steps are:

1) There's a bigger chance I more fully understand what's going on. 2) There's a chance we have a new realization that precludes needing the sctx thing (maybe unlikely)

The down side is we have extra commits in master, but that just means more points! \o/

It will also give us a chance to briefly verify if the signature of SetScope:

SetScope(scope *interfaces.Scope, sctx map[string]interfaces.Expr)

Still makes sense or if this should be a single combined arg or something different.

Actually a good question is: Are you currently happy/okay with the design of SetScope or is it broken in your mind as a stage?

gelisam commented 7 months ago

right, right, you want me to make these kinds of decisions on my own... I say we keep a single Expr.Value(sctx).

Except we make this speculative call from Graph(env), not SetScope, so it needs to be an env: []Func, not an sctx: []Expr.

purpleidea commented 7 months ago

right, right, you want me to make these kinds of decisions on my own

You're in charge, I'm meant to be here to complain if my spider sense goes off, or if I feel you're writing pure/functional code instead of elegant, imperative, faster golang code ;)

gelisam commented 7 months ago

Is it still possible to do a step one where we don't do any sctx stuff, just to merge it and have some easy optimizations done, and grow our confidence in the optimization for those easy things?

Yes, if we give up on optimizing inside lambdas. ExprCall.Graph() will have to check if sctx is non-empty, and if so, assume that calling Value() on the function would give the wrong result, and thus use dynamic sub-graphs.

purpleidea commented 7 months ago

if so, assume that calling Value() on the function would give the wrong result

So currently we use Value() here for printf... Does this mean that this is currently incorrect? Would we have to plumb through the sctx parameter? (This call happens during unification to inspect the format string.)

https://github.com/purpleidea/mgmt/blob/6a6546db8d3d481cfaf5adcb71c2329d1284cbd3/lang/funcs/core/fmt/printf_func.go#L280

gelisam commented 7 months ago

I'd try this:

$s = "%d"
$f = func($s) {
  print($s, 42)
}
test $f("%s") {}

I suspect that since Value doesn't have an sctx parameter, when printf calls Value() on the $s, it might see the top-level $s = "%d" and incorrectly accept this program instead of failing with a type error which says that we don't know the value of the format string and thus can't check if 42 has the right type.

purpleidea commented 7 months ago

I'd try this:

Indeed there's an issue! Actually I found a sort-of related bug. Hmmm...

Well here's the interesting point: plumbing through sctx would be very complicated... Which leads me to think that the more appropriate solution is one in which sctx is "captured" within the Expr AST node, so that we don't need to plumb it through the API...

gelisam commented 7 months ago

That would not work because the same Expr is used for several lambda calls. You can only store one sctx in the Expr itself, but you can pass a different sctx each time Graph() or Value() is called.

purpleidea commented 7 months ago

Yes, but if we have to have an sctx inside of Unify, that means we need to plumb this into the unification engine...

I feel like the right approach might involve an Initial Expr* whatever ast node, with it's own sctx, but then when it's used we copy it and have a new sctx contained within for each different Expr?

Or something else?

gelisam commented 7 months ago

Remember, we used to do that, but we had to change course in order to support map! Before mcl-functions, each ExprCall copied the body of the lambda it called, and then SetScope was able to make each $x use site inside a lambda func($x) {...} point to the argument $f(arg) of the corresponding call site. We did all that before type checking.

We can't do that with map, because the number of call sites is not known until runtime, after type checking. So we had to switch to a different implementation strategy in which a full.FuncValue generates a sub-graphs. Thanks to those sub-graphs, it is possible to change the shape of the function graph at runtime, so that we end up with one CallFunc for each call site even though the number of call sites is not known until runtime.

But we only create more Funcs, we don't create more Exprs! And we don't need to: lambdas are monomorphic, so the $x is func($x) {...} has only one type, so it only needs one copy.

purpleidea commented 7 months ago

From our phone conversation:

https://github.com/purpleidea/mgmt/blob/6a6546db8d3d481cfaf5adcb71c2329d1284cbd3/lang/ast/structs.go#L8802 https://github.com/purpleidea/mgmt/blob/6a6546db8d3d481cfaf5adcb71c2329d1284cbd3/lang/ast/structs.go#L8685

Both of these ExprParam and ExprPoly return nil, nil. Which we probably don't want.

This might come from here: https://github.com/purpleidea/mgmt/blob/6a6546db8d3d481cfaf5adcb71c2329d1284cbd3/lang/ast/structs.go#L8539

purpleidea commented 7 months ago

(I will add a patch to add that test in and add errors on those two nils.)

gelisam commented 7 months ago

From our phone conversation

More importantly for this MR, since ExprVar.SetScope() does

obj.scope.Variables[obj.Name] = sctx[obj.Name]

There is no need to add sctx to Value() after all.

purpleidea commented 7 months ago

Here's the patch (just testing)

https://github.com/purpleidea/mgmt/commit/a23312a913cb5263a2cbb90b2c4c3403276b2886

Sort of related:

import "fmt"
$format = "%d"  # should get ignored
$fn = func($format) {
    fmt.printf($format, 42)
}
test $fn("%s") {}
# should error at unification if possible, otherwise at runtime

So btw for this test, with the dynamic printf enabled ( PrintfAllowNonStaticFormat = true) this fails at runtime, and with it off, it fails during type unification. I guess that's fine for now. Of course if all the format strings are "%d", then this still can't unify with PrintfAllowNonStaticFormat = false. Which is too bad, but seems fine for now.

Why do I mention any of this? Because speculatively, I would expect this func to be run at compile time...

purpleidea commented 7 months ago

(Merged in master.)

gelisam commented 7 months ago

I would expect this func to be run at compile time

Wait, do you want to evaluate statically-known expressions at compile-time or to generate a graph with no sub-graphs when possible? It's not the same thing!

gelisam commented 7 months ago

Huh, the code on this branch doesn't even compile? 7b5466e fails with

lang/funcs/facts/func.go:90:18: obj.Fact.Call undefined (type Fact has no field or method Call)

Instead of starting from a broken #716, I'll start from master. I'll cherry-pick from #716 later if that turns out to be useful for this MR.

gelisam commented 7 months ago

I'm confused, why does this say that I have closed the PR??

Oh! I have reverted gelisam/graph-shape to point to master, so now it looks like master contains all the commits from gelisam/graph-shape and thus that the branch has been merged! Well played GitHub, this is a clever way to avoid having to store one boolean per PR indicating whether it has been merged or not.

purpleidea commented 7 months ago

lang/funcs/facts/func.go:90:18: obj.Fact.Call undefined (type Fact has no field or method Call)

Oh, hm weird. Maybe I forgot to include something in my commit? if you need it LMK and I'll make sure it's working.

gelisam commented 7 months ago

I have created tests based on the counter-examples in this thread, let's see if GitHub will allow me to reopen the PR...

gelisam commented 7 months ago

I have now made sure the new tests pass on CI, not just on my machine, before I start modifying the code.

Did we decide whether we would implement FakeTxn or if we would change Graph() to receive a Txn instead of returning a Graph?

purpleidea commented 7 months ago

Did we decide whether we would implement FakeTxn or if we would change Graph() to receive a Txn instead of returning a Graph?

We didn't decide, and I'm also undecided. If you feel strongly about one or the other, let's do that. I can't visualize all the implications yet.

purpleidea commented 7 months ago

I can't visualize all the implications yet.

Which really means, we should go with whatever you think is best, and hopefully we don't discover some reason that makes us need to change our minds after of course =D

gelisam commented 6 months ago

Last time we met, I was confused about Expr.Graph()'s env parameter. It represents the subset of variables which are currently in scope which come from a function parameter. I was worried that the a call site's env might be passed to the implementation, which would be bad if there is an $x which is in scope at both places, but it's not the same $x. Like this:

$make_exclaimer = func($x) {
  func() {$x + "!"}  # at the definition site, $x is "foo"
}
$use_exclaimer = func($f, $x) {
  $f() + $x  # at the call site, $x is "bar"
}
test $use_exclaimer($make_exclaimer("foo"), "bar") {}

Luckily, I wrote some comments in ExprFunc.Graph() and ExprCall.Graph() which clarify how I am handling the situation.

There are actually two separate cases. In the easy case, the definition site is a top-level definition:

$x = "foo"
$exclaimer = func() {$x + "!"}  # at the definition site, $x is "foo"
$use_exclaimer = func($f, $x) {
  $f() + $x  # at the call site, $x is "bar"
}
test $use_exclaimer($exclaimer, "bar") {}

In that case the definition site's env should be empty, and the call site clears out the env when making the recursive Graph() call, so everything is fine.

In the more complicated case, the definition site is inside another function, in this case $make_exclaimer. This time there is no recursive call to Graph(), so there is no opportunity for the call site's env to be accidentally given to ExprFunc.Graph(). Instead, the Funcs for the parameters have already been created earlier and the env is holding on to them. It is when that already-created Func was created that the definition site's env was captured. The call site doesn't call this Func directly, instead it adds an CallFunc of its own, and it is only a runtime when that Func sends a full.FuncValue to that CallFunc that the env hidden inside this full.FuncValue comes into play.

gelisam commented 6 months ago

For Value, this is a bit more complicated because we don't know whether if we're in the easy case or the complicated case. So the first step is to figure that out.

It should be possible to figure it out by traversing the body and check that all of the ExprVars refers to an ExprParam which is either introduced by the lambda itself or by a nested lambda. If so, we are in the easy case and env should be empty. Otherwise, we are in the more complicated case, which we can keep for later.

gelisam commented 6 months ago

Actually, ExprCall.Graph() only calls,Value() in that easy case, so in order to validate the approach, we can just unconditionally use the empty env. But that would cause runtime crashes in other circumstances, so we definitely can't skip the check in the name of only supporting the optimization for the rest case.

gelisam commented 6 months ago

@purpleidea I have now dealt with the env in ExprFunc.Value(). The optimization should now work in the most common cases, but I can't check because the code doesn't compile. It wasn't compiling for me before I made my changes either.

purpleidea commented 6 months ago

Sweet. New branch available with your changes and fixes in feat/graph-shape. More comments here: https://github.com/purpleidea/mgmt/issues/717#issuecomment-1909226779

purpleidea commented 5 months ago

Sorry I've been a bit AFK, I'm mid conferencing and gave my first presentation today. One more on Monday, and then a workshop on Wednesday =D Thanks for all your great work. Ping me for anything you're stuck on or busy work that you need me to refactor/golangify/etc.

gelisam commented 5 months ago

Sorry I've been a bit AFK

Me too, I was in NYC for work! I have also been elected as one of the administrators of my building, and it's taking a lot of time. I was actually working on mgmt to relax a bit from all that 😅

I'm mid conferencing and gave my first presentation today. One more on Monday, and then a workshop on Wednesday =D

Woohoo! I hope everybody is as excited about mgmt as you were hoping!

Ping me for anything

Not a chance, focus on your conference!

gelisam commented 5 months ago

165 / 206 tests pass, not too bad but there is still a lot of work to do!

gelisam commented 5 months ago

up to 183 / 206!

purpleidea commented 5 months ago

Sweet! Remember TestAstFunc1 doesn't matter since that's just shapes which will change.

On Sun, Feb 4, 2024, 05:02 Samuel Gélineau @.***> wrote:

up to 183 / 206!

— Reply to this email directly, view it on GitHub https://github.com/purpleidea/mgmt/pull/724#issuecomment-1925572476, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABA7M2S363NZG6VI4B52EDYR4B4PAVCNFSM6AAAAABAX2TBGKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMRVGU3TENBXGY . You are receiving this because you were mentioned.Message ID: @.***>

gelisam commented 5 months ago

Oh, I didn't even look at TestAstFunc1, those are all TestAstFunc2 failures.

Here's the next failure:

$f = func($x) {$x + "!"}
test $f("a") {}
test $f("b") {}

Since the problem only occurs when the function is called twice, I suspect that we're trying to Revert () the nodes added by the first call, so they're no longer available for the second call.

gelisam commented 5 months ago

up to 195 / 206!

gelisam commented 5 months ago

The next failure is chained-vars.txtar, slightly tweaked to make the labels unique:

import "fmt"

$zero = 0
$one = $zero + 1
$two = $one * 2 # needs a chain to panic

test fmt.printf("%d%d%d", $zero, $one, $two) {}

The test always generates the wrong Func graph, but it is non-deterministic, different runs generate different incorrect Func graphs:

failure1 failure2 failure3 failure4 failure5 failure6

I am thus guessing that when we generate the static graph and capture it via the Txn which isn't connected to the engine, the code which generates the static graph spawns some threads and not all the nodes have been added by those threads when the graph is extracted from the Txn.

purpleidea commented 5 months ago

I'm trying to think what would race there, that feels a bit unusual unless there was something concurrent added in this branch. I'll look at the diff tonight. Could there be a bug or an API mismatch somewhere in the TXN code?

I'll be back in a few days and on a regular hacking schedule again.

Thanks for your work here!

On Sat, Feb 10, 2024, 22:37 Samuel Gélineau @.***> wrote:

The next failure is chained-vars.txtar, slightly tweaked to make the labels unique:

import "fmt"

$zero = 0 $one = $zero + 1 $two = $one * 2 # needs a chain to panic

test fmt.printf("%d%d%d", $zero, $one, $two) {}

The test always generates the wrong Func graph, but it is non-deterministic, different runs generate different incorrect Func graphs:

failure1.png (view on web) https://github.com/purpleidea/mgmt/assets/49000/a0b0c147-0188-4b5f-9211-ec17025a4ea2 failure2.png (view on web) https://github.com/purpleidea/mgmt/assets/49000/e9553b4e-865f-42f1-84c9-6cc233a0e3ef failure3.png (view on web) https://github.com/purpleidea/mgmt/assets/49000/fe4968df-62de-48d7-89ae-75a3b0af723c failure4.png (view on web) https://github.com/purpleidea/mgmt/assets/49000/96734ece-78b9-44af-9d12-ac7d4befed98 failure5.png (view on web) https://github.com/purpleidea/mgmt/assets/49000/654e1b02-1041-495a-80c9-ec4631f58587 failure6.png (view on web) https://github.com/purpleidea/mgmt/assets/49000/54f4c86e-6670-4184-aa05-72eed9bd7e8e

I am thus guessing that when we generate the static graph and capture it via the Txn which isn't connected to the engine, the code which generates the static graph spawns some threads and not all the nodes have been added by those threads when the graph is extracted from the Txn.

— Reply to this email directly, view it on GitHub https://github.com/purpleidea/mgmt/pull/724#issuecomment-1937189321, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABA7M2A4YOILXARWGNZNNTYS7R73AVCNFSM6AAAAABAX2TBGKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMZXGE4DSMZSGE . You are receiving this because you were mentioned.Message ID: @.***>

gelisam commented 5 months ago

I think the problem is as follows.

For dynamic graphs, it is normal for a Func to hold onto the Txn and to use it to add more nodes asynchronously over time. For the optimization, I need the opposite: I need the full.FuncValue to add all of its nodes synchronously, before we extract the static graph from the Txn. I assumed that in those cases in which we know that the graph is static, the full.FuncValue must be adding all of its nodes synchronously. But I was probably mistaken: the graph will quickly reach an equilibrium, but it might do so asynchronously.

gelisam commented 5 months ago

I tested my hypothesis by sleeping for three seconds before extracting the graph for the Txn, but that did not fix the problem. Thus, it is not the case that the graph quickly reaches an equilibrium.

Maybe it's not threads, but the random order in which the elements of a map are traversed?

gelisam commented 5 months ago

Let's see, so ExprInt.Graph() is called on 0, and a ConstFunc is added to the Func graph.

Then, ExprCall.Graph() is called on $zero + 1. ExprFunc.Value() is called speculatively on +, and it successfully returns a full.FuncValue. In this branch, there are now two kinds of full.FuncValues: timeless and timefull. You and I know that + is timeless, but we haven't taught mgmt about this fact yet, so the returned full.FuncValue is actually timefull.

Timeless or timefull, the optimization fires regardless, so the full.FuncValue is only called once. This particular full.FuncValue comes from FuncToFullFuncValue, which adds _operator's three input edges, synchronously. Looks like there is a step missing to actually add the _operator itself, that's fine here since adding an edge also adds the input and output nodes, but this is going to cause problems for functions which take zero arguments.

Everything so far seems to work as intended, since that _operator node and its input edges are always present in the GraphViz representation.

Next, ExprCall.Graph() is called on $one * 2. ExprFunc.Value() is called speculatively on *, and presumably it again successfully returns a timefull full.FuncValue. Thus, the optimization fires again.

The FuncToFullFuncValue adds another _operator and its three input edges, synchronously. But this time, something goes wrong, because only a random subset of those edges appears in the GraphViz representation.

What are the differences between the + and * cases? I can't think of any. They both use the function _operator, they both have one argument which is a variable and whose target is already in the Func graph when ExprCall.Graph() is called, and one argument which is a number and is not yet in the Func graph. The one obvious difference is that $one's target is an _operator which already has input edges coming into it, instead of a ConstFunc with no inputs, but what does that change?

Another difference is that $two comes later in the Ordering dag than $one. So if there are two threads running and there is a race, maybe the first thread always has time to process $one but is still in the middle of processing $two when the second thread wins the race. And the printf, which is just completely missing from the Func graph, never even gets a chance to get processed. The fact that adding a three second sleep just before extracting the graph from the Txn didn't change anything argues against this idea, but since I'm running out of ideas, er, maybe there's a different place where sleeping would make a difference?

purpleidea commented 5 months ago

The FuncToFullFuncValue adds another _operator and its three input edges

This jumped out at me... Is the pointer of _operator unique or is there a chance you forgot to copy it or something and it's actually the same operator pointer as previously used?

If that's not it, then don't worry about this, we'll debug together after I look at the code more. I'm back in two days and I'll be focusing more on coding and not demoing/etc. Cheers

gelisam commented 5 months ago

is there a chance you forgot to copy it or something and it's actually the same operator pointer as previously used?

I did forget initially, and one of the tests caught it, and that mistake has since been fixed. But who knows, maybe I fixed it incorrectly, just enough to fix that other test but not well enough to fix this one as well?

Nah, there are clearly two _operator nodes on that GraphViz representation, I don't think that's it. But it's worth trying a variant with a built-in other than _operator!

gelisam commented 5 months ago

Nope, same behaviour if I use contains instead of *