MikeStall commented 2 years ago

(this is a bit more advanced for a first issue - requires knowledge of topological sorts. But it's marked as a "first issue" since the code is very accessible with few dependencies)

RecalcEngine's lets you define Formulas (via SetFormula) and then it will automatically recalc the formula if the dependencies change.

But the implementation here is extremely hack: https://github.com/microsoft/Power-Fx/blob/e2ffacc79cd57a2f5b9a30cc05c41460d3478fb0/src/libraries/Microsoft.PowerFx.Interpreter/RecalcEngineWorker.cs#L18

This is the ability to call UpdateVariable() and cause other Formulas (created from SetFormula) to be recalculated.

It's good enough to pass the tests, but a very poor implementation. See tests here: https://github.com/microsoft/Power-Fx/blob/e2ffacc79cd57a2f5b9a30cc05c41460d3478fb0/src/tests/Microsoft.PowerFx.Interpreter.Tests/RecalcEngineTests.cs#L124

A good implementation:

would statically understand what each formula depends on
efficient recalc - does not unnecessary work.
robust handling for circular references
include aggressive tests.
allow deleting a formula.

The current implementation is also very inefficient. We could construct a perf benchmark to show before and after.

omprakaash commented 2 years ago

Can I take up this issue? I am interested in working on this. My current understanding is that the implementation of RecalcEngine avoids cycles by only allowing new formulas to refer to previously defined formulas and by not allowing formulas to be redefined. Would an improved implementation allow for formula redefinition by detecting cycles in the top-sort algorithm ?

orcmid commented 2 years ago

This is an off-hand comment. I have not examined RecalcEngine, just musing from somewhere close to first principles. I recall that there was once an iteration cap that was used to clamp off recalcs with cycles. Is that still around?

@omprakaash asks: Would an improved implementation allow for formula redefinition by detecting cycles in the top-sort algorithm ?

Technically, there is no topological sort of a directed graph having cycles longer than 1. I just dug this out of "The Art of Computer Programming", volume 1 Exercise 2.3.4.2-4.

So, the ideal conditions given by @MikeStall require some sort of hacks for which there is assured termination and minimal update (addition, change, deletion) consequences. That's a large order.

The other aspect (in comparison with Knuth's solution) is that it would be unusual for the (Excel) dependency tree to be rooted. There are usually multiple cells that are not depended on and may or may not depend on others although I presume that the inversion (ones depended on that depend on no others) is important for recalc. Then there's the matter of determining when to recalc when change(s) occur and how generational updates occur (I am guessing though.)

PS: I have this as a practical problem with a pet data structure, although the conditions for topological sorting are satisfied. The inversion is desired for extracting the structure to a sequential stream. The optimization challenge is having nodes appear in the stream close to where they are (first) needed (depended on) as chunks of the data structure are composed. I expect there will have to be a good-enough heuristic for that.

MikeStall commented 2 years ago

@omprakaash - feel free to take a look! I'd recommend a) understanding why the current implementation is poor; b) perhaps starting with some new tests (bonus if they break the current impl), c) a draft PR to capture the essence of your fix.

(fyi, @anderson-joyle had also expressed interest in this).

@orcmid - we do not support cycles - so SetFormula() should fail if attempting to create a cycle. It does beg the question if we should have an overload of SetFormula() that accepts multiple formulas and sets in bulk.

This is still preview features, so we can make breaking changes to the SetFormula contract here if really needed.

omprakaash commented 2 years ago

Got it. Thank You! Spent some time testing. Will create a Draft PR to discuss more about it.

@orcmid Thank you for the detailed reply. Sorry I wasn't clear on the top-sort part. I meant that the algorithm itself could detect cycles as a means of error management. Thus, cycles can now be prevented without placing restrictions on formula redefinition.

anderson-joyle commented 2 years ago

Will be good to see 2 or more approaches to this problem.

MikeStall commented 2 years ago

One thing that's hacky about the current implementation - we shouldn't need Hash in RecalcEngineWorker. If we update A, we should know statically the other vars that need to be updated, without having to do string comparisons or hooking the resolver or IScope.

jcoc611-microsoft commented 2 years ago

What do you all think of these C# APIs? I think this transaction-style approach would be slightly simpler (orthogonal to a better implementation for tracking dependencies)

RecalcEngine re = new();

// None of these affect the underlying RecalcEngine until the re.Recalc(rt) call
RecalcTransaction rt = re.NewTransaction();
rt.UpdateVariable(name, formulaValue);
rt.UpdateFormula(name, formula);
rt.RemoveVariable(name);
rt.RemoveFormula(name);
//...
rt.DirtyFormulas;         // fast, topo-sorted, and deduped no-alloc enumerator
rt.DirtySymbols;          // both formulas and variables
rt.Errors, rt.HasErrors; // e.g. missing symbols, cycles, etc

await re.Recalc(rt); // async and parallelized, maybe throws if rt.HasErrors?

Not sure if onUpdate callbacks are also needed.

Another interesting issue that hasn't been mentioned is "volatile" functions such as Today() which technically don't depend on other variables but should probably be recomputed on every recalc (this is the Excel policy)

MikeStall commented 2 years ago

@bryancroteau-MSFT is looking at adding topological sorts. That could be useful here. https://github.com/microsoft/Power-Fx/pull/320

bryancroteau-MSFT commented 2 years ago

320 has been merged into main. The next step is to modify the binder, or add a new binder-like tree visitor, which can produce a list of known edges between formulas.

jcoc611-microsoft commented 2 years ago

320 has been merged into main. The next step is to modify the binder, or add a new binder-like tree visitor, which can produce a list of known edges between formulas.

what's the difference between that and using checkResult.TopLevelIdentifiers?

bryancroteau-MSFT commented 2 years ago

I will need to see how that is being used. The difference is that we would want the binding to skip some work if we are just checking for dependencies between formulas without doing full type reconstruction or type checking.

omprakaash commented 2 years ago

There seems to be a breaking bug in the current implementation of RecalcWorker when the dependency graph is diamond shaped. I included some tests in PR #315. Eg: a = 1,b = 2*a, c = 2* a, d = b + c.

microsoft / Power-Fx

Improve recalc engine algorithm #308

320 has been merged into main. The next step is to modify the binder, or add a new binder-like tree visitor, which can produce a list of known edges between formulas.

320 has been merged into main. The next step is to modify the binder, or add a new binder-like tree visitor, which can produce a list of known edges between formulas.