handsontable / hyperformula

HyperFormula is an open-source headless spreadsheet for business web apps. It comes with over 400 formulas, CRUD operations, undo-redo, clipboard support, and sorting.
https://hyperformula.handsontable.com/
Other
1.98k stars 109 forks source link

Async functions support #9

Open wojciechczerniak opened 4 years ago

wojciechczerniak commented 4 years ago

Description

Common request for our previous formula engine. Let's keep it as a feature request.

Sample

parser.setFunction('SUM_ASYNC', (params) => {
  return new Promise((resolve) => {
    return setTimeout(() => resolve(params[0] + params[1]), 1000));
  }
});

parser.parse('SUM_ASYNC(2, 3)').then((result, error) => { });
MartinDawson commented 2 years ago

This would be an extremely useful feature, especially for custom functions that need to load data from the backend like our FINANCIAL API.

Currently we preload data on page load but later on will need async functions I think as we can't preload all financial data for everything.

Is this on the roadmap soon and is their an idea of how this will need to be done? Because I could try and do it in a few months when we will need it if there's no plans to implement it yet

MartinDawson commented 2 years ago

@wojciechczerniak I'm going to try and implement this feature because we need it for our site on custom functions.

Could you give some rough guidance on the steps that need to be taken to do this please?

Unless this feature is a massive re-write or something.

wojciechczerniak commented 2 years ago

Key Concepts will be a good start to understand how HyperFormula works with AST. I would start with Evaluator or Interpreter, who calls function and when. Then go from there through each caller to make it async as well. This would stop at public methods, I think. Maybe @bardek8 or @izulin have more detailed ideas. Sorry, I couldn't be more helpful here, not my expertise.

Some of my thoughts and design considerations:

Async and non async

Both function types should work. Preferably without changing everything to automatically resolved promises. We need to discover when function is async or not.

Parallel evaluation

The naive implementation may traverse AST as synchronous code does. One function at a time. But while for synchronous code we cannot do anything about this, the asynchronous code is a different story. If asynchronous functions are used to make HTTP requests, we may parallelize them with Promise.all and continue resolving formulas after all input is ready.

Repetitive ASTs

As mentioned in Key Concepts guide, same functions and ranges are grouped to avoid calculating the same value twice. But does it apply to asynchronous functions? I bet they will not always return same value. ie =REMOTE_GUID() to generate some unique ID, or =RANDOM_PRODUCT() that may return random output for each use case.

Error handling

What will happen if one of the promises is rejected? We need a new ErrorType, and ErrorCode. But most importantly, we need to make sure that we handle Promise.reject well and gracefully.

Retries

Should we retry if Promise is rejected? Or what if a function author loop promises inside his code? Or there are more than one promise and HTTP request within a single function.

Circuit breaker

If we have retries, how many is too many? We should break the loop at some point and display an error (different one, than before?). Even if retries are not our code, we should have a timeout for each asynchronous function in case they take to long to finish

Circular dependencies

I totally have no idea how to handle those. An async function may want to make request based on results of some other functions that are also a result of the formula. Edge case, but may bite us 😵

Named Ranges

They're more or less cells with different addressing. But async functions should be tested with scopes.

Volatile

Does asynchronous functions are always volatile and should be re-requested each time a change to workbook is made? Probably yes if want to keep most of HF logic. Maybe responses should be read from cache when possible then? An external cache should be sufficient, ServiceWorker can be used for example. So maybe we don't have to worry about it within HF code.

Undo-redo

What will happen on undo? Should we restore previous value or make a new async request?

API has to be async

Each method has to be considered separately. But most of them return changes array. Therefore they need to all calculations to end before code can be evaluated further. That means most of API methods has to return Promise<changes> instead just changes.

Before:

const changes = hfInstance.setCellContents({ col: 3, row: 0, sheet: 0 }, [['=B1']]);

After:

const changes = await hfInstance.setCellContents({ col: 3, row: 0, sheet: 0 }, [['=B1']]);

Helper functions

ie. calculateFormula when called will always make a new request when async function was used or referenced ranges contain one. It also should become asynchronous: async calculateFormula(formula: string): Promise<result>

Async functions + Array functions

This combination of features blows my mind 🤯 Returned arraySize should be async as well then. But can we wait until function is fully evaluated and all promises resolved? or should we predict a value first?

Batch

I don't see why it wouldn't work if all the above is done correctly. Just make sure we don't make any requests or resolve promises while in suspendEvaluation state.

Mocks

If you're going to make a PR, please consult library for creating network mocks. I believe we don't have one yet in this project as it wasn't needed. Any new tool should be convenient for @handsontable/devs, as they will maintain the functionality later on.

MartinDawson commented 2 years ago

Thanks for the information. That definitely helps. Will start on it tomorrow and see how it goes.

MartinDawson commented 2 years ago

@wojciechczerniak I've got initial async functions working in custom-functions.spec.ts. Ignore the test file changes. Evaluator.ts & Interpreter.ts are the main changes.

https://github.com/TrackTak/hyperformula/tree/async-functions

Async and non async Both function types should work. Preferably without changing everything to automatically resolved promises. We need to discover when function is async or not.

I'm having some trouble with understanding how this would be possible & specifically what you mean by this?

Currently in the branch I've changed any public function that internally calls recomputeIfDependencyGraphNeedsIt to be async and return a promise.

If we have async functions such as ASYNC_SUM which affect other cells then even functions like addRows() will need to return promises I think. Because it will need to recalculate these cells in an async way.

If the async functions are just functions that hit backend API's and don't depend on other cells then I can see how addRows could stay synchronous because it would never affect another cell and I could add seperate intenral methods that are sync & async versions.

Is this correct or am I wrong here?

I just wanted clarification before I end up changing all the test files if I am wrong.

wojciechczerniak commented 2 years ago

Async and non async Both function types should work. Preferably without changing everything to automatically resolved promises. We need to discover when function is async or not.

I'm having some trouble with understanding how this would be possible & specifically what you mean by this?

I see two options, runFunctionSync + runFunctionAsync or recognize functions that are returning Promise with instanceof Promise.

Currently in the branch I've changed any public function that internally calls recomputeIfDependencyGraphNeedsIt to be async and return a promise.

If we have async functions such as ASYNC_SUM which affect other cells then even functions like addRows() will need to return promises I think. Because it will need to recalculate these cells in an async way.

We're not mixing functions with api methods? All API methods will be async always.

If the async functions are just functions that hit backend API's and don't depend on other cells then I can see how addRows could stay synchronous because it would never affect another cell and I could add seperate intenral methods that are sync & async versions.

Unless we decide to drop requirement to return changes. But IMO it's a very useful functionality.

MartinDawson commented 2 years ago

Error handling What will happen if one of the promises is rejected? We need a new ErrorType, and ErrorCode. But most importantly, we need to make sure that we handle Promise.reject well and gracefully.

I've added a timeoutTime property on the config & a new #TIMEOUT! error. https://github.com/TrackTak/hyperformula/commit/af124a0a42a1d507c389dea40d5d62a6a01a6707

If this is fine I can add the translations.

Parallel evaluation The naive implementation may traverse AST as synchronous code does. One function at a time. But while for synchronous code we cannot do anything about this, the asynchronous code is a different story. If asynchronous functions are used to make HTTP requests, we may parallelize them with Promise.all and continue resolving formulas after all input is ready.

Not too sure how to do this. I think it needs to a change to the algorithm for the sorting of formulaVertexes, because what happens if the async formulas have cell references in it?

For example, the following code:

  const engine = await HyperFormula.buildFromArray([
        [1, '=ASYNC_FOO(A1)', '=ASYNC_FOO(A1+A4)', 6],
        [2, '=ASYNC_FOO(B1)']
      ])

I think requires algorithm change to group the separate tree's so this will be possible:

Promise.all([A2,A3]) & Promise.all([B2])

Currently this.dependencyGraph.topSortWithScc() returns a sorted flat array.

I see two options, runFunctionSync + runFunctionAsync or recognize functions that are returning Promise with instanceof Promise.

~This is not possible.~

~Cannot split out runFunction & runFunctionAsync. All of them must be runFunctionAsync because non-async functions such as SUM() could contain async AST's like =SUM(=PRICE("AMZN"), 2).~

~So the evaluateAst in runFunction must be done in an async manner always like so:~

Volatile Does asynchronous functions are always volatile and should be re-requested each time a change to workbook is made? Probably yes if want to keep most of HF logic. Maybe responses should be read from cache when possible then? An external cache should be sufficient, ServiceWorker can be used for example. So maybe we don't have to worry about it within HF code.

I do not beleive that all async functions should be automatically marked as volatile.

Consider this function: =PRICE("AMZN", "price", DATE(2020,12,31), 1). It needs to be async but it will always return the same value.

Users of the FunctonPlugin should just mark which functions are volatile with the volatile: true flag.

Also for the ServiceWorker cache, I read the docs and I'm not sure how this can work. Different function params could require different cache update times: i.e

=PRICE("AMZN", "price"). =PRICE("AMZN", "revenue")

The second functions network request should be cached for a long time whereas the first should probably be cached for a short period (perhaps every 15 minutes for example) because price updates a lot faster than revenue.

The ServiceWorker cannot know whether to refetch the data or not.

I took a look at the caching strategies here: https://developers.google.com/web/tools/workbox/modules/workbox-strategies

None of these seem to be what I think is needed. These seem to be suited more towards Hyperformula working in offline mode (which seems like another ticket).

Perhaps we should not implement a cache and instead just let users of Hyperformula implement their own caching for these functions on their backend?

This way the network request always fires but it returns a cached value or the updated one always so it's always up to date.

What will happen on undo? Should we restore previous value or make a new async request?

It should make a new network request as mentioned above with the backend cache this should be fine anyway. This means there's no stale data either.

wojciechczerniak commented 2 years ago

Error handling What will happen if one of the promises is rejected? We need a new ErrorType, and ErrorCode. But most importantly, we need to make sure that we handle Promise.reject well and gracefully.

I've added a timeoutTime property on the config & a new #TIMEOUT! error. TrackTak@af124a0

If this is fine I can add the translations.

LGTM 👍 I wouldn't bother with translation until PR is accepted

Parallel evaluation The naive implementation may traverse AST as synchronous code does. One function at a time. But while for synchronous code we cannot do anything about this, the asynchronous code is a different story. If asynchronous functions are used to make HTTP requests, we may parallelize them with Promise.all and continue resolving formulas after all input is ready.

Not too sure how to do this. I think it needs to a change to the algorithm for the sorting of formulaVertexes, because what happens if the async formulas have cell references in it?

For example, the following code:

  const engine = await HyperFormula.buildFromArray([
        [1, '=ASYNC_FOO(A1)', '=ASYNC_FOO(A1+A4)', 6],
        [2, '=ASYNC_FOO(B1)']
      ])

I think requires algorithm change to group the separate tree's so this will be possible:

Promise.all([A2,A3]) & Promise.all([B2])

Currently this.dependencyGraph.topSortWithScc() returns a sorted flat array.

I'm not sure either. Not my expertise, sorry. CC @bardek8 or @izulin

I see two options, runFunctionSync + runFunctionAsync or recognize functions that are returning Promise with instanceof Promise.

This is not possible.

Cannot split out runFunction & runFunctionAsync. All of them must be runFunctionAsync because non-async functions such as SUM() could contain async AST's like =SUM(=PRICE("AMZN"), 2).

So the evaluateAst in runFunction must be done in an async manner always like so:

      const promises = args.map((ast) => this.evaluateAst(ast, state))
      const values = await Promise.all(promises)

      argValues = values.map((value) => [value, false])

I see. Well, that's unfortunate and another breaking change. Just async public methods are enough to make it a feature for HF 2.0, but still another pain to migrate. Unless @bardek8 @izulin @voodoo11 see some other way to do it.

Volatile Does asynchronous functions are always volatile and should be re-requested each time a change to workbook is made? Probably yes if want to keep most of HF logic. Maybe responses should be read from cache when possible then? An external cache should be sufficient, ServiceWorker can be used for example. So maybe we don't have to worry about it within HF code.

I do not beleive that all async functions should be automatically marked as volatile.

Consider this function: =PRICE("AMZN", "price", DATE(2020,12,31), 1). It needs to be async but it will always return the same value.

Users of the FunctonPlugin should just mark which functions are volatile with the volatile: true flag.

I agree. That's a good design decision 👍 Thank you.

Also for the ServiceWorker cache, (...)

Perhaps we should not implement a cache and instead just let users of Hyperformula implement their own caching for these functions on their backend?

Yeah, that's what I had in mind. ServiceWorker is application developer responsibility. If we have volatile and undo-redo sorted out, caching wont be necessary internally.

What will happen on undo? Should we restore previous value or make a new async request?

It should make a new network request as mentioned above with the backend cache this should be fine anyway. This means there's no stale data either.

Agree 👍

wojciechczerniak commented 2 years ago

Retries - Need clarification if @handsontable/devs actually want this or not Circuit breaker - Need clarification if @handsontable/devs actually want this or not

IMO timeout is enough of a circuit breaker. Function author may loop and retry inside function as long as it's within the time limit.

MartinDawson commented 2 years ago

New draft PR: https://github.com/handsontable/hyperformula/pull/863

warpech commented 2 years ago

The PR #863 was abandoned after a sincere attempt by @MartinDawson. Maybe Martin would like to explain here what's his current approach to async functions, if there is any?

FWIW, right now, the easiest way at achieve the asynchronicity is in the user code outside of HyperFormula. I mean a two-step strategy, in which your custom function e.g. SHARE_PRICE("AMZN") synchronously resolves to a value from cache (or gives an error if the cache is empty) while asynchronously fetching the current value. When fetching is done, update the cache and reevaluate getCellValue/getSheetValues. We will soon add such example to the docs #892.

Or, if you need it to run in one pass, simply resolve your async code before feeding the data to Hyperformula.

Implementing async functions as discussed here is not something that we are working on right now. The focus for this quarter is to improve what's available under the existing API (bug fixes, docs).

We might put async functions on the roadmap in the upcoming quarter, because it is obviously useful for some scenarios.

MartinDawson commented 2 years ago

@warpech I'm using this (buggy) PR still: https://github.com/handsontable/hyperformula/pull/879

I couldn't really understand how to fully get your suggestion to work from trying it so I am not doing it that way.

Issues with my draft PR is that I should not have changed evaluator.ts to be async and instead calculated async values in a separate file like the ArraySize is done basically.

There's also issues with async dependencies, array formulas & parallelisation that don't properly work in it in more complex scenario's.

I'm working on it again though bit by bit (as I need it) as I know how to do it much better now I understand HF more so when async functions is picked up in this or a future quarter I can make another PR and should have majority of it done by then.

JohannCooper commented 5 months ago

FWIW, right now, the easiest way at achieve the asynchronicity is in the user code outside of HyperFormula. I mean a two-step strategy, in which your custom function e.g. SHARE_PRICE("AMZN") synchronously resolves to a value from cache (or gives an error if the cache is empty) while asynchronously fetching the current value. When fetching is done, update the cache and reevaluate getCellValue/getSheetValues.

@warpech, I know this thread hasn't been active in a while but is there any update? Is the strategy quoted above still the recommended approach? Thanks!

MartinDawson commented 5 months ago

@JohannCooper Sorry for (very late) reply, I got it working a while back with async cells at each level (i.e promise.all() at each level down the ayclic directed graph tree) but moved onto something else so never got back to this PR

You can see the repo here: https://github.com/TrackTak/hyperformula

Specifically: https://github.com/TrackTak/hyperformula/commit/ca9bbad10e34bff2d5ba50010673875512c80d6d https://github.com/TrackTak/hyperformula/commit/d8739857dc6853ddeb0cd717d24381bac1c8dadb https://github.com/TrackTak/hyperformula/commit/ee54baba545626a9c7a0d0a60c1553a2cdc2e60a

And (unfinished) chunking of async values (which is more performant): https://github.com/TrackTak/hyperformula/commits/chunk-async-formulas

IIRC @warpech suggestion didn't fully work for me for some reason or was not efficient enough, I can't 100% remember