jsonata-js / jsonata

JSONata query and transformation language - http://jsonata.org
MIT License
2.05k stars 220 forks source link

Mixing sync/async registered functions #499

Open avaidyam opened 3 years ago

avaidyam commented 3 years ago

Hi @andrew-coleman,

I'm using JSONata as kind of a makeshift REST API query engine where each of the API methods are wrapped as async functions registered with JSONata. Right now, even though a function is marked async, doing anything with its result (i.e. $asyncFunc() vs. [ $asyncFunc() ]) causes it to "await" at that point in time. Is there any way to "mark" certain functions as requiring explicit await-ing/then-ing?

Below is my JSONata "query":

$Study_all("dnbd16yj2zkegk67aqm8").{
  "id": id,
  "name": name,
  "participants": $Participant_all(id).{
    "id": id, 
    "name": $Tags_view(id, "lamp.name"),
    "last_activity": $ActivityEvent_all(id, null, null, null, 1),
    "last_sensor": $SensorEvent_all(id, "lamp.analytics", null, null, 1)
  }
}

And here's the sample code I've modified and stubbed with delays of 700ms to test with. Right now it takes about 10530ms to execute where I would like to bring that down to ~2800ms, by parallelizing the inner object's calls to Tags_view, ActivityEvent_all, and SensorEvent_all.

import React, { useState } from "react";
import jsonata from "jsonata";

const delay = async (ms, value) => new Promise(resolve => setTimeout(resolve, ms, value))

function run(expr) {
  return new Promise((resolve, reject) => {
    try {
      var expression = jsonata(expr);
      expression.registerFunction(
        "Study_all",
        async (a) => await delay(700, [{"id":1,"name":"StudyA"},{"id":2,"name":"StudyB"}])
      );
      expression.registerFunction(
        "Participant_all",
        async (a) => await delay(700, [{"id":1},{"id":2}])
      );
      expression.registerFunction(
        "Tags_view",
        async (a) => await delay(700, "TagName"+a)
      );
      expression.registerFunction(
        "ActivityEvent_all",
        async (a) => await delay(700, [{"timestamp":1,"data":"A_"+a},{"timestamp":2,"data":"B_"+a}])
      );
      expression.registerFunction(
        "SensorEvent_all",
        async (a) => await delay(700, [{"timestamp":1,"data":"C_"+a},{"timestamp":2,"data":"D_"+a}])
      );
      expression.registerFunction(
        "Promise_all",
        async (iter) => {
          console.dir(iter)
          return Promise.all(iter)
        },
      );
      // $Promise_all($map([3, 4, 5, 6, 7], $add))
      expression.evaluate({}, {}, (error, result) => {
        if (error) {
          reject(error);
          return;
        }
        resolve(result);
      });
    } catch (e) {
      reject(e);
    }
  });
}

export default function App() {
  const [expr, setExpr] = useState(`$map([3, 4, 5, 6, 7], $add).$`);
  const [state, setState] = useState("");
  const [running, setRunning] = useState(false);
  const [runtime, setRuntime] = useState(null);

  const onClick = async () => {
    setRunning(true);
    const start = new Date().getTime();
    await run(expr).then(setState).catch(setState);
    setRunning(false);
    setRuntime(new Date().getTime() - start);
  };
  return (
    <div className="App">
      <h1>JSONata Test</h1>
      <h2>Start editing to see some magic happen!</h2>
      <textarea
        value={expr}
        onChange={(e) => setExpr(e.target.value)}
        rows="10"
        cols="40"
      />
      <br />
      <div>
        <button onClick={onClick} disabled={running}>
          {running ? "Running..." : "Run"}
        </button>
        {runtime && (
          <span style={{ marginLeft: "10px" }}>{runtime} ms runtime</span>
        )}
      </div>
      <pre>{JSON.stringify(state, null, 2)}</pre>
    </div>
  );
}
markmelville commented 3 years ago

I made a codesandbox of your react app, which shows the ~10s execution time (15 calls of 700ms, not executed in parallel).

I see you started work on a function called Promise_all. I raised an issue a while ago while trying to write custom functions that are async and that use jsonata functions as arguments. I've been using the tips I learned from that work to figure out how make your code execute in parallel, but I'm having a problem I can't figure out. I'll post back soon.

markmelville commented 3 years ago

Good news, I figured out the problem from my previous post, so here it is: an updated codesandbox that runs all your async methods in parallel. The runtime is 2100ms. that's 700 for the call to get studies, 700 to get all participants of all studies (in parallel), and 700 to get info about all participants of all studies (in parallel). The trick is to be able to control the iteration of arrays, so that it will be done in parallel. If you rely on jsonata operators, they will iterate arrays (and keys of objects) serially.

avaidyam commented 3 years ago

Wow, thanks @markmelville! Do you or @andrew-coleman know of any way to avoid the mapParallel and resolveParallel functions here (i.e. build that functionality INTO JSONata itself somehow?) Is this an issue by design or something that could be resolved?

markmelville commented 3 years ago

I've started to try changing the code to be parallel by default. It's not as much as an overhaul as I thought. Someone with more history of the project would have to say if there is a specific reason it's not already that way.

avaidyam commented 3 years ago

@markmelville Do you happen to have a fork with this in progress?

avaidyam commented 3 years ago

@markmelville Bump! Is this code somewhere I can contribute to?

markmelville commented 3 years ago

actually, just last weekend I came back to this, and was doing some performance tests on it in light of #237. I hoped that by doing more in parallel it would perform much better, but the results were actually the opposite. So even though it cuts the execution time for your use-case down, it apparently benchmarked about twice as slow in general. this might not be a good contribution to the project.

https://github.com/markmelville/jsonata/tree/parallelism

avaidyam commented 3 years ago

@markmelville I've worked off of your branch and converted all the generators to async functions. It now seems to work much better and seems to be quite performant! There's still some test cases that need to be resolved/fixed, since there's a dangling promise somewhere that causes unhandled promise rejections. Once that's resolved it should be good to go.

markmelville commented 3 years ago

Well, I do believe @andrew-coleman will want to weigh in on whether the generators can be entirely thrown out. I think having them is what enables synchronous invocation of evaluate. Switching exclusively to async may be something for a next major version.

avaidyam commented 3 years ago

@markmelville I agree that @andrew-coleman should weigh in on this, but I would also like to add that you can sequentially drive async functions instead of firing multiple off in parallel, so if you did that, you would have a similar performance profile to the previous generator version.

andrew-coleman commented 3 years ago

Hi @markmelville and @avaidyam, thank you for experimenting with these ideas. Performance is probably the most important issue at the moment that I would like to get improved, so I'm really glad you are doing some constructive work here.

In my rambling comment here (bullet 2), I forgot to mention that at the time I rewrote the recursive engine to be asynchronous, the async/await syntax was still a proposal for a future JS version, so generators was the only viable option (I tried doing it with promises, but it became a complete mess).

If async/await is more performant than generators, then I am definitely open to making this change. Please could you create and run some benchmark tests to give us an idea of how much better this might be? If it's looking promising, then we can continue discussing under issue #237 so that others interested in performance can test it.

@mattbaileyuk - interested in your thoughts on this.

Thanks again!

avaidyam commented 3 years ago

@andrew-coleman Glad to hear that async/await is fine to use! The reason why async/await would be preferred is to enable async functions (I/O bottlenecked tasks or custom functions) to actually execute in parallel, i.e. making three API requests simultaneously, instead of serially. If each API request took 700ms, the overall execution time would then be 700ms instead of 2100ms.


However, just to see what things looked like, I ran the benchmark test from this link and here are the results:

simple mapping benchmark 1 x 1,042,466 ops/sec ±1.66% (75 runs sampled)
simple mapping benchmark 2 x 18,285 ops/sec ±1.54% (78 runs sampled)
Fastest is simple mapping benchmark 1
complex mapping benchmark 1 x 419,034 ops/sec ±1.85% (74 runs sampled)
complex mapping benchmark 2 x 720 ops/sec ±1.51% (80 runs sampled)
Fastest is complex mapping benchmark 1
complex join benchmark 1 x 9,547 ops/sec ±1.39% (78 runs sampled)
complex join benchmark 2 x 43.13 ops/sec ±1.49% (67 runs sampled)
Fastest is complex join benchmark 1
complex mapping with sort benchmark 1 x 199,038 ops/sec ±1.67% (74 runs sampled)
complex mapping with sort benchmark 2 x 434 ops/sec ±2.04% (77 runs sampled)
Fastest is complex mapping with sort benchmark 1
complex join benchmark 1 x 9,931 ops/sec ±1.73% (77 runs sampled)
complex join benchmark 2 x 44.53 ops/sec ±1.11% (70 runs sampled)
Fastest is complex join benchmark 1

The performance looks about the same whether async/await is used or whether generators are used (JSONata is ~220x slower or in some cases ~550x slower vs. pure JS). Considering that JS is JIT-compiled by WebKit and V8, I feel that it may be impossible to reach that benchmark, and instead it might be worth comparing to tools like jq and libraries like JMESPath that are similar to JSONata. Considering that simple mapping was 50x slower but complex mapping was ~500x slower, it seems fair to say that something about that example (reproduced below) is causing a 10x performance drop.

laureates.{
  "name": knownName.en ? knownName.en : orgName.en,
  "gender": gender,
  "prizes": nobelPrizes.categoryFullName.en[]
}^(name)

I replaced the merge sort implementation in functions.js/sort with the JS array sort function (which doesn't work for all cases) and re-ran the tests:

simple mapping benchmark 1 x 996,417 ops/sec ±1.65% (76 runs sampled)                    
simple mapping benchmark 2 x 17,536 ops/sec ±1.92% (75 runs sampled)                     
Fastest is simple mapping benchmark 1                                                    
complex mapping with sort benchmark 1 x 202,783 ops/sec ±1.67% (78 runs sampled)         
complex mapping with sort benchmark 2 x 561 ops/sec ±2.20% (73 runs sampled)             
Fastest is complex mapping with sort benchmark 1                                         

So simply swapping out the sort function made JSONata only ~350x slower instead of ~500x, so it could potentially be about ~30% faster for sorting operations. (I'm hoping my logic with comparing performance loss makes sense and is correct enough.)

Hope that helps!

mattbaileyuk commented 3 years ago

Sorry for the slow response here.

I would generally be in favour of moving to async/await as that seems to be the more modern pattern to adopt, and has certainly been useful in some other projects to clean up the code by removing big promise chains, etc. As the current pattern it does also benefit from additional support, and node.js 12 (the minimum supported node.js level as of the end of this month) included performance improvements for async/await.

Of course, it's not a quick undertaking, so we may still need to determine if the benefits are worth it. The data from @avaidyam is really helpful (thanks!) - it seems like if we want to focus on performance benefits there may be other areas to focus on first. Although the cleaner code is also appealing - maybe we can do it bit by bit, and gradually phase the generators out.

avaidyam commented 3 years ago

I wonder if using a strategy like CAF to enable backwards compatibility with generator functions will make this an easier PR to merge into 1.x instead of as a 2.x release?