public-transport / hafas-client

JavaScript client for HAFAS public transport APIs.
ISC License

Performance issues #152

Closed fnorf closed 4 years ago

fnorf commented 4 years ago

First of all, thank you for this awesome tool! Second of all, I have no clue about JavaScript.

I built some awful thing based on this and was surprised by the CPU usage. I had expected the bottleneck to be the time it took the APIs to answer but actually the data munching of hafas-client into its nice format seems to take the most resources. So I looked into it:

I made a single 24 hour departures request. ~370 trips were returned. The whole process took ~1.7s.

Looking at my network logs my system spent about 0.5s talking with the remote server (reiseauskunft.bahn.de) in total.

So I looked at the process in a profiler (0x). Here is a flamegraph:

[flamegraph: full view]

If I interpret that correctly, it looks like ~60% of the time was spent with findInTree calls in parseCommonData.

I also zoomed in to the two other blocks on the left:

[flamegraph zooms: formatstationboard, parsearrordep]

Some date/time parsing and formatting.

Now, I have no idea what these do and if they are easy to make faster (or used less). Probably not... I saw some commits referencing them in https://github.com/public-transport/hafas-client/commit/2a6b0dc507bd0940ee09b16e1f097499a5e78082 and https://github.com/public-transport/hafas-client/commit/8c6a8d858edf6f75e19294f4066c252e64d11184.

Anyways, since I had looked into it and was surprised by the result, I thought I'd share this with you. There is the tiny and remote chance that you were unaware of these issues and this is useful. And even if not, this was a nice chance to say thanks again for this great project. :)

derhuerst commented 4 years ago

I had expected the bottleneck to be the time it took the APIs to answer but actually the data munching of hafas-client into its nice format seems to take the most resources.

Indeed, I would have expected the hafas-client part to be negligible.

I made a single 24 hour departures request. ~370 trips were returned. The whole process took ~1.7s.

Keep in mind that, based on the usage I've seen so far, this is not the "usual" request, because its response is so large: I got 448 KiB for 1202 departures over 24h at Berlin Friedrichstr.

Some more common use cases:

Nevertheless, I am of course in favour of making hafas-client faster!

If I interpret that correctly, it looks like ~60% of the time was spent with findInTree calls in parseCommonData.

Keep in mind that, in order to get accurate and reliable results, you need to

derhuerst commented 4 years ago

On the parse-benchmark branch, I set up a small benchmark.

Running it using 0x yields a very similar flame chart to what you reported: index.html.zip

derhuerst commented 4 years ago

Now, I have no idea what these do and if they are easy to make faster (or used less). Probably not...

Probably @simlu knows.

simlu commented 4 years ago

Oh yeah. I breathe this stuff 😉

So there are a few things to consider:

  1. object-scan's find function is very small and executed a lot, so profiler overhead might slightly exaggerate its share in the benchmark.
  2. object-scan is heavily optimized and benchmarked; it can probably be sped up further, but it is already very fast.
  3. The response is very large, and since the code uses **, the entire response tree needs to be traversed. It would be much better to narrow down what should be targeted and where it could be present in the response tree; that would reduce how much is traversed.
  4. Instead of calling findInTree multiple times, it should be called once, and the callback function should detect which needle target was hit. This would bring the number of traversals down to one (instead of scanning the response 10x or more times).

Let me know if you have any questions about the above.
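Point 4 can be sketched roughly like this, using a hand-rolled walker; the helper, field names, and data below are illustrative, not the actual hafas-client or object-scan code:

```javascript
// Hypothetical sketch: walk the response tree once and dispatch on the
// key name, instead of one findInTree call per needle.
const scanOnce = (tree, handlers) => {
  const walk = (node) => {
    if (Array.isArray(node)) node.forEach(walk);
    else if (node !== null && typeof node === 'object') {
      for (const [key, val] of Object.entries(node)) {
        if (handlers[key]) handlers[key](val, node); // pass the parent for in-place rewrites
        walk(val);
      }
    }
  };
  walk(tree);
};

// usage: resolve every remX reference against a (made-up) common list in one pass
const common = { remL: [{ code: 'bicycle carriage' }] };
const res = { svcResL: [{ jnyL: [{ remX: 0 }] }] };
scanOnce(res, {
  remX: (val, parent) => { parent.rem = common.remL[val]; },
});
// res.svcResL[0].jnyL[0].rem is now { code: 'bicycle carriage' }
```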

I can probably open a PR with these changes; it should not be very hard. However, I'm busy this weekend, so it might take a moment. Feel free to ping me if you want to give it a shot.

simlu commented 4 years ago

Also, I'm expecting the changes proposed above to have a drastic impact on performance for large responses.

derhuerst commented 4 years ago

Instead of calling findInTree multiple times, it should be called once and the callback function should detect which needle target was hit. This would bring the amount of traversal to a single one (instead of scanning the response 10x or more times)

This is probably the most significant optimisation we can implement here. But the callback in the 2nd find call may depend on the "output" of the 1st.

The response is very large and since the code uses ** it needs to traverse the entire response tree. Much better would be to narrow down what should be targeted and where it could be present in the response tree.

I don't want to maintain an explicit list of paths where this "reference resolving" (discussed in #127 and #146) needs to happen. I'd rather like to visit every remX field in the tree, every himX, etc.
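To illustrate what these reference fields mean: remX is an index into the response's common.remL list (himX into common.himL, and so on). A tiny sketch, with made-up data and a hypothetical helper name:

```javascript
// Illustration only: "resolving" a reference means looking up (or inlining)
// the object that an index field like remX points at.
const common = {
  remL: [{ code: 'bicycle' }, { code: 'wifi' }],
};
const resolveRem = (remX) => common.remL[remX]; // hypothetical helper

const stopover = { remX: 1 };
resolveRem(stopover.remX); // → { code: 'wifi' }
```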

simlu commented 4 years ago

Instead of calling findInTree multiple times, it should be called once and the callback function should detect which needle target was hit. This would bring the amount of traversal to a single one (instead of scanning the response 10x or more times)

This is probably the most significant optimisation we can implement here. But the callback in the 2nd find call may depend on the "output" of the 1st.

Could be a classic speed-vs-memory trade-off: we could do one scan with all possible matches and store them, then use the stored results as needed. That might be faster than a whole second traversal, but it would use more memory.
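That trade-off could be sketched like this, assuming a hypothetical collectByKey helper (not actual hafas-client code):

```javascript
// One traversal that stores every match grouped by key; later parse steps
// read from the stored map instead of re-scanning the response.
const collectByKey = (tree, keys) => {
  const byKey = new Map(keys.map((k) => [k, []]));
  const walk = (node) => {
    if (Array.isArray(node)) node.forEach(walk);
    else if (node !== null && typeof node === 'object') {
      for (const [key, val] of Object.entries(node)) {
        if (byKey.has(key)) byKey.get(key).push(val);
        walk(val);
      }
    }
  };
  walk(tree);
  return byKey; // kept in memory: the speed-for-memory part of the trade-off
};

const res = { jnyL: [{ remX: 0 }, { remX: 2, himX: 1 }] };
const stored = collectByKey(res, ['remX', 'himX']);
stored.get('remX'); // → [0, 2]
stored.get('himX'); // → [1]
```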

The response is very large and since the code uses ** it needs to traverse the entire response tree. Much better would be to narrow down what should be targeted and where it could be present in the response tree.

I don't want to maintain an explicit list of paths where this "reference resolving" (discussed in #127 and #146) needs to happen. I'd rather like to visit every remX field in the tree, every himX, etc.

Understandable. But do you not have any semantic knowledge whatsoever? Even baking a little in might help a lot.

simlu commented 4 years ago

@derhuerst Can you please point out the dependency that you were talking about? I'm assuming it is in this file. Cheers!

derhuerst commented 4 years ago

Could be a classic speed vs memory thing. We could do one scan with all possible matches and store them. Then use the stored results as needed. Might be faster than a whole other traversal but use more memory

Yeah, there will only be 100s to ~1000 matches, so it should be feasible; it will just increase the amount of GC to be done.

Can you please point out the dependency that you were talking about?

Which dependency? Where did I talk about one?

simlu commented 4 years ago

Which dependency? Where did I talk about one?

The dependency that I had quoted! This one:

But the callback in the 2nd find call may depend on the "output" of the 1st.

I was asking for an example of such dependency. Do you have one handy?

simlu commented 4 years ago

I've implemented the performance improvements here: https://github.com/public-transport/hafas-client/pull/154

The speedup is significant and even noticeable with the tests.

Please take a very careful look. The problem is that the tree is updated in the callback; however, the newly created branches are currently not scanned. That could be implemented, though (at a small performance penalty). Do you think it is necessary?

Looking forward to your input!

derhuerst commented 4 years ago

I don't want to maintain an explicit list of paths where this "reference resolving" [...] needs to happen. I'd rather like to visit every remX field in the tree, every himX, etc.

Understandable. But do you not have any semantic knowledge whatsoever? Even baking a little in might help a lot.

To explain more what I meant: Implicitly, there is a list of e.g. all relevant remX places: All those that the parse functions in parse/* handle.

But for two reasons, I'd rather resolve all of them:

simlu commented 4 years ago

@derhuerst I somewhat understand. That's a bit more complicated than what I was hoping for. Let's see where the pending performance changes bring us and if more optimization is needed.

Just had another thought: I maintain another package which might be handy for this project: object-rewrite.

Basically it would allow you to abstract a lot of rewrite logic in plugins, which makes it very easy to reason about changes in isolation and keep dependencies minimal. We use this to modify huge hierarchical data structures in memory (megabytes) where processing time and correctness matter. Feel free to take a look!

derhuerst commented 4 years ago

Just had another thought: I maintain another package which might be handy for this project: object-rewrite.

I see the point of it for even more generalised use cases, e.g. user-defined rewriting rules. With hafas-client, I'm not sure if it will actually make the code more maintainable.

derhuerst commented 4 years ago

The changes from #154 have been released as hafas-client@5.1.1! :shipit:

simlu commented 4 years ago

@fnorf Would be awesome if you could give feedback on these performance improvements!

derhuerst commented 4 years ago

The performance hasn't improved much actually. Without:

parse 1200 departures x 1.78 ops/sec ±2.81% (9 runs sampled) (0.56s/run)
parse 45 departures x 34.69 ops/sec ±1.64% (61 runs sampled) (0.03s/run)

With #154:

parse 1200 departures x 2.17 ops/sec ±5.08% (10 runs sampled) (0.46s/run)
parse 45 departures x 42.08 ops/sec ±2.88% (56 runs sampled) (0.02s/run)

Still, thanks @simlu for the optimisation!

simlu commented 4 years ago

@derhuerst That would indicate that the bottleneck is elsewhere. I'd love to take a look myself. Should I use the branch you mentioned above? Which test should I run?

derhuerst commented 4 years ago

Run npm run benchmark or npm run benchmark:profile on the parse-benchmark branch.

simlu commented 4 years ago

Very interesting. I did a bit of digging but will have to continue later. What I found curious is that the first execution of the test is double (!) the speed of the later ones. So it looks like the optimization that happens actually hurts the execution speed...?! Maybe it's time to add more benchmark tests for object-scan...

simlu commented 4 years ago

There is a good chance the test is faulty; I haven't actually looked at it yet. Did you do a deep clone of the input for each test run? Otherwise, the object changes between the first and second execution.

Edit: see here https://github.com/public-transport/hafas-client/commit/c9412025bc707a7be4da93fc18711d04129b7c84
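The gist of the fix can be sketched with a made-up fixture and parser: serialize once, deep-clone per run.

```javascript
// Minimal sketch of the benchmark fix: clone the input per iteration,
// because parsing mutates the input tree. (Fixture and parser are made up.)
const fixture = { common: { remL: [{ type: 'hint' }] } };
const raw = JSON.stringify(fixture); // serialize once, outside the timed loop

// a parser that mutates its input, standing in for hafas-client's parsing
const parse = (tree) => tree.common.remL.pop();

// each benchmark iteration gets a fresh deep clone, so runs are comparable
const runOnce = () => parse(JSON.parse(raw));
runOnce(); // → { type: 'hint' }
runOnce(); // → { type: 'hint' } again; `fixture` itself is untouched
```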

simlu commented 4 years ago

Also had another thought for an optimization: are there areas of the tree that we could explicitly exclude, i.e. areas where we know there couldn't be any matches?

derhuerst commented 4 years ago

Yes, the object is being mutated. My fault, sorry. https://github.com/public-transport/hafas-client/blob/c9412025bc707a7be4da93fc18711d04129b7c84/benchmark/index.js#L18

simlu commented 4 years ago

I've fixed the test setup in https://github.com/public-transport/hafas-client/pull/155/files

For the following test setup:

I now get the following results:

Before optimization changes: ~252 ms
After optimization changes: ~92 ms

That is a decrease of 160ms or 63.49%. I'd say that's pretty good :tada:

simlu commented 4 years ago

Also had another thought for an optimization: are there areas of the tree that we could explicitly exclude, i.e. areas where we know there couldn't be any matches?

What I mean is: are there any objects in the hierarchy that you know won't yield any matches?


E.g. if we knew that common.layerL was never traversed, we could do something like

const needles = [
  "**.oprX",
  "**.icoX",
  "**.prodX",
  "**.pRefL",
  "**.locX",
  "**.ani.fLocX",
  "**.ani.tLocX",
  "**.fLocX",
  "**.tLocX",
  "**.remX",
  "**.himX",
  "**.polyG.polyXL",
  "!common.layerL.**",
];

The necessary changes to have that actually impact performance are currently pending here: https://github.com/blackflux/object-scan/pull/876/files

derhuerst commented 4 years ago

What I mean is: are there any objects in the hierarchy that you know won't yield any matches?

E.g. if we knew that common.layerL was never traversed, we could do something like "!common.layerL.**".

Not many. This would be coupling the other way around, similar to what I explained above: I don't want parseCommon to arbitrarily restrict other code in which references are resolved.

derhuerst commented 4 years ago

Results including the benchmark fixes from #155:

with 8cb7d80

parse 1200 departures x 5.56 ops/sec ±5.43% (19 runs sampled) (0.18s/run)
parse 45 departures x 86.17 ops/sec ±1.24% (73 runs sampled) (0.01s/run)

without 8cb7d80

parse 1200 departures x 2.36 ops/sec ±5.51% (11 runs sampled) (0.42s/run)
parse 45 departures x 43.73 ops/sec ±1.30% (57 runs sampled) (0.02s/run)

This includes JSON parsing. I'd dare to say this is good enough™ for now.

simlu commented 4 years ago

The new object-scan version improved callback performance significantly. It might be worth updating to it (a breaking, but very simple change).

simlu commented 4 years ago

Here is the PR https://github.com/public-transport/hafas-client/pull/173