replikativ / datahike

A fast, immutable, distributed & compositional Datalog engine for everyone.
https://datahike.io
Eclipse Public License 1.0

Optimize queries to be 20 times faster #636

Closed: jonasseglare closed this 8 months ago

jonasseglare commented 1 year ago

SUMMARY

Background: We noticed that Datahike queries take a lot of time. Profiling with clj-async-profiler revealed that queries with clauses containing multiple unknown variables, e.g. [?c :concept/id ?from-id], are slow because they result in calls to the datahike.db.interface.IDB/-search protocol methods, where the database backend is asked to find all matches for those variables. However, it is often possible to substitute a variable with a set of known possible values. For instance, if we know that ?c can only be 119, 120 or 121, the pattern [?c :concept/id ?from-id] can be replaced by [119 :concept/id ?from-id], [120 :concept/id ?from-id] and [121 :concept/id ?from-id].
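The expansion described above can be sketched as a small pure function. This is illustrative only; the name `expand-pattern` is hypothetical and this is not the PR's actual implementation:

```clojure
;; Illustrative sketch, not Datahike's implementation: replace each variable
;; in a pattern by every one of its known possible values, producing one
;; more-specific pattern per combination.
(defn expand-pattern
  "Given a pattern like [?c :concept/id ?from-id] and a map of
   variable -> set of known values, return the expanded patterns."
  [pattern known-values]
  (reduce (fn [patterns var]
            (for [p patterns
                  v (known-values var)]
              ;; clojure.core/replace substitutes var by v in the vector
              (replace {var v} p)))
          [pattern]
          (filter known-values pattern)))

;; [?c :concept/id ?from-id] with ?c known to be in #{119 120 121}
;; expands into three patterns, each with a constant in the entity slot,
;; e.g. [119 :concept/id ?from-id], [120 :concept/id ?from-id], ...
(expand-pattern '[?c :concept/id ?from-id] {'?c #{119 120 121}})
```

Each expanded pattern has one fewer unknown, so the backend can answer it with a narrower index lookup instead of a full `-search` over all entities.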

Proposed improvement: Change the datahike.query engine so that it attempts to substitute variables in patterns to make the patterns more specific. This is done by introducing the function expand-constrained-patterns. Exactly how the substitution is performed is determined by a relprod strategy located in the context map at (-> context :settings :relprod-strategy). I tried out a few different strategies:

I ran a set of 135 test requests against our API and noticed that the best strategy expand-once is about 20 times faster than not trying to substitute any variables (no-strategy).

[Figure total_time_plot: total query time per strategy across the 135 test requests]

The strategies are implemented as the functions clojure.core/identity, relprod-select-simple, relprod-select-all and relprod-expand-once and can be passed by including the strategy at the path [:settings :relprod-strategy] in the query.
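If I read the description right, selecting a strategy for a single query would look roughly like this. This is a sketch based on the paths and names mentioned above; it assumes the query-map form of datahike.api/q and that the strategy vars are exposed from datahike.query, which may differ from the merged code:

```clojure
;; Sketch only: @conn is assumed to be an open Datahike connection, and
;; relprod-expand-once is assumed to live in the datahike.query namespace.
(require '[datahike.api :as d]
         '[datahike.query :as dq])

(d/q {:query    '[:find ?from-id
                  :where [?c :concept/id ?from-id]]
      :args     [@conn]
      ;; the PR reads the strategy from [:settings :relprod-strategy]
      :settings {:relprod-strategy dq/relprod-expand-once}})
```

Omitting :settings would leave the engine on its default behavior, which per the PR is equivalent to passing clojure.core/identity (no substitution).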

TimoKramer commented 1 year ago

Hey @jonasseglare, thanks for submitting this PR. Please give us some time to review this.

whilo commented 1 year ago

@jonasseglare this makes a lot of sense as an optimization and is a very nice contribution! I have thought about similar optimizations before, but did not get to them. Another one would be to use forward (and potentially reactivated backward) iterators for < and <= predicates, so that we do not first fetch everything from the index and then filter. Finally, @jsmassa and I looked into query planning (clause reordering), but have not been able to implement it yet (we mostly worked on ensuring that an ordering is runnable, i.e. all variables are bound when they are needed).

In general it would be good to make optimal choices somewhat automatically (or pick good defaults), since it can be hard for the average use case/user to think about the optimizations. I am not sure which of your options would be a safe default, probably select-simple, but I will check the implementation details.

jonasseglare commented 1 year ago

Thank you for looking into this PR! Yes, the idea of letting the strategy be a parameter is to have a safe default while providing opportunities for better performance for those who need it. I believe select-simple would be a safe default and an improvement over identity, which is equivalent to what we currently have, that is, no substitutions at all. I have not read up on the literature on this, but substituting variables in clauses seemed like a fairly simple change that can make a big difference; at least that is what we observe for our use case. We can possibly implement something even better later.

I am intrigued by your work on the query planner! How the clauses are ordered is probably related to how variables in the clauses are substituted. Without knowing anything about this topic, I would suppose that processing clauses with few unknown variables first would make it easier to solve clauses with more unknowns later. @whilo Do you have a pointer to a particular query planning algorithm that you plan to implement?

whilo commented 1 year ago

Sorry for not being super responsive. I am very excited about this PR, but I am currently busy with an important PhD milestone that should be done next week, and I also need to merge my other PR for distributed Datahike first.

Here are two fairly relevant papers on query planning: https://dl.acm.org/doi/abs/10.1145/3183713.3183733 and https://link.springer.com/chapter/10.1007/978-3-031-16767-6_5. Souffle in particular is the most scalable Datalog implementation I know, but it usually gets compiled ahead of time for a fixed dataset and query and then run once (for static analysis, though there is an interpreter now as well: https://souffle-lang.github.io/interpreter.html).

The first step is to get join size estimates; the second step would be a form of abstract interpretation of the query, where you compute sizes instead of actual results and then search for the most efficient plan. Postgres, for instance, still seems to do exhaustive search up to a certain join size (12 joins, I think) and then uses a weak evolutionary algorithm to optimize larger joins. One good way to retain the sizes of the base relations is to use HyperLogLog, as xtdb does.
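The two steps above (estimate sizes, then search orderings for a cheap plan) can be shown in miniature. This is a toy sketch in plain Clojure with given per-clause size estimates, not a real planner and not how Datahike or Postgres implement it:

```clojure
;; Toy exhaustive planner: score every clause ordering by a crude cost
;; model and pick the cheapest. Real systems switch to heuristic search
;; (e.g. Postgres's genetic optimizer) past a threshold of joins.
(defn permutations [xs]
  (if (<= (count xs) 1)
    [xs]
    (for [x xs, p (permutations (remove #{x} xs))]
      (cons x p))))

(defn plan-cost
  "Cost of an ordering: the running product of estimated clause sizes,
   summed over each prefix, standing in for intermediate join sizes."
  [sizes ordering]
  (->> ordering
       (map sizes)
       (reductions *)
       (reduce +)))

(defn best-plan
  "Pick the ordering with the lowest estimated cost."
  [sizes clauses]
  (apply min-key #(plan-cost sizes %) (permutations clauses)))

(best-plan {:a 1000 :b 10 :c 100} [:a :b :c])
;; => (:b :c :a)  ; the most selective clause goes first
```

The point of the sketch is only that once sizes are available, ordering becomes a search problem over a cost model; the hard part in practice is getting good size estimates, which is where sketches like HyperLogLog come in.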

@jonasseglare are you on the Clojurians slack? I think it would be easiest to discuss generally a bit there.

whilo commented 11 months ago

Now that the server is merged, I would focus on getting this merged before improving the writer latency. @jonasseglare can you resolve the merge conflicts?

jonasseglare commented 11 months ago

@whilo Great to hear! I am investigating a unit test that started failing after I rebased. I will write here again once I am done.

jonasseglare commented 11 months ago

I have rebased this branch on top of the latest changes, resolved the conflicts, and made all tests pass.

You can move forward with this pull request now if you like :-)

whilo commented 10 months ago

@jonasseglare Lmk if my feedback above was helpful. I still don't fully understand the expand-once strategy, but overall it makes sense how you partially evaluate and expand patterns from the tuples that are already known. In general it would be great to document/formalize it a bit, because that will facilitate reasoning, but that is not needed to get it merged for now.

jonasseglare commented 10 months ago

@whilo Thank you for the review! I think your feedback was very useful, I just haven't had time to address the comments yet.

Regarding the effectiveness of different strategies such as expand-once, I believe it depends a lot on what queries are performed and the shape of the data stored in the database. I still haven't had time to make an easy-to-run benchmark.

whilo commented 10 months ago

Ok, cool! Let me know if I can help in any way. I still want to go through the expand-once process in more detail. My current mental model tells me that the strategies can be suboptimal compared to the current implementation in the case where the current iterator would fetch a bunch of matches at once, while the strategy translates the clause into individual patterns and looks them up one by one, each time incurring the cost of a separate lookup instead of a faster batched retrieval by the iterator. Another cost is the translation into clauses itself. What is your sense of when the strategies become suboptimal? In my current understanding, partial evaluation should not be slower than grounding on the whole database; do you think so as well?

jonasseglare commented 8 months ago

I have addressed all the comments now. Most importantly, I added a file strategy_test.clj where I compare the running times of different strategies on a synthetic example inspired by the JobTech Taxonomy.

Before we merge, are there any cosmetic changes to be made? For instance, are the names of the strategies relprod-select-all, relprod-select-simple and relprod-expand-once good? When we specify a strategy in a query, is :settings {:relprod-strategy ...} a good way to pass in the strategy to the query engine or do we want to rename some keys?

whilo commented 8 months ago

> I have addressed all the comments now. Most importantly, I added a file strategy_test.clj where I compare the running times of different strategies on a synthetic example inspired by the JobTech Taxonomy.
>
> Before we merge, are there any cosmetic changes to be made? For instance, are the names of the strategies relprod-select-all, relprod-select-simple and relprod-expand-once good? When we specify a strategy in a query, is :settings {:relprod-strategy ...} a good way to pass in the strategy to the query engine or do we want to rename some keys?

Thank you! You do not need to put the prefix relprod- into the name of each strategy, as it is redundant.

jonasseglare commented 8 months ago

I made three more changes:

I think I have addressed all comments now.