Closed jonasseglare closed 8 months ago
Hey @jonasseglare, thanks for submitting this PR. Please give us some time to review this.
@jonasseglare this makes a lot of sense as an optimization and is a very nice contribution! I have thought about similar optimizations before, but did not get to them. Another one would be to use forward (and potentially reactivated backward) iterators for `<` and `<=` predicates, so that we do not first fetch everything from the index and then filter. Finally, @jsmassa and I looked into query planning (clause reordering), but have not been able to implement it yet (we mostly worked on ensuring that an ordering is runnable, i.e. that all variables are bound when they are needed).
In general it would be good to make optimal choices somewhat automatically (or pick good defaults), since it can be hard for the average use case/user to think about these optimizations. I am not sure which of your options would be a safe default, probably `select-simple`, but I will check the implementation details.
Thank you for looking into this PR! Yes, the idea of letting the strategy be a parameter is to have a safe default but provide opportunities for better performance for those who need it. I believe `select-simple` would be a safe default and an improvement over `identity`, which is equivalent to what we currently have, that is, no substitutions at all. I have not read up on the literature on this, but I thought that attempting to substitute variables in clauses would be a fairly simple change that can make a big difference; at least this is what we observe for our use case. And then we can possibly implement something even better later.
I am intrigued by your work on the query planner! How the clauses are ordered is probably related to how variables in the clauses are substituted. Without knowing anything about this topic, I would suppose that processing clauses with few unknown variables first would make it easier to solve clauses with more unknowns later. @whilo Do you have a pointer to a particular query planning algorithm that you plan to implement?
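The heuristic mentioned here (process clauses with few unknown variables first) could be sketched in a few lines. Note that `order-clauses` and `variable?` are hypothetical illustrative helpers, not part of Datahike's query engine:

```clojure
;; Illustrative sketch only: order pattern clauses so that clauses with the
;; fewest currently-unbound variables are processed first. These helpers are
;; hypothetical and not part of Datahike.
(defn variable?
  "A query variable is a symbol whose name starts with ?."
  [x]
  (and (symbol? x) (.startsWith (name x) "?")))

(defn unbound-count
  "Number of variables in `clause` that are not yet in `bound-vars`."
  [bound-vars clause]
  (count (remove bound-vars (filter variable? clause))))

(defn order-clauses
  "Sort clauses by ascending number of unbound variables."
  [bound-vars clauses]
  (sort-by (partial unbound-count bound-vars) clauses))

;; With ?c already bound, the clause with one unknown is processed before
;; the clause with two unknowns:
(order-clauses #{'?c}
               '[[?x :concept/broader ?y]
                 [?c :concept/id ?from-id]])
;; => ([?c :concept/id ?from-id] [?x :concept/broader ?y])
```

A real planner would recompute the bound-variable set after each clause is processed, since solving one clause binds variables for the next.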
Sorry for not being super responsive. I am very excited about this PR, but I am currently busy with an important PhD milestone that should be done next week, and I also need to merge my other PR for distributed Datahike first.
Here are two fairly relevant papers on query planning: https://dl.acm.org/doi/abs/10.1145/3183713.3183733 and https://link.springer.com/chapter/10.1007/978-3-031-16767-6_5. Souffle in particular is the most scalable Datalog implementation I know of, but it usually gets compiled ahead of time for a fixed dataset and query and then run once (for static analysis, but there is an interpreter now as well: https://souffle-lang.github.io/interpreter.html).
The first step is to get join size estimates; the second step would be a form of abstract interpretation of the query where you compute sizes instead of actual results and then search for the most efficient plan. Postgres, for instance, still seems to do exhaustive search up to a certain join size (12 joins, I think) and then uses a weak evolutionary algorithm to optimize larger joins. One good way to track the sizes of the base relations is to use HyperLogLog, as xtdb does.
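To make the "compute sizes instead of actual results" idea concrete, here is the classic textbook estimate for the output size of an equi-join (this is a generic sketch of the standard formula, not what Postgres or Souffle actually implement):

```clojure
;; Illustrative sketch only: the standard textbook estimate for the size of
;; an equi-join, |R JOIN S| ~ (|R| * |S|) / max(V(R,a), V(S,a)), where V(X,a)
;; is the number of distinct values of the join attribute a in relation X.
;; Cardinality sketches such as HyperLogLog can supply the distinct counts
;; without scanning the relations.
(defn estimated-join-size
  [r-size s-size r-distinct s-distinct]
  (quot (* r-size s-size)
        (max r-distinct s-distinct)))

;; 1000 rows joined with 500 rows on an attribute with at most 100 distinct
;; values yields an estimate of (1000 * 500) / 100 rows:
(estimated-join-size 1000 500 100 50)
;; => 5000
```

A planner would apply this estimate recursively over candidate join orders and pick the order with the smallest intermediate results.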
@jonasseglare are you on the Clojurians slack? I think it would be easiest to discuss generally a bit there.
Now that the server is merged, I would focus on getting this merged before improving the writer latency. @jonasseglare, can you resolve the merge conflicts?
@whilo Great to hear! I am investigating unit tests that started to fail after I rebased. I will write here again once I am done.
I have rebased this branch on top of the latest changes, solved the conflicts, and made all tests pass.
You can move forward with this pull request now if you like :-)
@jonasseglare Lmk if my feedback above was helpful. I still don't fully understand the `expand-once` strategy, but overall it makes sense how you partially evaluate and expand patterns from the tuples that are already known. In general it would be great to document/formalize it a bit, because that will facilitate reasoning, but that is not needed to get it merged for now.
@whilo Thank you for the review! I think your feedback was very useful, I just haven't had time to address the comments yet.
Regarding the effectiveness of different strategies such as `expand-once`, I believe it depends a lot on what queries are performed and the shape of the data stored in the database. I still haven't had time to make an easy-to-run benchmark.
Ok, cool! Let me know if I can help in any way. I still want to go through the `expand-once` process in more detail. My current mental model tells me that the strategies can be suboptimal compared to the current implementation in the case where the current iterator would fetch a bunch of matches at once, while the strategy translates it into individual patterns and then maps them individually, each time incurring the cost of a lookup compared to a faster batched retrieval by the iterator. Another cost is the translation into clauses itself. What is your sense of when strategies become suboptimal? In my current understanding, partial evaluation should not be slower than grounding on the whole database; do you think so as well?
I have addressed all the comments now. Most importantly, I added a file `strategy_test.clj` where I compare the running times of different strategies on a synthetic example inspired by the JobTech Taxonomy.

Before we merge, are there any cosmetic changes to be made? For instance, are the names of the strategies `relprod-select-all`, `relprod-select-simple` and `relprod-expand-once` good? When we specify a strategy in a query, is `:settings {:relprod-strategy ...}` a good way to pass the strategy to the query engine, or do we want to rename some keys?
Thank you! You do not need to put the prefix `relprod-` into the name of each strategy, as it is redundant.
I made three more changes:

- `resolve-pattern-lookup-refs-or-nil`: it is actually needed. I explain why in the docstring.
- `select-all`.
- Removed the `relprod-` prefix from their names.

I think I have addressed all comments now.
SUMMARY
Background: We noticed that Datahike queries take a lot of time. Using clj-async-profiler revealed that Datahike queries with clauses that have multiple unknown variables, e.g. `[?c :concept/id ?from-id]`, take a lot of time because they result in calls to the `datahike.db.interface.IDB/-search` protocol methods, where the database backend is asked to find all matches for those variables. However, it is often possible to substitute a variable with a set of known possible values. For instance, if we know that `?c` can only be either `119`, `120` or `121`, the pattern `[?c :concept/id ?from-id]` can be replaced by `[119 :concept/id ?from-id]`, `[120 :concept/id ?from-id]` and `[121 :concept/id ?from-id]`.

Proposed improvement: Change the
`datahike.query` engine so that it attempts to substitute variables in patterns to make the patterns more specific. This is done by introducing the function `expand-constrained-patterns`. Exactly how the substitution should be performed is determined by a relprod strategy located in the `context` map at `(-> context :settings :relprod-strategy)`. I tried out a few different strategies:

- `identity` (no-strategy): Don't perform any substitutions.
- `select-simple`: Only perform substitutions that result in zero or one pattern.
- `select-all`: Perform all possible substitutions.
- `expand-once`: Performs all substitutions of `select-simple` and one more substitution (if possible) that results in more patterns.

I ran a set of 135 test requests against our API and noticed that the best strategy, `expand-once`, is about 20 times faster than not trying to substitute any variables (no-strategy).

The strategies are implemented as the functions `clojure.core/identity`, `relprod-select-simple`, `relprod-select-all` and `relprod-expand-once`, and can be passed by including the strategy at the path `[:settings :relprod-strategy]` in the query.
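The substitution described above can be sketched in a few lines. Note that `expand-pattern` here is an illustrative helper that only mirrors the idea; it is not the PR's actual `expand-constrained-patterns` implementation:

```clojure
;; Illustrative sketch only: substitute one variable in a pattern with each
;; of its known candidate values, producing one more-specific pattern per
;; value. This mirrors the idea behind expand-constrained-patterns but is
;; not Datahike's implementation.
(defn expand-pattern
  [pattern var known-values]
  (for [v known-values]
    (mapv #(if (= % var) v %) pattern)))

;; The summary's example: ?c is known to be 119, 120 or 121, so one pattern
;; with two unknowns becomes three patterns with one unknown each.
(expand-pattern '[?c :concept/id ?from-id] '?c [119 120 121])
;; => ([119 :concept/id ?from-id]
;;     [120 :concept/id ?from-id]
;;     [121 :concept/id ?from-id])
```

A query would then opt into a strategy through the `[:settings :relprod-strategy]` path described above, e.g. by including `:settings {:relprod-strategy relprod-expand-once}` in the query map (the surrounding query-map shape here is an assumption based on this thread, not a documented API).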