Refactor query and search for better performance

jonasseglare commented 3 months ago

This code mainly refactors the namespaces datahike.query and datahike.db.search to improve performance of the query engine. This means that we remove the previous "relprod strategies" and the algorithm in this PR best resembles the "select-all" strategy that we had before. I believe there are many things that can contribute to the speed improvement, but this refactoring means that work is moved out from the innermost loop of the query engine and this loop now consists of a transduction inside search-batch-fn. Using transducers probably avoid the creation of lots of short-lived intermediate values and lazy sequences and seem to contribute to the performance. In some cases I use macros to move work out of loops and that comes at a cost in terms of readability. I explored simpler ways of writing the code but in the places where I do use macros to generate code to avoid work at runtime, they seem to be justified, see for example the comments in the function single-substitution-xform.

Here are the results. The results for this PR is labeled Datahike new query/search in the table below:

TARGET                      ABS TIME (s)   REL TIME
Some other db                      4.471         6%
Datahike 0.6.1558                 74.294       100%
Datahike 0.6.1559                 51.827        70%
Datahike new query/search         26.377        36%

The code to run the benchmark is found at https://gitlab.com/arbetsformedlingen/taxonomy-dev/backend/experimental/datahike-benchmark/.

I believe @whilo will want to review this code, he knows what it is about.

whilo commented 2 months ago

Can you rebase this? I hope it is not too annoying.

jonasseglare commented 2 months ago

Thanks @whilo for the review so far! I have rebased and also improved the implementation of distinct-tuples as you requested.

whilo commented 1 month ago

@jonasseglare could you do a rebase again?

jonasseglare commented 4 days ago

Before moving forward with this PR, I suggest we first review the code in https://github.com/replikativ/datahike/pull/691 .

jonasseglare commented 4 days ago

I rebased on the latest main and force-pushed this branch.

replikativ / datahike

Refactor query and search for better performance #674