replikativ / datahike

A fast, immutable, distributed & compositional Datalog engine for everyone.
https://datahike.io
Eclipse Public License 1.0

Refactor query engine to be faster #668

Closed · jonasseglare closed this 3 months ago

jonasseglare commented 4 months ago

This pull request makes queries about twice as fast.

Here are some results that I computed using the datahike-benchmark repository. The ABS TIME (s) column shows the total wall-clock time for running the benchmark with each approach; shorter is better. The REL TIME column is the time relative to the "Official Datahike" line.

TARGET                     ABS TIME (s)   REL TIME
Some other db                     4.629         5%
Official Datahike                84.844       100%
Batch search refactoring         42.198        50%

In the above table, this pull request corresponds to the "Batch search refactoring" line. "Official Datahike" is a recent version of Datahike without the changes in this pull request, and "Some other db" is another database that offers functionality similar to Datahike's.
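For anyone who wants to reproduce this kind of measurement on a single query, here is a minimal sketch using Criterium against an in-memory Datahike database. The schema and data are hypothetical stand-ins, not the actual datahike-benchmark workload:

```clojure
(ns bench-sketch
  (:require [datahike.api :as d]
            [criterium.core :as crit]))

;; Hypothetical in-memory store; the real benchmark uses the
;; datahike-benchmark repository's own schemas and datasets.
(def cfg {:store {:backend :mem :id "bench"}})

(d/create-database cfg)
(def conn (d/connect cfg))

(d/transact conn [{:db/ident       :user/name
                   :db/valueType   :db.type/string
                   :db/cardinality :db.cardinality/one}])
(d/transact conn (for [i (range 10000)]
                   {:user/name (str "user-" i)}))

;; Criterium handles JIT warm-up and reports statistically
;; robust timings, which helps against measurement noise.
(crit/quick-bench
 (d/q '[:find ?e :where [?e :user/name "user-42"]] @conn))
```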

TimoKramer commented 4 months ago

Thank you! Will try to take a look soon.

jonasseglare commented 4 months ago

I am currently in a discussion with @whilo about this code and I will likely push more commits later. In other words, the code is not totally stable yet.

jonasseglare commented 4 months ago

I pushed some more commits. We are now down to about 27% of the query time relative to the main branch of Datahike:

TARGET                        ABS TIME (s)   REL TIME
Some other db                        5.098         6%
Official Datahike                   84.844       100%
Batch search refactoring            22.760        27%

These times are specifically for the Persistent Sorted Set index backend, where I have also made some important performance improvements.
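For context, the index backend is selected through Datahike's configuration map. A minimal sketch of creating a database with the Persistent Sorted Set index follows; the :mem store and database id are illustrative, and as far as I know the alternative :index value is :datahike.index/hitchhiker-tree:

```clojure
(require '[datahike.api :as d])

;; Select the Persistent Sorted Set index backend via the :index key.
(def cfg {:store {:backend :mem :id "pss-demo"}
          :index :datahike.index/persistent-set})

(d/create-database cfg)
(def conn (d/connect cfg))
```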

jonasseglare commented 3 months ago

I had a very good discussion with @whilo last Friday. We decided to close this PR and split the contributions into smaller PRs.

All changes combined lead to a bit more than a 70% reduction in query time on the benchmark that I have been using (1 - 22.760/84.844 ≈ 0.73).

For the last PR with the "breadcrumb optimizations", my plan is to measure each change more rigorously and report the results in comments next to the changes. For some of these changes the improvement is hard to measure due to noise; one option is to benchmark just the change in isolation on representative data using Criterium. It is also important to understand the tradeoffs made: it may be tempting to use java.util.HashMap, but that makes the code less portable than using, say, a transient hash map from Clojure, as the sketch below illustrates.
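To illustrate that portability tradeoff, here is a minimal sketch of my own (not code from this PR) of the same counting loop written both ways. The HashMap version is JVM-only, while the transient version also runs on ClojureScript:

```clojure
;; JVM-only: mutable java.util.HashMap (typically fast, not portable).
(defn count-occurrences-jvm [xs]
  (let [m (java.util.HashMap.)]
    (doseq [x xs]
      (.put m x (inc (long (.getOrDefault m x 0)))))
    (into {} m)))

;; Portable: Clojure transient hash map, works on JVM and ClojureScript.
(defn count-occurrences-portable [xs]
  (persistent!
   (reduce (fn [m x] (assoc! m x (inc (get m x 0))))
           (transient {})
           xs)))
```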

Another thing to consider when comparing to "Some other db" is that slicing the result set using offset and limit is not optimized in Datahike: we simply compute the full result set and then slice it. A sliced query is therefore as slow as computing the full result set, whereas "Some other db" might have a more optimized implementation. It might be interesting to compare the databases without slicing and see whether the performance gap shrinks.
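If I read the query API correctly, this slicing happens through the :offset and :limit keys of Datahike's query map form. A minimal sketch, where the attribute name is illustrative and conn is a connection as in the earlier sketches:

```clojure
;; With the current implementation, the full result set is computed
;; first and only then reduced to the requested window of 20 rows.
(d/q {:query '[:find ?n
               :where [?e :user/name ?n]]
      :args   [@conn]
      :offset 100
      :limit  20})
```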

jonasseglare commented 3 months ago

This PR will be split into smaller PRs. See comment above.