replikativ / datahike

A fast, immutable, distributed & compositional Datalog engine for everyone.
https://datahike.io
Eclipse Public License 1.0

Refactor query engine to be faster #668

Closed · jonasseglare closed this 3 months ago

jonasseglare commented 4 months ago

This pull request makes queries about twice as fast.

Here are some results that I computed using the datahike-benchmark repository. The ABS TIME (s) column shows the total wall-clock time for running the benchmark with each approach; shorter is better. The REL TIME column is the time relative to the "Official Datahike" line.

TARGET                     ABS TIME (s)   REL TIME
Some other db                     4.629         5%
Official Datahike                84.844       100%
Batch search refactoring         42.198        50%

In the above table, this pull request corresponds to the "Batch search refactoring" line. "Official Datahike" is a recent version of Datahike without the changes in this pull request, and "Some other db" is another database that offers functionality similar to Datahike's.
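For anyone who wants to reproduce this kind of measurement on a single query, here is a minimal sketch using Criterium against an in-memory Datahike database. The schema and data are hypothetical stand-ins, not the actual datahike-benchmark workload:

```clojure
(ns bench-sketch
  (:require [datahike.api :as d]
            [criterium.core :as crit]))

;; Hypothetical in-memory store; the real benchmark uses the
;; datahike-benchmark repository's own schemas and datasets.
(def cfg {:store {:backend :mem :id "bench"}})

(d/create-database cfg)
(def conn (d/connect cfg))

(d/transact conn [{:db/ident       :user/name
                   :db/valueType   :db.type/string
                   :db/cardinality :db.cardinality/one}])
(d/transact conn (for [i (range 10000)]
                   {:user/name (str "user-" i)}))

;; Criterium handles JIT warm-up and reports statistically
;; robust timings, which helps against measurement noise.
(crit/quick-bench
 (d/q '[:find ?e :where [?e :user/name "user-42"]] @conn))
```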

TimoKramer commented 4 months ago

Thank you! Will try to take a look soon.

jonasseglare commented 4 months ago

I am currently in a discussion with @whilo about this code and I will likely push more commits later. In other words, the code is not totally stable yet.

jonasseglare commented 4 months ago

I pushed some more commits. We are now down to about 27% of the query time relative to the main branch of Datahike:

TARGET                        ABS TIME (s)   REL TIME
Some other db                        5.098         6%
Official Datahike                   84.844       100%
Batch search refactoring            22.760        27%

These times are specifically for the Persistent Sorted Set index backend, where I have also made some important performance improvements.
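For context, the index backend is selected through Datahike's configuration map. A minimal sketch of creating a database with the Persistent Sorted Set index follows; the :mem store and database id are illustrative, and as far as I know the alternative :index value is :datahike.index/hitchhiker-tree:

```clojure
(require '[datahike.api :as d])

;; Select the Persistent Sorted Set index backend via the :index key.
(def cfg {:store {:backend :mem :id "pss-demo"}
          :index :datahike.index/persistent-set})

(d/create-database cfg)
(def conn (d/connect cfg))
```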

jonasseglare commented 3 months ago

I had a very good discussion with @whilo last Friday. We decided to close this PR and split the contributions into smaller PRs.

All changes combined lead to a bit more than a 70% reduction in query time on the benchmark that I have been using (1 - 22.760/84.844 ≈ 0.73).

For the last PR with the "breadcrumb optimizations", my plan is to measure each change more rigorously and report the results in comments next to the changes. For some of these changes the improvement is hard to measure due to noise; one option is to benchmark just the change in isolation on representative data using Criterium. It is also important to understand the tradeoffs made: it may be tempting to use java.util.HashMap, but that makes the code less portable than using, say, a transient hash map from Clojure, as the sketch below illustrates.
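To illustrate that portability tradeoff, here is a minimal sketch of my own (not code from this PR) of the same counting loop written both ways. The HashMap version is JVM-only, while the transient version also runs on ClojureScript:

```clojure
;; JVM-only: mutable java.util.HashMap (typically fast, not portable).
(defn count-occurrences-jvm [xs]
  (let [m (java.util.HashMap.)]
    (doseq [x xs]
      (.put m x (inc (long (.getOrDefault m x 0)))))
    (into {} m)))

;; Portable: Clojure transient hash map, works on JVM and ClojureScript.
(defn count-occurrences-portable [xs]
  (persistent!
   (reduce (fn [m x] (assoc! m x (inc (get m x 0))))
           (transient {})
           xs)))
```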

Another thing to consider when comparing to "Some other db" is that slicing the result set using offset and limit is not optimized in Datahike: we simply compute the full result set and then slice it. A sliced query is therefore as slow as computing the full result set, whereas "Some other db" might have a more optimized implementation. It might be interesting to compare the databases without slicing and see whether the performance gap shrinks.
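If I read the query API correctly, this slicing happens through the :offset and :limit keys of Datahike's query map form. A minimal sketch, where the attribute name is illustrative and conn is a connection as in the earlier sketches:

```clojure
;; With the current implementation, the full result set is computed
;; first and only then reduced to the requested window of 20 rows.
(d/q {:query '[:find ?n
               :where [?e :user/name ?n]]
      :args   [@conn]
      :offset 100
      :limit  20})
```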

jonasseglare commented 3 months ago

This PR will be split into smaller PRs. See comment above.