**Closed** — jonasseglare closed this pull request 3 months ago.
Thank you! Will try to take a look soon.
I am currently in a discussion with @whilo about this code and I will likely push more commits later. In other words, the code is not totally stable yet.
I pushed some more commits. We are now down at about 27% of the query time w.r.t. the main branch of Datahike:
| TARGET | ABS TIME (s) | REL TIME |
|---|---:|---:|
| Some other db | 5.098 | 6% |
| Official Datahike | 84.844 | 100% |
| Batch search refactoring | 22.760 | 27% |
These times are specifically for the Persistent Sorted Set index backend, where I have also made some important performance improvements.
I had a very good discussion with @whilo last Friday. We decided to close this PR and split the contributions into smaller PRs:
- Improving `datahike.index.persistent-set` to be much faster. I expect about 30% less query time.
- Refactoring `datahike.query` and `datahike.db.search` to do less work in the inner loops and to use transducers. I expect this will cut query time by another 30%.

All changes in total lead to a bit more than a 70% reduction in query time on the benchmark that I have been using.
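To illustrate the kind of rewrite meant by "less work in the inner loops" and "use transducers", here is a hypothetical sketch: the function names and data shapes are invented for illustration and are not Datahike's actual code.

```clojure
;; Hypothetical sketch: filtering datoms by attribute.
;; The data shape {:e _ :a _ :v _} is an illustrative stand-in.

(defn match-datoms-lazy
  "Naive version: builds an intermediate lazy sequence per step."
  [datoms wanted-attr]
  (map :v (filter #(= (:a %) wanted-attr) datoms)))

(defn match-datoms-xf
  "Transducer version: one pass over the input, no intermediate
  sequences, and the predicate (a loop invariant) is built once
  outside the loop."
  [datoms wanted-attr]
  (let [wanted? #(= (:a %) wanted-attr)]
    (into [] (comp (filter wanted?) (map :v)) datoms)))

(def datoms [{:e 1 :a :name :v "Ada"}
             {:e 2 :a :age  :v 36}
             {:e 3 :a :name :v "Grace"}])

(match-datoms-xf datoms :name)
;; => ["Ada" "Grace"]
```

The transducer version also allocates a vector directly instead of realizing a chain of lazy sequences, which is where the allocation savings come from.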
For the last PR with "breadcrumb optimizations", my plan is to do more rigorous measurements of each change and report them in comments next to the changes. For some of these measurements, it is hard to detect an improvement due to noise; it may be better to benchmark each change in isolation on some representative data using Criterium. It is also important to understand the trade-offs involved: it may be tempting to use `java.util.HashMap`, but that makes the code less portable than using, say, a transient hash map from Clojure.
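The trade-off above can be sketched with two versions of the same small task (counting occurrences); this is an illustrative example, not code from the PR. The first variant is JVM-only, the second stays within Clojure's data model and is portable to ClojureScript as well.

```clojure
;; Illustrative comparison: mutable java.util.HashMap vs. a transient
;; Clojure map. Both count occurrences of items in a collection.

(defn frequencies-jvm
  "Mutable java.util.HashMap: often fast on the JVM, but not portable."
  [xs]
  (let [m (java.util.HashMap.)]
    (doseq [x xs]
      (.put m x (inc (or (.get m x) 0))))
    (into {} m)))

(defn frequencies-transient
  "Transient hash map: local mutation, portable Clojure semantics."
  [xs]
  (persistent!
   (reduce (fn [m x] (assoc! m x (inc (get m x 0))))
           (transient {})
           xs)))

(frequencies-transient [:a :b :a])
;; => {:a 2, :b 1}
```

Whether the `HashMap` version is actually faster for a given workload is exactly the kind of question the per-change Criterium measurements would answer.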
Another thing to consider when comparing to "Some other db" is that slicing the result set using `offset` and `limit` is not optimized in Datahike: we simply compute the full result set and then slice it. A sliced query is therefore as slow as computing the full result set, whereas "Some other db" might have a more optimized implementation. It might be interesting to compare the databases without slicing and see whether the performance gap shrinks.
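The unoptimized slicing described above amounts to something like the following sketch (the function name is illustrative, not Datahike's actual API): the full result set is realized first, and `offset`/`limit` are applied on top.

```clojure
;; Illustrative sketch: offset/limit applied after the fact, so the
;; full cost of computing `results` is paid regardless of the slice.

(defn slice-results
  "Applies :offset and :limit to an already-computed result set."
  [results {:keys [offset limit]}]
  (cond->> results
    offset (drop offset)
    limit  (take limit)))

(slice-results (range 100) {:offset 10 :limit 5})
;; => (10 11 12 13 14)
```

An optimized implementation could instead push the offset and limit down into the index scan and stop early, which is presumably what makes "Some other db" faster on sliced queries.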
This PR will be split into smaller PRs. See comment above.
This pull request makes queries about twice as fast in the following ways:
- Refactoring the `datahike.query` and `datahike.db.search` namespaces to make queries more efficient using transducers, by moving some loop invariants out of loops, and by cutting down the number of allocations.

Here are some results that I computed using the datahike-benchmark repository. The `ABS TIME (s)` column shows the total time of running the benchmark for each approach; a shorter time is better. The `REL TIME` column is the time relative to the "Official Datahike" line.

In the above table, this pull request corresponds to the line "Batch search refactoring". The line "Official Datahike" is a recent version of Datahike without the changes in this pull request, and "Some other db" is another database that offers functionality similar to Datahike's.