tonsky / datascript

Immutable database and Datalog query engine for Clojure, ClojureScript and JS
Eclipse Public License 1.0
5.49k stars 309 forks source link

treat a query as a progressive funnel instead of independent clauses to search and join #467

Open latacora-paul opened 6 months ago

latacora-paul commented 6 months ago

Hey!

Let's consider this a work in progress since I broke something around lookup refs (will look into it) and I know you might want to make some different implementation decisions. However, the query tests are passing and this greatly reduces the number of datoms being fetched from storage in many cases.

Your fix for https://github.com/tonsky/datascript/issues/462 addressed the "constant / one possibility" case but it didn't do anything to constrain the search space in most other cases (like when there are only a few possible values for a binding as determined by an earlier where clause).

Here I'm constraining the search space of where clauses using candidate binding values (which considerably decreases the number of datoms that need to be scanned for clauses appearing later in a query). In other words, this handles both the "constant" case and allows relations from earlier where clauses to turn later searches into "a collection of constant cases".

tonsky commented 6 months ago

If I understand your approach right:

I agree that this is more precise and will fetch less datoms. I am not sure about the performance of it though. What if first clause returned 100 possible values for ?v? It might be that just fetching all [?e :attr ?v] and then doing a hash-join might be faster? Maybe we need a benchmark to prove it works?

I am also not sure where to draw the line. Sure, for one value of ?v, adding it to the pattern is faster. For maybe 2-3, too. For 100, not so sure. When to stop?

Do you have any insight into how Datomic does it? Or XTDB maybe?

latacora-paul commented 6 months ago

Yeah, what you described is correct. I'm pretty sure datomic does do something like this which is why the ordering of where clauses matters so much there for performance. I see there's some code in datahike to do it too.

Like you said, I am not sure whether it is better in all cases and where the threshold is - likely depends on whether data is already in memory, the size of the set being searched, the number of search patterns remaining from earlier clauses, the ordering of your where clauses, etc.

I will say this considerably reduces the real-world time and number of datoms being fetched for some queries I've been running against a large database using a dynamodb storage backend. Benchmarking sounds reasonable given this is pretty central to query performance (or perhaps even making it optional behavior).

galdre commented 5 months ago

I have also experimented with a farily similar approach to this one, and found that it was sometimes much faster and sometimes far slower for my app; I also experimented with inserting an arbitrary threshold, but overall I strongly suspect (partly due to experience with eva):

  1. It will greatly depend on the shape and scale of your data.
  2. Performance characteristics will dramatically differ based on whether or not you are using persistence (not to mention the latency of the persistence).

One low-hanging approach that might be beneficial to use cases like those of @latacora-paul could be to switch strategies not based on a threshold, but simply based on whether or not the data is persistently stored.

Another approach I've been mulling is to use Clojure/script's metadata to be able to hint to the query engine specific strategies to override defaults.