tonsky / datascript

Immutable database and Datalog query engine for Clojure, ClojureScript and JS
Eclipse Public License 1.0
5.45k stars 304 forks source link

Why are Datascript queries so slow? #130

Open jbsar opened 8 years ago

jbsar commented 8 years ago

The benchmark query q4 takes 40 milliseconds to execute on a 20000 entities database on the JVM (even longer in CLJS).

By contrast, a lookup in a Clojure map of 20000 maps with the same data takes 300 nanoseconds on the JVM. Why does it take 100000 times more in Datascript?

I understand that multiple lookups in the index trees are necessary in order to filter by gender and gather the 3 datoms results. But I would imagine that would take a few microseconds to complete.

I am currently designing a front-end app that will contain a lot of entities (in the 10000s). From what I understand, it would be unusable with Datascript because reaction times would be in the 100s of milliseconds.

Am I missing something?

tonsky commented 8 years ago

This is q4:

["q4" '[:find ?e ?l ?a :where [?e :name "Ivan"]
                              [?e :last-name ?l]
                              [?e :age ?a]
                              [?e :sex :male]]]

What this query does is quite different from map lookup:

For 20000-person database, there’s not a one, but ~2500 Ivans. Half of them have :sex :male, so the size of the result set is ~1250 tuples. Allocating result set, filling it with tuples and creating tuples themselves is all part of the query, and it’s not free. DataScript also does deduplication of result (you wont’s see same tuple twice in the results), which is not free as well.

Also note that this query has 3 joins (2500 × 20000, 2500 × 20000, 2500 × 10000) which are also not free. Because of the way data is stored (flat sorted sets, not hierarchical nested maps), you can’t “just lookup” rest of the structure given its id.

Bottom line is, yes, if you can do with hierarchical nested maps, and don’t need to query them other than lookup-by-id, you better off with them.

DataScript is in different category, so expect different tradeoffs: query speed depends on the size of result set, you need to sort clasuses to have smaller joins, accessing entity properties is not free given its id, etc. As a benefit, you gain ability to query dataset for different projections, forward and reverse reference lookups, joins between different sets, etc. And direct index lookup (datascript.core/datoms) is still fast and comparable to lookup in a map (at least comparable, think binary lookup vs hashtable lookup, logarithm vs constant). Queries do much more than that.

jbsar commented 8 years ago

Thanks for your eye-opening answer!

I guess I will have to find another solution to hold the state of my application. Do you think there's a chance of drastic performance improvements happening (especially in CLJS)?

tonsky commented 8 years ago

I’m optimizing certain sorts of queries in https://github.com/tonsky/datascript/blob/master/src/datascript/query_v3.cljc. But it won’t be orders of magnitude probably. I don’t think it’s possible to have complex queries and high performance at the same time. When you need speed, use direct index lookup.

vibl commented 8 years ago

I presented an idea that might speed up queries: https://github.com/tonsky/datascript/issues/132

It'd be great if you could give some feedback on it. It looks too easy to be right...

Vianney

mhuebert commented 8 years ago

I've been gradually replacing my queries with index lookups and seeing orders-of-magnitude speedups.

Eg. this recursive pull query was taking ~150 milliseconds with my dataset:

(d/q '[:find (pull ?e [:uuid :node/label :node/ui-selected :node/order :node/cell :ui-collapsed {:node/_up ...}]) .
                       :in $ ?uuid :where [?e :uuid ?uuid]] @db uuid)

I replaced it with this function and the time dropped to ~0.1 milliseconds:

(defn get-node-tree [eid]
  (-> (select-keys (d/entity @db eid) [:uuid :node/label :node/ui-selected :node/order :node/cell :ui-collapsed])
      (assoc :node/_up (map (comp get-node-tree first) (d/datoms @db :avet :node/up eid)))))
vibl commented 8 years ago

All this performance data begs a question: could there be better performing algorithms for Datascript? Some that perform in O(log n) rather than O(n) or worse?

EugenDueck commented 7 years ago

@mhuebert That is a huge performance gain! It seems this would be one of the probably many cases where a query optimizer (if something like that already exists in datascript - I'm just starting to look into the code) would shine. Once optimized, repeated invocations of the same query would exhibit hand-optimized performance.