tonsky / datascript

Immutable database and Datalog query engine for Clojure, ClojureScript and JS
Eclipse Public License 1.0

Idea for a reactive and faster alternative to Datascript #132

Open vibl opened 8 years ago

vibl commented 8 years ago

Just an idea

As someone pointed out (#130), a faster Datascript implementation is needed for applications with a lot of entities.

Moreover, it would be great to have a system that would offer reactive views (think auto-refreshed materialized views or live queries). This is not currently possible with Datascript (see #12).

Here is an idea for a reactive and faster alternative (or complement) to Datascript inspired by the Event Sourcing pattern.

The system would consist of a datom store, nested map indexes, and views.

Datom store

The datom store could be a simple list of datoms ordered by transaction time.

The datom store would be the single source of truth. Datoms should only be deleted in rare cases, for memory-saving or confidentiality purposes. The datom store could thus serve as a replayable audit trail for production debugging (together with view queries).

Nested map indexes

Indexes would be nested Clojure maps with three levels. For example, the AVE index: {:attributes {values {eids entities}}}.

They would be highly denormalized in order to minimize lookups: entities would be replicated as maps of {attribute value} everywhere in the index, except for ref-type attributes, which would only reference eids (to avoid infinite loops in joins). Eids would be referenced as an attribute in the AVE index.

Snapshots of the indexes could be stored locally and on the server to speed up application start-up, i.e. to avoid recomputing the indexes from the datom store each time the application starts. They could also be used for an undo/redo feature.
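
A minimal Clojure sketch of what such a denormalized AVE index might look like (the shape and data are illustrative, not an existing DataScript structure):

;; Three levels: attribute -> value -> eid -> full entity map.
(def ave-index
  {:name {"Ivan" {1 {:db/id 1 :name "Ivan" :last-name "Petrov"
                     :age 15 :sex :male}}}
   :sex  {:male  {1 {:db/id 1 :name "Ivan" :last-name "Petrov"
                     :age 15 :sex :male}}}})

;; A single get-in resolves attribute + value to a map of matching entities.
(get-in ave-index [:name "Ivan"])
;; => {1 {:db/id 1, :name "Ivan", :last-name "Petrov", :age 15, :sex :male}}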

Views

A view would hold the result of a live query run on indexes or on transaction datoms. It would be updated each time a tx-data datom matches a pattern. The default pattern could just be the :where clauses of the query.

Queries with entity joins would run on the indexes. Queries without entity joins might only need to run on the entities referred to in the transaction datoms; if their result could be merged into the corresponding view, that would be much faster than a query on the indexes. These could be called 'delta queries'.

Views could be declared in any part of the application. UI components could subscribe to any of these views and be updated automatically whenever they change.

Transaction history could be used in views, for example to power an undo/redo feature.

Faster?

The Datascript benchmark includes a query (q4) that runs in 40 milliseconds for 20000 entities (see #130):

[:find ?e ?l ?a
 :where [?e :name "Ivan"]
        [?e :last-name ?l]
        [?e :age ?a]
        [?e :sex :male]]

Let's imagine how this query would run on map-based indexes:

  1. In the AVE index, look for the :name attribute, then look for the "Ivan" value and return the map of entities.
  2. In the AVE index, look for the :sex attribute, then look for the :male value and return the map of entities.
  3. Intersect these two maps.
  4. Keep only the selected attributes (eid, :last-name and :age) in the result.
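
A rough Clojure sketch of these four steps against the hypothetical AVE index above, where intersecting the two entity maps reduces to a select-keys on shared eids:

(let [ivans  (get-in ave-index [:name "Ivan"]) ; step 1
      males  (get-in ave-index [:sex :male])   ; step 2
      joined (select-keys ivans (keys males))] ; step 3: intersect on eids
  ;; step 4: keep only the eid, :last-name and :age
  (map (fn [[eid entity]] [eid (:last-name entity) (:age entity)])
       joined))
;; => ([1 "Petrov" 15])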

This query would probably take no more than 4-40 microseconds, even on a large dataset (cf. benchmarks of CLJ and CLJS maps lookups).

What's more, this query (having no entity join) might be turned automatically into a live delta query that would just need to run on tx-data datoms before updating the view.

Comments are welcome!

I'm not experienced enough in Clojure to hack together a proof of concept (with Datalog and Pull queries) yet. I should be in a few weeks/months though. Please keep me in the loop if you're considering working on it.

I am quite new to this community and I don't even know if it is useful to present an idea like this as a suggestion without any code. I probably have overlooked things.

What do you think?

Vianney

haywoood commented 8 years ago

Wouldn't this be re-inventing the wheel a bit? Sounds like om.next?

vibl commented 8 years ago

@lsdafjklsd Which part of this idea sounds like which part of Om Next?

vibl commented 8 years ago

Ok, I guess this idea is so stupid nobody even bothers saying it is. :-)

Fair enough!

Vianney

laforge49 commented 8 years ago

I think it likely that at some point alternatives to the current implementations of datascript/om.next will be developed. Things tend to evolve rapidly. I've contributed to open source for years and not getting a response to an idea is more common than not.

Don't despair. Fork & play. Your depth of understanding will increase manifold.

vibl commented 8 years ago

Thanks @laforge49!

Yes, it's understandable that people don't have the time to test every idea out there, and don't want to express an opinion until they have done so.

I was so sure that my idea had a big flaw I hadn't seen that I thought it would be pointed out by an experienced developer.

tonsky commented 8 years ago

You can materialize queries right now, using DS transaction queue notifications. Every time there's a transaction, update your view.
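
For example, a minimal sketch with DataScript's d/listen! (it naively re-runs the full query on every transaction):

(require '[datascript.core :as d])

(def conn (d/create-conn {}))
(def view (atom nil))

;; Refresh the cached view against the post-transaction db value.
(d/listen! conn :my-view
  (fn [tx-report]
    (reset! view
            (d/q '[:find ?e :where [?e :name "Ivan"]]
                 (:db-after tx-report)))))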

Imagine you have a query like this:

[:find ?e ?l ?a
 :where [?e :name "Ivan"]
        [?e :last-name ?l]
        [?e :age ?a]
        [?e :sex :male]]

And got a transaction like this:

[[:db/add 788 :age 15]]

What do you suggest your algorithm should do? Given that the query was run before, and you have its cached result and everything else you need. Just want to understand what you are proposing.

vibl commented 8 years ago

The entity involved in the transaction could be pulled from the Entities index (one lookup) after the index has been updated with this transaction (so that the pulled entity contains the new data).

All the "delta queries" (i.e., queries without entity joins) could then be run against this one entity, which would be very cheap.

Then, the query results (if any) could be merged into the corresponding views.

In your example, let's say the pulled entity is:

{:db/id      788
 :name       "Ivan"
 :last-name  "Petrov"
 :sex        :male
 :age        15
 :salary     50000 }

The query result would be:

{:db/id      788
 :last-name  "Petrov"
 :age        15 }

Which, merged into the view corresponding to this query, would update it pretty efficiently.

If the pulled entity's :name is not "Ivan" or its :sex is not :male, then there would be nothing to update.
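
A toy sketch of this check-and-merge step (matches? and delta-update are hypothetical helpers; the view is assumed to be a map keyed by eid):

;; Does the pulled entity satisfy the query's constant clauses?
(defn matches? [entity]
  (and (= (:name entity) "Ivan")
       (= (:sex entity) :male)))

;; If so, merge the projected result into the view; otherwise leave it alone.
(defn delta-update [view entity]
  (if (matches? entity)
    (assoc view (:db/id entity)
           (select-keys entity [:db/id :last-name :age]))
    view))

(delta-update {} {:db/id 788 :name "Ivan" :last-name "Petrov"
                  :sex :male :age 15 :salary 50000})
;; => {788 {:db/id 788, :last-name "Petrov", :age 15}}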

It's true that this kind of "delta query" would also work with Datascript as is. On the other hand, the live queries that are not "delta" (the ones with entity joins) might not be possible to implement in an efficient way with Datascript, according to this thread: #12.

What do you think?

Vianney

tonsky commented 8 years ago

What's a pulled entity? I thought you were talking about caching query results. Query results consist of a lot of entities.


vibl commented 8 years ago

When I write "pulling an entity", I mean "querying the database to extract the data of an entity". Indeed, this has nothing to do with caching query results.

tonsky commented 8 years ago

Ok, yes, you can pull up other attributes of an entity from a transaction. What do you suggest we should do next? How to insert it into the query?

(btw pulling entity attributes is a very fast operation in the current impl — basically two binary searches, so I doubt it'll change anything in the time complexity of the query algorithm. We should probably move on to discuss the modifications that you suggest for the query)
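
For reference, that pull is a one-liner with the existing API (here db stands in for a database value holding the example data):

(require '[datascript.core :as d])

;; Pull every attribute of entity 788.
(d/pull db '[*] 788)
;; => {:db/id 788, :name "Ivan", :last-name "Petrov", :sex :male, :age 15}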


vibl commented 8 years ago

What do you suggest we should do next? How to insert it into the query?

I don't understand your question. Can't "delta queries" be executed against a single entity like I described? And if so, don't you agree that the result of this can be merged into a view to update it efficiently?

As I wrote earlier, the "delta queries" you asked about are not the difficult part here (they can already be implemented in Datascript as is), so we should indeed probably move on to discuss the map-based indexes I suggested. If they can serve queries in a matter of microseconds (as opposed to milliseconds), then they would make live queries possible: any query could be executed again each time the index is updated, because it would be so cheap to do so. Do you have questions about this part of my proposal?

Vianney

tonsky commented 8 years ago

What do you call a "delta query"? A query without a join is just a single index lookup. They're already pretty fast, so why optimise them further? And if they are so fast, why run all of them at transaction time instead of running them on demand?

vibl commented 8 years ago

You're right, I shouldn't have mentioned "delta queries" at all in my proposal (there are only two sentences about them, btw) or in my messages. They are not important if we can implement a much faster version of Datascript (they would be if we can't). Can we forget about them and move on to discuss the nested map indexes I suggested?

Here I rewrote my idea focusing on just nested map indexes to simplify the reading:

Nested map indexes

Indexes would be nested Clojure maps with three levels. For example, the AVE index: {:attributes {values {eids entities}}}.

They would be highly denormalized in order to minimize lookups: entities would be replicated as maps of {attribute value} everywhere in the index, except for ref-type attributes, which would only reference eids (to avoid infinite loops in joins). Eids would be referenced as an attribute in the AVE index.

Snapshots of the indexes could be stored locally and on the server to speed up application start-up, i.e. to avoid recomputing the indexes from the datom store each time the application starts. They could also be used for an undo/redo feature.

How much faster?

The Datascript benchmark includes a query (q4) that runs in 40 milliseconds for 20000 entities (see #130):

[:find ?e ?l ?a
 :where [?e :name "Ivan"]
        [?e :last-name ?l]
        [?e :age ?a]
        [?e :sex :male]]

Let's imagine how this query would run on map-based indexes:

  1. In the AVE index, look for the :name attribute, then look for the "Ivan" value and return the map of entities.
  2. In the AVE index, look for the :sex attribute, then look for the :male value and return the map of entities.
  3. Intersect these two maps.
  4. Keep only the selected attributes (eid, :last-name and :age) in the result.

This query would probably take no more than 4-40 microseconds, even on a large dataset (cf. benchmarks of CLJ and CLJS maps lookups).

Comments are welcome!

I'm not experienced enough in Clojure to hack together a proof of concept (with Datalog and Pull queries) yet. I should be in a few weeks/months though. Please keep me in the loop if you're considering working on it.

I am quite new to this community and I don't even know if it is useful to present an idea like this as a suggestion without any code. I probably have overlooked things.

What do you think?

Vianney

ghost commented 8 years ago

@vibl None of this is new; it's called incremental view maintenance in database speak. If you're interested I could share my folder of papers and books on this. If you look at the most basic way to evaluate Datalog programs (semi-naive bottom-up), you'll see that it in fact already implements such an algorithm, as it iterates the query over old and newly generated facts (often denoted as the delta) until a fixpoint is reached. It becomes somewhat tricky, however, when facts are removed.
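
As a toy illustration of semi-naive bottom-up evaluation (a Clojure sketch, not how any of the systems mentioned here implement it), a transitive-closure rule only ever joins the newest facts against the full relation:

(require '[clojure.set :as set])

;; path(x,y) :- edge(x,y).
;; path(x,y) :- path(x,z), edge(z,y).
(defn extend-paths [edges delta]
  (set (for [[a b] delta
             [c d] edges
             :when (= b c)]
         [a d])))

(defn transitive-closure [edges]
  (loop [paths edges
         delta edges] ; the "newly generated facts"
    (let [new-paths (set/difference (extend-paths edges delta) paths)]
      (if (empty? new-paths)
        paths ; fixpoint reached
        (recur (set/union paths new-paths) new-paths)))))

(transitive-closure #{[1 2] [2 3] [3 4]})
;; => #{[1 2] [2 3] [3 4] [1 3] [2 4] [1 4]}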

I think Datascript/Datomic already implements a system very similar to what you describe. So in order not to reinvent the wheel, I'd highly recommend a quick google for scientific papers and books as you come up with ideas :), then you can jumpstart off the stuff people have already tried and don't have to repeat their failures. If you want to follow the stream approach, for example, you will find that some work has been done already.

Also I'd like to know where you get your timing estimates from (no more than 4-40 microseconds) ;P.

@tonsky Even though this proposal looks somewhat naive, with the proposed implementation just adding noise so far, the general gist of it is correct I think.

Datascript could profit immensely from materialised views and their incremental maintenance. Queries that contain quite a few joins and recursion can have a somewhat gnarly runtime, but when evaluated incrementally their cost can be amortised quite nicely, because the fact-deltas (the sets of added and removed facts within one transaction) are typically small, and the delta of a join a ⋈ b can be written as (∆a ⋈ b) ∪ (a ⋈ ∆b).
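
A toy Clojure sketch of that identity over sets of tuples (names are hypothetical; note that for pure insertions the full delta also needs the ∆a ⋈ ∆b cross term, included below):

(require '[clojure.set :as set])

;; Join two relations of [eid value] tuples on eid.
(defn join [a b]
  (set (for [[e x] a
             [e' y] b
             :when (= e e')]
         [e x y])))

;; Incremental maintenance: only the (typically tiny) deltas da and db
;; are joined against the full relations, instead of recomputing a ⋈ b.
(defn join-delta [a b da db]
  (set/union (join da b) (join a db) (join da db)))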

Additionally, systems like reagent would profit immensely from the ability to specify reactive queries that get reevaluated as new facts come in. Note that this doesn't have to be eager, as you mentioned earlier: it is possible to cache the old query result while collecting the delta sets until the materialised query is actually requested (on request-animation-frame or similar), at which point the incremental evaluation algorithm is run. And best of all, we get this basically for free with the datoms of a tx (even though it's somewhat unfortunate that datascript currently doesn't keep retractions).

Posh tries to solve this problem by providing a query (pattern) to query whether a query needs updating (yo dawg!), but imho this makes matters worse, because now one has to worry about two parts that might go out of sync. A datascript query is already a perfectly fine source of information on when to update itself.

On an additional note, I think this thread shows that it would be immensely useful to place datascript on a rock-solid theoretical foundation. Not only would it allow us to harvest the years of research done in this field more easily, it would also give us some common terminology when talking about concepts.

Cheers, JP

vibl commented 8 years ago

Glad to see that I added useful "noise".

Conaws commented 8 years ago

@j-pb I'd love your folder of papers. @vibl What's your email? A friend and I are looking into a way to sync multiple datascript dbs together with Horizon or some other kv store; some stuff here I think might apply.

metasoarous commented 7 years ago

@j-pb Hear, hear! I concur; incremental view maintenance is the right direction. At least, that's the opinion I've formed after researching this question over the last few months.

I'd argue Posh does about as good a job as can be expected, using DataScript as-is, of building materialized views that don't update any more frequently than necessary. At the time of your writing, I think you had to specify the patterns manually, but now they're computed directly from the queries. However, they don't always do the right thing in various edge cases. And at the end of the day, having to recompute the queries remains a huge waste.

It's not perfectly clear to me from the Shallow dive into DataScript internals post whether DataScript does fully-naive or semi-naive query evaluation (my understanding is that most of the difference is in how rules are processed, and there isn't a lot of detail about that in the post). In short, though, the semi-naive approach ends up being more efficient than fully-naive by reducing log scans, and it uses the ideas behind incremental view maintenance to do this. So, as hinted by @j-pb, if DataScript is already using this methodology, we may already be part of the way there; if not, going in this direction may speed up one-off query execution while also paving the way for reactive, incrementally maintained, materialized query views.

@tonsky What do you think about this idea? If you're generally amenable to considering it, I could open up a new issue to discuss particulars of this approach and make a rough proposal along the lines of what it might look like, at least from the API side.

tonsky commented 7 years ago

@metasoarous well, I'd like to hear your thoughts, of course. I agree that recomputing queries would be barely useful from a perf standpoint. Feel free to open up a new issue to discuss this.

garrett-hopper commented 6 years ago

Has there been any more thought put into this recently?

carocad commented 5 years ago

I recently launched rata, a very thin layer on top of Datascript for reactive queries with reagent atoms. Views are re-rendered automatically based on the latest results of the queries.

I tried to keep it as lean as possible (just around 20 loc), so I depend a lot on both Datascript's query caching and reagent's result caching.

It is not based on any fancy papers but it sure works 😅. Your feedback is most welcome :)

metasoarous commented 5 years ago

@carocad This sounds basically like posh, without the machinery for deciding whether to rerun queries.

carocad commented 5 years ago

@carocad This sounds basically like posh, without the machinery for deciding whether to rerun queries.

it is :)

...

it just so happens that the machinery is baked into reagent. You can check reagent's track function. It automatically decides which queries to re-run based on the component lifecycle and reuses the results of the functions to avoid multiple executions
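
Roughly, the pattern might be sketched like this (illustrative only; assumes the current DataScript db value is kept in a reagent atom so that derefs are tracked):

(require '[reagent.core :as r]
         '[datascript.core :as d])

;; Keep the current db value in a reagent atom.
(defonce app-db (r/atom (d/empty-db)))

(defn run-query [query]
  (d/q query @app-db))

(defn adult-names []
  ;; r/track caches the result: the query re-runs only when app-db changes,
  ;; and components deref'ing the same track share a single execution.
  @(r/track run-query '[:find ?n :where [?e :age ?a] [(> ?a 20)] [?e :name ?n]]))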

garrett-hopper commented 5 years ago

@carocad, I've been starting to wonder if something like this would be good enough. It's way simpler, and I don't think the performance overhead of re-running all active queries upon a transaction will be too impactful with the amount of data in an average application. Have you been using this approach on any large projects?

metasoarous commented 5 years ago

I've been starting to wonder if something like this would be good enough.

I'll beg to differ. For small apps, this may be fine. But I don't know why you'd want to hem yourself in this way. Too many queries or too much data and you're going to wish that you had Posh's tx-pattern filters.

it just so happens that the machinery is baked into reagent.

I think you're missing something. Yes, Reagent has functionality to track which queries have updated. However, what it doesn't have that Posh has (and how could it, without assuming you would be working with DataScript) is the ability to look at the datoms created in a transaction and compare them with the patterns in your queries, to figure out which queries have a snowball's chance in hell of needing to be updated, and only update those queries. It's not perfect, and misses some edge cases like recursive rules, but it's better than nothing in my opinion.

The real question is why you'd want to implement your own rather than use Posh. If for fun, then great! But if you're hoping folks will actually use it, it's going to have to have a feature that makes it worth choosing over Posh.

carocad commented 5 years ago

Have you been using this approach on any large projects?

@Jumblemuddle no, so far my frontend apps are not that big. I am looking forward to some feedback on that to be honest :)

The real question is why you'd want to implement your own rather than use Posh. If for fun, then great! But if you're hoping folks will actually use it, it's going to have to have a feature that makes it worth choosing over Posh.

@metasoarous there is no grand vision on this. I am not trying to make everybody switch; this is not a competition. I saw an opportunity to make it simpler and I took it. Whether people use it in production, play around with it, or take it as inspiration is up to everyone to decide.

It's not perfect, and misses some edge cases like recursive rules, but it's better than nothing in my opinion.

That is a feature of rata. Every query and pull pattern that works in Datascript will automatically work in rata. Every performance improvement in Datascript and reagent will benefit rata automatically. Simplicity is a feature.

garrett-hopper commented 5 years ago

I'll beg to differ. For small apps, this may be fine. But I don't know why you'd want to hem yourself in this way. Too many queries or too much data and you're going to wish that you had Posh's tx-pattern filters.

I think the question is how small the app has to be for it not to matter. My guess would be that quite a few apps simply wouldn't notice the performance impact. Oftentimes, re-running all Datascript queries is 'fast enough'™.

As @carocad stated, simplicity is a feature. I've messed with Posh in the past, but I ended up not liking it due to some of the edge cases I ran into. I don't want to add another layer of complexity to my app.

bhurlow commented 5 years ago

Hi @j-pb, could I also request your email to ask for these papers? would love to read

Folcon commented 2 years ago

So I've been looking into this area and I've been wondering what happened to the discussion here? Have people solved the problem somewhere else? In which case please tell me =)...

What I've looked into / thought about: there is a problem with datascript in cases where the number of queries you have is significant, especially when they can share clauses.

In this case re-running all the queries is not 'fast enough'™.

For example imagine an app with queries like so:

[:find ?e ?a ?l ?n
 :where [?e :age ?a]
        [?e :sex :male]
        [?e :name ?n]
        [?e :last-name ?l]]

[:find ?e ?a ?l ?n
 :where [?e :age ?a]
        [(> ?a 20)]
        [?e :name ?n]
        [?e :last-name ?l]]

[:find ?e ?a ?l ?n
 :where [?e :age ?a]
        [?e :sex :female]
        [(< ?a 60)]
        [?e :name ?n]
        [?e :last-name ?l]]

Except you have, say, 2000 queries and 100k entities.

Your goal is to be able to tell:

  1. Do I need to re-run this query?
  2. When queries have shared clauses, can I avoid doing the lookup entirely as I've already done it before?
  3. In fact, seeing as the queries only change on a transact, and the tx-report tells me what's been changed, can I just conj or disj exactly what changed from my query results instead of having to look up the entire clause? (See the sketch below.)
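
As a toy illustration of point 3, a cached result set for a single clause could in principle be patched straight from the tx-report's :tx-data instead of re-running the lookup (a sketch; patch-age-result is hypothetical, and DataScript datoms destructure as [e a v tx added]):

;; Patch a cached result set of [?e ?a] tuples for the clause [?e :age ?a].
(defn patch-age-result [result tx-data]
  (reduce (fn [res [e a v _tx added?]]
            (if (= a :age)
              (if added? (conj res [e v]) (disj res [e v]))
              res))
          result
          tx-data))

An attribute update transacts both a retraction and an addition, so the old tuple gets disj'ed and the new one conj'ed in a single pass.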

I believe posh does 1 by caching the whole query; I've not seen it cover 2 or 3.

This is almost certainly outside of datascript's current remit, however I do believe there is value in solving this for those who would like this functionality.

Especially if there was some mechanism by which I could specify queries of interest and then know that they were being incrementally updated.

As I've said, perhaps this is solved elsewhere?

RETE has an answer of a sort to this, but as I understand it, it's focused on detecting whether changes occurred, so that a rule which would change the db can be run again. That is useful from my perspective, but efficiently querying the store is what I think would be helpful here. At the very least, odoyle, which implements RETE, does not seem to act as a store unless a rule is defined for the datom. It also doesn't use datalog, opting to create its own mini-language, which makes things a bit harder on the portability side.

So what I've done so far:

  1. Dug into posh starting from the initial commits and working backwards until I got a minimal working case, this has helped me get a better idea of what it's doing currently.
  2. Read into RETE and looked at odoyle and naga enough to know that even though they do solve some of this, they're more focused on telling whether rules need to be rerun.
  3. Looked at asami, which is promising; however, it is more restricted than datascript. For example, you can't call arbitrary functions using db.fn/call-like semantics, which from my perspective is a really powerful feature of datascript. (To be fair to asami, I can't yet say how much more restrictive it is, just that some of my code as well as a bunch of my queries would require rewrites to port across.)
  4. Started looking at memoising lookup-pattern. Doing this correctly would require some changes, since it takes the literal clause, so clauses would need rewriting where different clauses use different variable names. The major cost is not here, however, but higher up, where the query information is combined with other clauses; again, it should be possible to skip chunks of that work when the results would be the same.

Possible performance gaps:

  1. Datascript uses satisfies? in places such as lookup-pattern, resolve-pattern-lookup-refs, and -resolve-clause in datascript.query. Considering these are on the main query path, it might be worthwhile seeing if there are any speedups to be gained by swapping to an approach that doesn't use reflection.

I've only been looking into this area a little while and I've not written any complete implementations of this, so it's very possible that I've made a mistake in my reasoning somewhere. Otherwise, it would be good to see if we can advance this, as I do believe there's significant value in having a system that covers this.