wangjohn / eager_db

Database management layer for preloading queries.
MIT License

More lazy eager_db #10

Open · zmaril opened this issue 10 years ago

zmaril commented 10 years ago

Neat project! I really like the idea of warming caches based on which queries are likely to follow a given query. I don't like the idea of doing any more work than I already do. I would really just want to sit down, install the gem, specify some learning parameters, and have everything Just Be Faster™.

One way of doing this might be creating Markov chains on the fly from the incoming stream of queries and then firing off queries based on that.

Let's say you see this stream of queries in your logs, roughly laid out as date, parameterized query, and parameters in JSON form. (I also changed one of the queries from your example for a specific reason, explained below.)

[2013:12:7:14:30:00] "SELECT * FROM pies WHERE name = $recipe AND creator = $name AND id = $id" {"recipe":"pecan_pie","name":"grandma","id":234}
[2013:12:7:14:30:30] "SELECT * FROM pies WHERE name = $recipe AND creator = $name AND id = $id" {"recipe":"meatloaf","name":"mom","id":234}
[2013:12:7:14:30:45] "SELECT * FROM pies WHERE name = $recipe AND creator = $name AND id = $id" {"recipe":"pasta","name":"mom","id":24601}
[2013:12:7:14:31:00] "SELECT * FROM pies WHERE name = $recipe AND creator = $name AND id = $id" {"recipe":"apple_pie","name":"grandma","id":234}
[2013:12:7:14:31:30] "SELECT * FROM recipes WHERE id = $id AND creator = $name" {"id":234,"name":"grandma"}

Then I would have specified some learning parameters, such as a windowing threshold. Let's say it's 60 seconds and see what we can get.

The inferred Markov chain I have in mind would be something like the following:

Vertices with query labels:

1:"SELECT * FROM pies WHERE name = $recipe AND creator = $name"
2:"SELECT * FROM recipes WHERE id = $id AND creator = $name"

Edges with probability of occurrence and pseudoparameters from the previous query to pass into the next query:

1->1  1.0 {"name": $name, "recipe": "*", "id": "*"}
1->2  0.5 {"name": $name, "id": $id}  
2->1  0.0 {}
2->2  0.0 {}

And now, whenever a vertex gets hit, you just precache along every edge whose probability is above a certain threshold (or sample each edge according to its calculated probability if you want to get really fancy). One of the big things to note is that the inference seems pretty easy to do if no new information is needed for the next query or if a parameter can be assumed to be *. That's why I changed a query to include the creator's id; otherwise the select-all-recipes query would be much harder to infer (you'd probably have to look at the results of each query and do way too much work). Depending on how smart the Markov chain inference is, it might be able to infer things people never even imagined. I mean, who really wants to read all those logs and derive the queries themselves? Not me, I'm hella lazy.
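
To make that concrete, here's a minimal Ruby sketch of the kind of inference I mean, assuming the log has already been parsed into hashes with :time, :query, and :params keys. Everything here (the names build_transitions and preload_for, the 60-second window, the 0.5 threshold) is made up for illustration, and it only counts immediately consecutive queries, so the probabilities per vertex sum to 1 rather than being per-edge like the numbers above.

```ruby
# Minimal sketch: count transitions between parameterized queries that occur
# within the window, then turn the counts into edge probabilities.
WINDOW_SECONDS = 60
THRESHOLD      = 0.5

# log_entries: array of { time: Time, query: String, params: Hash }, sorted by time
def build_transitions(log_entries)
  counts = Hash.new { |h, k| h[k] = Hash.new(0) }
  totals = Hash.new(0)

  log_entries.each_cons(2) do |prev, curr|
    next if (curr[:time] - prev[:time]) > WINDOW_SECONDS
    counts[prev[:query]][curr[:query]] += 1
    totals[prev[:query]] += 1
  end

  # Normalize to P(next query | previous query)
  counts.each_with_object({}) do |(prev_query, successors), chain|
    chain[prev_query] = successors.transform_values { |c| c.to_f / totals[prev_query] }
  end
end

# Whenever a query comes in, preload every successor whose edge clears the
# threshold (mapping the previous query's parameters into the next query is
# left out here for brevity).
def preload_for(chain, incoming_query)
  (chain[incoming_query] || {}).select { |_query, prob| prob >= THRESHOLD }.keys
end
```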

Anyway, this is a neat project and I'm interested to see what it can do.

wangjohn commented 10 years ago

Haha, guess what I'm currently implementing =).

Yeah, you raise valid points, and that's the next step in the project (it was actually intended to do this all along). The interesting thing is that the prediction part can be implemented completely separately from the infrastructure because of the DSL I've defined.

My implementation of the prediction was actually going to be almost exactly what you said, since it can be parallelized quite easily: breaking the logs into k chunks only loses information at the break points, and the law of large numbers gives you pretty good guarantees that you don't lose too much there.
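
Roughly, the chunked version I have in mind would look something like this sketch (the helper names and the thread-based splitting are just illustrative; with MRI's GIL you'd want separate processes for real parallelism):

```ruby
# Sketch of the chunked inference: count transitions in each chunk
# independently (only pairs spanning a chunk boundary are lost), then merge
# the counts before normalizing.
def count_transitions(entries, window = 60)
  counts = Hash.new { |h, k| h[k] = Hash.new(0) }
  entries.each_cons(2) do |prev, curr|
    next if (curr[:time] - prev[:time]) > window
    counts[prev[:query]][curr[:query]] += 1
  end
  counts
end

def chunked_counts(log_entries, k = 4)
  slice_size = [(log_entries.size / k.to_f).ceil, 1].max
  chunks     = log_entries.each_slice(slice_size).to_a
  partials   = chunks.map { |chunk| Thread.new { count_transitions(chunk) } }.map(&:value)

  merged = Hash.new { |h, key| h[key] = Hash.new(0) }
  partials.each do |partial|
    partial.each { |prev_q, successors| successors.each { |q, c| merged[prev_q][q] += c } }
  end
  merged
end
```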

zmaril commented 10 years ago

Another comment on the above eagerness:

If you only have reads (lots of them), then the above could work pretty well. At any point the cache will be up to date and you won't have to worry about it too much.

If you start throwing writes into the application, then the cache could easily go stale and hold onto old results. For the example application above ("It'll be a GitHub, but for, uh, recipes!"), if grandma forks mom's recipe after looking at it, then the cached results are out of date and you're SOL.

Generally, cache invalidation is, uh, hard, to say the least. I see at least two ways to mitigate this complexity: the smart way and the dumb way.

The smart way would be to create a dependency graph of queries: derive the fact that every time grandma forks one of mom's recipes, the cached queries covering grandma's recipes go out of date and need to be refreshed. This could be inferable and might not be too hard if you're into doing logic on SQL ASTs.
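
A toy version of that dependency idea might look like this, with a naive regex standing in where a real implementation would walk the SQL AST (tables_read and stale_queries are made-up names):

```ruby
# Toy dependency tracking: figure out which tables a cached query reads
# (naive regex standing in for a real SQL AST walk), then mark cached
# queries stale whenever one of their tables is written to.
def tables_read(sql)
  sql.scan(/\b(?:FROM|JOIN)\s+(\w+)/i).flatten.map(&:downcase)
end

def stale_queries(cached_queries, written_table)
  cached_queries.select { |sql| tables_read(sql).include?(written_table.downcase) }
end

cached = [
  "SELECT * FROM pies WHERE creator = $name",
  "SELECT * FROM recipes WHERE id = $id AND creator = $name"
]
stale_queries(cached, "recipes")
# => ["SELECT * FROM recipes WHERE id = $id AND creator = $name"]
```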

The dumb way would be to just bang queries out every N milliseconds. The smarter dumb way is to estimate the distributions and refresh queries slightly before the user issues them. Thanks to the logs, you have good estimates of the time between queries along each edge. As in, you could probably infer that it takes grandma, on average, a minute to read mom's recipe, rage because she is using too much butter, and then decide to fork it. You could then refresh the query a few seconds before you expect grandma to issue it and impress her with your forethought. Likewise, you could expire the cache once you're reasonably confident that grandma has logged off. In between, you could keep refreshing based on some supplied parameter or on information from the smart way above.
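
The timing part might look something like this sketch, where the five-second lead time and the fire_preload block are placeholders:

```ruby
# Sketch of the "smarter dumb way": estimate the typical gap along an edge
# from the logs and refresh the successor query a few seconds before you
# expect it to arrive.
LEAD_SECONDS = 5

# edge_gaps: observed seconds between the two queries along one edge
def schedule_refresh(edge_gaps, &fire_preload)
  mean_gap = edge_gaps.sum.to_f / edge_gaps.size
  delay    = [mean_gap - LEAD_SECONDS, 0].max
  Thread.new do
    sleep delay
    fire_preload.call # e.g. re-run the successor query so the db cache is warm
  end
end

# schedule_refresh([55, 62, 58]) { warm_the_recipes_query_for("grandma") }
```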

I guess it would be predicting the working set that's needed from the previous time series of queries and trying to keep it as up to date as possible with statistics and logic.

wangjohn commented 10 years ago

Right, so cache invalidation is hard, and EagerDB doesn't actually deal with it. That's because EagerDB's "cache" is essentially your original database's cache. This way, you don't actually have to worry about cache invalidation or any of the methods you listed above.

Moreover, this way, if someone comes along with a better implementation of the cache on top of Postgres, you don't have to change anything in EagerDB. Another advantage: if EagerDB kept its own cache, you'd essentially be creating a second cache on top of the database's pre-existing one. Two caches aren't that nice, especially if they hold replicated data, which might not stay completely in sync because of the cache invalidation issues you mention.

EagerDB is just going to send these queries to your database and "warm" the cache by hitting the database with the preload queries (it's not even going to store the results). That way, the next time you issue the preloaded query, the results will already be in the database's cache.

Another reason for this design is that you don't have to worry about cache expiration or tweaking parameters like the amount of memory allocated to the cache -- that's already handled by your database.
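
The shape of the idea is roughly the following (not EagerDB's actual code, just a sketch using the pg gem's standard exec_params call and throwing the results away):

```ruby
require "pg" # standard pg gem; connection details below are made up

# Warm Postgres's own cache: run the predicted queries and discard the
# results, so the relevant pages are already hot when the real query arrives.
def warm_cache(conn, preload_statements)
  preload_statements.each do |sql, params|
    conn.exec_params(sql, params) # results intentionally discarded
  end
end

# conn = PG.connect(dbname: "recipes_app")
# warm_cache(conn, [
#   ["SELECT * FROM recipes WHERE id = $1 AND creator = $2", [234, "grandma"]]
# ])
```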

zmaril commented 10 years ago

Nice. Writes aren't an issue for EagerDB then, or rather, they're only as big an issue as they are for the original database. I blanked on the design for a second and blurred the lines between the prewarming layer (heater?) and the actual db itself.

I'll be curious to see how this turns out. Creating the Markov chain from the stream of events doesn't seem too hard. Simple random sampling would probably let you look at a small fraction of the logs and get a pretty accurate approximation of the graph without messing with parallel code; there are only so many workflows in the application that a user can follow. Then you throw some async calls on every query to tell the heater to warm up the db and you're done.

On another note, I was curious to see what previous work has been done on this. "Predictive caching" turns up some interesting papers (if you avoid the ones on CPU caches). This one led me to this one. The latter is a variation on this idea, in the sense that it develops a Markov chain of sorts for the links in a document database and prefetches based on that. I've yet to find a paper on warming the db cache based on previous queries, though.