mozilla / mentat

UNMAINTAINED A persistent, relational store inspired by Datomic and DataScript.
https://mozilla.github.io/mentat/
Apache License 2.0

Extensible persistence support (e.g. React Native, browser)? #165

Closed alexandergunnarson closed 7 years ago

alexandergunnarson commented 7 years ago

I'm really interested in using persistent Datomic implementations across the board (on the backend on the JVM, and on the frontend in the browser or via React Native). Of course, Datomic is great on the JVM, and DataScript is all right for ephemeral (non-persistent) use cases in any JavaScript environment, but Datomish (now Mentat?) caught my eye as a possible candidate for a general JavaScript Datomic implementation with persistence (something DataScript obviously lacks).

However, it seems that Datomish is focused on NodeJS and Firefox. Is there a way to straightforwardly hook up Datomish to e.g. ReactNative's AsyncStorage or the browser's IndexedDB? I'm totally fine with contributing, but I'm looking to avoid reinventing the wheel ;)

Thanks for your help!

rnewman commented 7 years ago

You're the second person I've seen ask about whether Mentat would be suitable as a persistent datom store for the web.

This answer could get very long, so I'll try to keep it short!

tl;dr: "straightforwardly hook up" — absolutely not.

There are two broad reasons why not: implementation choices and web limitations.

The proof-of-concept version of Mentat was built in ClojureScript, so in principle you could use it in a web context or anywhere else you could run JS… if you don't mind having an opaque 600KB lump of compiled JS to debug, churned out by an inscrutable 60-second build step through Closure Compiler. We didn't like that situation — no way would we be willing to support that code for hundreds of millions of users in Firefox — and so a rewrite was in order. We don't intend to spend any further time on the ClojureScript version, so you couldn't rely on bugs being fixed.

Our rewrite is in Rust, which you could try compiling with emscripten to get JS/WebAssembly that you could run in a browser. But somewhere at the bottom of the stack is SQLite, trying to memory-map files and control fsync calls. Without the deprecated WebSQL, there's no suitable foundation on the web itself.

You could try using the Rust binary as a custom component from React Native. And you could use nativeMessaging in a WebExtension. But that requires a bit more work than just dropping a JS library into your toolchain.

Mentat is quite closely tied to SQLite: it relies on a SQL engine's transactions, constraints, full-text search, and more, and it expects full control over everything for performance. A Datalog query is translated into a single SQL query for execution.
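To make that single-query translation concrete, here is a toy sketch in Python against SQLite. The `datoms` table, the Datalog query, and the generated SQL are all illustrative inventions — Mentat's real schema and SQL generation differ:

```python
import sqlite3

# Toy illustration of compiling a Datalog query into ONE SQL query.
# Each :where clause becomes an aliased scan of the datoms table,
# shared variables become join conditions, and predicates land in WHERE.
# (Hypothetical schema — not Mentat's actual layout.)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE datoms (e INTEGER, a TEXT, v)")
db.executemany("INSERT INTO datoms VALUES (?, ?, ?)", [
    (1, ":person/name", "Alice"), (1, ":person/age", 34),
    (2, ":person/name", "Bob"),   (2, ":person/age", 25),
])

# [:find ?name :where [?e :person/name ?name]
#                     [?e :person/age ?age]
#                     [(> ?age 30)]]
names = db.execute("""
    SELECT d0.v FROM datoms d0
    JOIN datoms d1 ON d1.e = d0.e
    WHERE d0.a = ':person/name'
      AND d1.a = ':person/age'
      AND d1.v > 30
""").fetchall()
```

The point is that the whole query runs as one statement inside the SQL engine, so its planner and page cache do the work.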

That makes Mentat a poor fit for IndexedDB, which pushes an abstraction layer of its own and requires third-party code for FTS. React Native's AsyncStorage is itself layered on top of SQLite (or RocksDB, or flat files!), and is too simple for our needs.

One could imagine a datom store that's backed by IndexedDB or AsyncStorage, in the same way that we're backed by SQLite, but I would be worried about the performance of such a solution.

Could such a datom store share code with Mentat? Maybe. The EDN and query parsers, certainly, but the majority of the code is about implementing transacts, connection handling, and translating queries, and all of that would need to be abstracted and reimplemented… not to mention figuring out how a Rust library is sensibly shipped as a consumer of a JavaScript API.

In the longer term, a good answer might be exposing Mentat through a WebExtensions API, so you could use it instead of IndexedDB.

ncalexan commented 7 years ago

However, it seems that Datomish is focused on NodeJS and Firefox. Is there a way to straightforwardly hook up Datomish to e.g. ReactNative's AsyncStorage or the browser's IndexedDB? I'm totally fine with contributing, but I'm looking to avoid reinventing the wheel ;)

I started to type a message saying you might have a better experience trying to implement the DataScript DB protocol backed by IndexedDB, but then I remembered that IndexedDB has an asynchronous access pattern, which really doesn't play well with DataScript's synchronous implementation. I do wonder if there's some middle ground where you make DataScript-in-CLJS use JavaScript generators so it can be both synchronous and pseudo-asynchronous.
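To illustrate the generator trampoline pattern I have in mind (in Python rather than CLJS, with entirely hypothetical names — this is not DataScript's API), the storage code reads synchronously but suspends at each point where an async backend like IndexedDB would need a callback:

```python
# Sketch of "synchronous-looking but pseudo-asynchronous" storage code.
# The generator yields a request whenever it needs I/O; a driver resumes
# it with the result, the way an IndexedDB callback eventually would.

def get_datom(key):
    # Reads like synchronous code, but suspends at the yield until
    # the driver supplies the "asynchronously fetched" value.
    value = yield ("read", key)
    return value

def run(gen, fake_store):
    # Drive the generator to completion, answering each yielded
    # request from a dict standing in for an async backend.
    try:
        request = next(gen)
        while True:
            op, key = request
            request = gen.send(fake_store[key])
    except StopIteration as stop:
        return stop.value

result = run(get_datom("name"), {"name": "mentat"})
```

Whether DataScript's DB protocol could actually be threaded through such a trampoline is exactly the open question.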

Sadly, I'll concur with @rnewman -- Mentat/CLJS is not built to do what you want, and Mentat/Rust is not going to do what you want either. Sorry!

bgrins commented 7 years ago

Our rewrite is in Rust, which you could try compiling with emscripten to get JS/WebAssembly that you could run in a browser. But somewhere at the bottom of the stack is SQLite, trying to memory-map files and control fsync calls. Without the deprecated WebSQL, there's no suitable foundation on the web itself.

There is prior work that compiles SQLite to JS: https://github.com/kripken/sql.js. Details about persisting the database on a page are at https://github.com/kripken/sql.js/wiki/Persisting-a-Modified-Database and http://kripken.github.io/sql.js/examples/persistent.html.

rnewman commented 7 years ago

Yeah, sql.js hosts a SQLite 'file' as an in-memory byte buffer, relying on the caller to figure out when to dump a few megabytes of byte array into some kind of persistent storage, and to make sure that no concurrent modifications occur.

sql.js is very cool, and it might be fine for relatively small workloads, but because everything is in memory and has to be manually persisted, at that point you might as well use DataScript — you're in the same boat, but the code is already written and is meant to be used on the web.
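The caller-driven persistence model being described can be sketched in Python with an in-memory SQLite database; here a SQL text dump stands in for sql.js's exported byte array, and the table is a made-up example:

```python
import sqlite3

# Manual-persistence model: the whole database lives in memory, and the
# caller must explicitly serialize it to some durable medium and restore
# it later (e.g. after a page reload). sql.js exports a byte array; this
# sketch uses a SQL text dump instead.

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE datoms (e INTEGER, a TEXT, v TEXT)")
db.execute("INSERT INTO datoms VALUES (1, ':person/name', 'Alice')")

# Caller-driven persistence step: serialize everything, every time.
dump = "\n".join(db.iterdump())

# Later: rebuild a fresh in-memory DB from the dump.
restored = sqlite3.connect(":memory:")
restored.executescript(dump)
row = restored.execute("SELECT v FROM datoms").fetchone()
```

The cost is exactly the one raised above: every durable write means re-dumping the database by hand, with no help from the engine's own transaction machinery.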

alexandergunnarson commented 7 years ago

First of all, thank you all so much for your quick and generously long replies!

To @rnewman :

I can't say I'm on board with your reasoning for switching away from ClojureScript, as "an opaque 600KB lump of compiled JS to debug", as you say, could apply to probably many libraries of comparable size, especially ones which use a feature-rich compile-to-JS-language like ClojureScript. I'm not sure that the build/optimization step is as "inscrutable" as you say, or why 60 seconds of compilation/optimization time is an abomination when it often takes e.g. C++ code tens or hundreds of times that to achieve the same result (ignoring incremental builds, which ClojureScript also has via e.g. Figwheel). But to each his own, I suppose :) That said, I do see the raw speed advantage of Rust.

The Rust -> WebAssembly possibility sounds intriguing, but as you say, without hooking into the deprecated WebSQL API, it seems like that wouldn't achieve what I'd want to achieve.

Thanks for making me aware that Mentat is so tightly coupled to SQL. However, from what I've read (and I could be wrong), IndexedDB is pretty performant, right? I can't speak for AsyncStorage, because I haven't investigated it as much, and my guess is that it's not performance-optimized. What makes you worry about IndexedDB's performance as compared to SQLite?

Using the WebExtensions API is probably straightforward, yes, but not the route I'm looking to go for, as I'm looking to avoid plugins wherever possible.

The Rust binary as a custom component from React Native doesn't sound too terrible at all actually. Maybe a little painful / not entirely straightforward the first time around, but once everything is in place, it's just a matter of tweaking e.g. the React Native header files when/if Mentat API changes take place. It's certainly a possibility for non-browser environments.

To @ncalexan :

IndexedDB has an asynchronous access pattern, which really doesn't play well with DataScript's synchronous implementation.

That probably is semi-true, but it's entirely possible to use e.g. the following in ClojureScript:

; under the hood, use IndexedDB's callback with a promise-chan,
; then (potentially) transact against the in-memory DataScript DB
(go (<! (persistent-transact! conn txn-data))
    ...)

in the same way one uses the following in Clojure:

@(transact conn txn-data)

What other concerns do you have about IndexedDB's asynchrony?

To @bgrins:

Thanks so much for your helpful link about sql.js! I'll look into that, even if @rnewman 's point about it having to be manually persisted still holds.

rnewman commented 7 years ago

For context, I've been writing Lisps for over fifteen years, Clojure for seven or eight. I'm no stranger to bad tooling; we wrote Firefox for iOS in pre-stable Swift. Iterative development of Mentat in Clojure wasn't bad. Deploying to two or three different JS destinations with CLJS, and debugging there, was an absolute nightmare.

We had a regression traced to a minor CLJS update, for example, which is terrifying. I had to resort to bisecting, because there's no way I could figure out what was going wrong: heavy use of channels plus a compilation step or two plus inadequate source mapping meant printf debugging, and the bug only occurred in the wild, with a 70-second rebuild with each print statement…

Rust's tooling is pretty great, by comparison.

ncalexan commented 7 years ago

That probably is semi-true, but it's entirely possible to use e.g. the following in ClojureScript:

If you look at the DataScript implementation, you'll see that the DB protocol really needs to be synchronous. I tried to make DS use go blocks, etc., and found it wasn't possible. Others have tried as well, I believe — see https://github.com/tonsky/datascript/issues/190 and other links.

rnewman commented 7 years ago

IndexedDB is pretty performant, right? I can't speak for AsyncStorage, because I haven't investigated it as much, and my guess is that it's not performance-optimized. What makes you worry about IndexedDB's performance as compared to SQLite?

Although I do doubt whether implementing Mentat's ideas on top of IndexedDB would result in a library that was as fast as doing so with direct SQL, my concern is more about the depth of the layers of abstraction.

IndexedDB in Firefox is implemented as a layer (written in C++) on top of SQLite, so adding yet another layer on top for Mentat seems… wasteful. IndexedDB also doesn't give fine-grained control of e.g., result prefetching, page sizes (naturally), vacuum behaviors, and all the other techniques we expect to need. And, as previously mentioned, even though it's SQLite underneath, we wouldn't be able to get access to FTS.
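For instance, the full-text search that's present in the SQLite underneath Firefox's IndexedDB, but never exposed through the IndexedDB API, is one line of DDL when you control the engine directly. A toy sketch (the schema is invented for illustration):

```python
import sqlite3

# SQLite's built-in FTS5 full-text search: available when you talk to
# SQLite directly, unreachable through the IndexedDB layer on top of it.

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
db.execute("INSERT INTO docs VALUES ('a persistent relational store')")
db.execute("INSERT INTO docs VALUES ('an ephemeral in-memory cache')")

# Tokenized full-text match, with ranking available via bm25() if needed.
hits = db.execute(
    "SELECT body FROM docs WHERE docs MATCH 'persistent'"
).fetchall()
```

Reimplementing that (tokenization, indexing, ranking) in JS on top of IndexedDB object stores is exactly the kind of third-party code mentioned earlier.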

alexandergunnarson commented 7 years ago

@ncalexan :

Interesting. I apparently haven't dug into the internals of DataScript as deeply as I needed to in order to make an informed statement on whether it can or cannot be used asynchronously — my apologies. The link you posted is extremely useful but disheartening because what I'd really like is a JS implementation of Datomic with pluggable storage services à la, well, Datomic — except instead of e.g. DynamoDB or Cassandra, it might use e.g. IndexedDB in the browser or React-Native-compatible SQLite or whatever.

I'm not sure how to effectively proceed with a persistent implementation of Datomic in the browser without the bad options of either 1) magically cobbling together a bridge between IndexedDB and DataScript (low likelihood of straightforward progress) or 2) using a plugin for Mentat:Rust (I'm set on avoiding plugins for what I'm doing).

A persistent implementation of Datomic seems more straightforward with React Native, in which custom native modules are possible (cue Mentat:Rust), but (referencing the link you gave me) there would apparently be issues with the "database as a value" during rendering with any asynchronous persistence implementation, and my guess is that a React-Native-ready Mentat:Rust would require asynchrony just like the other options.

@rnewman :

I definitely feel your pain about debugging in ClojureScript. ClojureScript exceptions in the browser can be extremely unhelpful. React Native exceptions are essentially entirely opaque without some version of e.g. printf, as you said; this is certainly also the case even with inadequately-source-mapped ClojureScript in a non-React-Native context (e.g. Node).

As to the rebuild, I thought you were talking about just the advanced compilation build stage usually used for (production) deployment. I, too, have experienced the pain of a "70-second rebuild with each print statement" — actually, probably more like 150, depending on how deeply down the dependency graph I changed something. This is even with Figwheel, which does incremental compilation, as I'm sure you know — Bruce Hauman, the creator of Figwheel, was on Screenhero with me the other day and was shocked at how long it took. I've had to forgo the convenience of having all my code auto-rebuildable and do a lein install on the source paths I wanted to exclude from auto-rebuilding.

To this day, after working on the backend in Clojure for a while, I dread going back to ClojureScript land, where it seems that after creating a certain critical mass of code, my progress is effectively halted by the prohibitive (even incremental) recompilation time.

I've never tried Rust, but I'm glad that you've found a good fit with it! I've only explored it very shallowly, but it looks like a solid language.

alexandergunnarson commented 7 years ago

@rnewman :

I've never really researched why WebSQL was deprecated in favor of IndexedDB (something about wanting multiple storage implementations, i.e. not just SQLite?), but it seems a shame, because yes, all those layers of abstraction really do seem wasteful, as you say. If WebSQL weren't deprecated, would you say you might even recommend using that as Mentat's storage backend in the browser?

rnewman commented 7 years ago

I can't say I'm on board with your reasoning for switching away from ClojureScript, as "an opaque 600KB lump of compiled JS to debug", as you say, could apply to probably many libraries of comparable size, especially ones which use a feature-rich compile-to-JS-language like ClojureScript. I'm not sure that the build/optimization step is as "inscrutable" as you say, or why 60 seconds of compilation/optimization time is an abomination when it often takes e.g. C++ code tens or hundreds of times that to achieve the same result (ignoring incremental builds, which ClojureScript also has via e.g. Figwheel).

600KB is the optimized version. IIRC the unoptimized single-file version was 6MB, which is about as big as Firefox 3's zip. (Devtools wasn't too happy about a few hundred thousand lines of code.) And this from an incomplete early version of the library, only 6500 lines of Clojure. I dread to think how big it would have got with a fully fleshed out query engine, good logging and performance tracing, etc.

The compiled output was also scarily bad: duplicate argument names within the same JS function, for example. And Closure Compiler is clearly intended to target a web context: it typically depends on run-time user-agent sniffing in navigator to choose how to implement features. This is fragile stuff, and not well suited to our needs.

For comparison: my machine builds all of Firefox — something like six million lines of code — in fifteen minutes, and can do an artifact build of Firefox for Android (300K lines of Java) in a minute or two. Spending 70 seconds just to produce Mentat was baffling.

The idea that we might get a crash or a bug in release and have essentially no usable stack was a strong disincentive. And if someone who's been writing Common Lisp and Clojure since some of his coworkers were in grade school can introduce subtle bugs¹ that take hours to track down, I dread to think what coworkers who are used to other languages might experience.

¹ My 'favorite' was accidentally transposing the arguments to put! inside an error handler within a go block. An unrelated change somewhere else, fifty commits later, caused an error like does not implement IChan with no useful stack. Hooray for dynamic typing! I needed to sleep on that one and printf-debug it.

I've never really researched why WebSQL was deprecated in favor of IndexedDB (something about wanting multiple storage implementations, i.e. not just SQLite?), but it seems a shame, because yes, all those layers of abstraction really do seem wasteful, as you say. If WebSQL weren't deprecated, would you say you might even recommend using that as Mentat's storage backend in the browser?

Web standards mostly don't encourage standardizing a single, specific implementation; from that perspective I agree with deprecating WebSQL. If it were possible for web content to run arbitrary SQL on a database, with control over transaction boundaries, then yes, one could absolutely use it as a storage backend for Mentat.

alexandergunnarson commented 7 years ago

@rnewman :

IIRC the unoptimized single-file version was 6MB, which is about as big as Firefox 3's zip.

I agree that that's crazy (though I've had ridiculous code sizes in my own experience too, like ~18MB unoptimized). That said, the "only 6500 lines of Clojure" probably required several large libraries, not the least of which was the sizable Google Closure Library. Too bad Closure's tree-shaking claims don't really hold up here, probably because of the dynamism inherent to ClojureScript.

duplicate argument names within the same JS function, for example.

I've never seen that. That example in particular is definitely fragile and scary as you say. The Closure Compiler depending on "run-time user-agent sniffing in navigator to choose how to implement features" seems pretty broken to me... not sure why the maintainers haven't fixed that "small detail".

Fifteen minutes to build six million lines of code is amazing — you're right. With a comparison like that, it makes sense that 70 seconds is more than a bit much unless you're e.g. macroexpanding the world.

As to your footnote (I liked that you had a footnote by the way haha) I stood staunchly by dynamic typing for a long time until a few months ago when I finally got tired of the age-old process of looking at functions, wondering what precise shape the inputs take, looking at docstrings, analyzing the code, and experimenting in the REPL to find out "for sure". Over the course of my Clojure experience, that alone must have added up to hundreds or even thousands of hours in all. I love clojure.spec's approach, because it goes beyond static typing to provide guarantees that static typing, to my knowledge, can't (or at least, not without creating a combinatorial explosion of types). That said, something like Haskell's type system has always intrigued me, as finding bugs at compilation time is orders of magnitude better than finding them at runtime, or worse, in production.

All in all, I could never see myself switching away from Clojure(Script) to e.g. Rust because I value programmer productivity, macros, and to a lesser degree, syntactic purity, more than I do the latest trends of raw performance. Then again, I don't (yet) have as burdensome performance requirements as others might, and as you probably do. My hope is that I can write a function in one place (granting necessary rewrites for correctness or performance) and use it anywhere, in any language context (using source-to-source transpilation à la ClojureScript or bytecode-compilation à la Clojure) or environment (barring the more extreme ones like embedded, in which case all bets are off).

rnewman commented 7 years ago

seems pretty broken to me…

The CC maintainers are very change averse (each potential change gets tested internally for its Google app consumers before being approved), and CC's primary audience is multi-browser web JS, so I chalk this up to "ain't our use case", which is fair enough. But it does reinforce the feeling of not having the tools on our side!

That said, something like Haskell's type system has always intrigued me…

We've been very pleased with Swift for our iOS work; it has a very rich type system, and we've built some things with it that would have been very buggy or hard to navigate if written in JS.

Rust ain't Haskell — it chooses predictability, stack allocation (particularly of return types), and performance over expressiveness — but they're both worlds ahead of having no type checker at all.

metasoarous commented 7 years ago

One thing that stands out as suspect to me here: Are you really planning on compiling arbitrary datalog queries directly to SQL? My understanding is that datalog is strictly more expressive (particularly as relates to rules, recursion and such), and that this would therefore be impossible. Unless SQLite has some black magic I'm not aware of... Given that, it seems like you'd have to implement some components of a more classical datalog system, or only support a sort of "SQL compatible" subset of datalog. Assuming the former, I imagine you could translate whatever portions of a datalog query you could into SQL, then implement the rest in terms of those intermediate results/relations using standard techniques for executing datalog queries.

With that in mind, and the comments above on problems with transactionality and asynchrony, I'm a bit confused why something more analogous to Datomic's model of evaluation wouldn't make more sense, and open up more room for generality at the storage level?

rnewman commented 7 years ago

All of the queries we've written so far are compiled into a single SQL query, yes. Modern SQL supports recursive CTEs, and SQLite supports user-defined functions, so we have a lot of flexibility… but we haven't even needed that yet.

and/or joins, filters, bindings, fulltext, and aggregates can all be translated into a non-recursive SQL query.
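The two SQLite features mentioned above — recursive CTEs and user-defined functions — look like this in a toy Python sketch (the ancestor table is invented; this is not Mentat's schema or generated SQL):

```python
import sqlite3

# Recursive CTEs cover the transitive-closure queries that Datalog
# rules express; user-defined functions let the host language extend
# what a single SQL query can compute.

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE parent (child TEXT, parent TEXT)")
db.executemany("INSERT INTO parent VALUES (?, ?)",
               [("c", "b"), ("b", "a")])

# Transitive closure ("all ancestors of c") via WITH RECURSIVE.
ancestors = db.execute("""
    WITH RECURSIVE ancestor(x) AS (
        SELECT parent FROM parent WHERE child = 'c'
        UNION
        SELECT p.parent FROM parent p
        JOIN ancestor ON p.child = ancestor.x
    )
    SELECT x FROM ancestor ORDER BY x
""").fetchall()

# A user-defined function, registered from the host and callable in SQL.
db.create_function("shout", 1, lambda s: s.upper())
loud = db.execute("SELECT shout('datalog')").fetchone()[0]
```

Both still execute as single statements inside the engine, which is why the recursive fragments of Datalog don't force a fallback to iterative host-side evaluation.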

Some features, like pull expressions within :find queries, might be most easily implemented in a two-stage execution process. (Many simple pull expressions can also be implemented as a stream rollup on a single SQL query, of course.)

Other features like user-defined rules might require iterative or recursive query plans, using temporary tables or in-memory joins. We haven't got to that point yet.

The query executor is able to control transaction boundaries, and will have exclusive ownership of a connection, so SQLite takes care of isolation. Queries can be as asynchronous and iterative as we like, and the right result will drop out at the end.

Datomic's model is essentially: replicate index chunks over a network, and locally interrogate the index chunks to get answers. It's very memory intensive, and it's designed for the lowest common denominator of storage.

Doing so loses the principal advantage we have as an embedded data store, which is taking full advantage of SQLite and its excellent page cache, VFSes, and isolation. We absolutely could build a chunk-based system and write all of the stats, query planner, explainer, metaindex primitives, and index walks ourselves — it would take us longer to build, it would run queries more slowly, it would be buggy, and we would still use SQLite as a backing store. Perhaps one day we'll decide to spend the time on our own storage backend; I wouldn't bet on it.

(I spent several years implementing parts of AllegroGraph, which is a chunk-based tuple storage system, so I have some familiarity with this.)

ncalexan commented 7 years ago

On Thu, Jan 19, 2017 at 12:01 PM, Christopher Small <notifications@github.com> wrote:

One thing that stands out as suspect to me here: Are you really planning on compiling arbitrary datalog queries directly to SQL? My understanding is that datalog is strictly more expressive (particularly as relates to rules, recursion and such), and that this would therefore be impossible.

I cannot find a reference right now, but it is a theorem that Datalog with appropriate recursion restrictions is equivalent to SQL with certain recursion extensions. We can easily target these subsets and get what we need.

metasoarous commented 7 years ago

@rnewman @ncalexan Thank you both for the clarifications. I had never encountered SQL+recursion, but intuitively it makes sense that would do the trick.

I am saddened to see that things don't look so peachy for a durable, client side datomic/datascript/datomish/whatever clone coming out of Mozilla. But I'm hopeful and encouraged that your work is pushing the general model further into the mainstream.

Cheers