seriousme opened 9 years ago
took me a while to digest it all but this seems like a very good start! My first question is whether we should have (similarly to the Flux architecture) a Dispatcher component that digests internal collection events and publishes them. In other words, I'd like to decouple consumers from the Collection; possibly the DVs are also going to be consumers?
Secondly, when you say that Collections answer simple queries, do you mean the query engine does a "pass-through" to the Collection to return a particular record?
Another element I would like discussed is caching: a component that memoizes the results of frequent queries. In JS it is very easy to do caching:
function process(arg) {
  // some complex stuff
}

function memoize(fn) {
  var cache = {};
  return function (arg) {
    if (!(arg in cache)) {
      cache[arg] = fn(arg); // compute once, reuse on every later call
    }
    return cache[arg];
  };
}

var processCached = memoize(process);
The only adaptation would be to have a dirty flag on the collection data to flush the whole cache. Opinions, suggestions and lapidation all accepted :)
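A minimal sketch of that flush, assuming a hypothetical collection.version counter that every write bumps (a counter avoids several caches fighting over a single shared boolean flag):

function cachedQuery(collection, queryFn) {
  var cache = {};
  var seenVersion = collection.version; // hypothetical counter, bumped on every write
  return function (arg) {
    if (collection.version !== seenVersion) {
      cache = {}; // data changed since we last looked: flush everything
      seenVersion = collection.version;
    }
    if (!(arg in cache)) {
      cache[arg] = queryFn(arg);
    }
    return cache[arg];
  };
}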
I refrain from lapidation contests with Irishmen, too much quarry :)
Is a 'database' or orchestration-level entity implied above concerning logs/adapters/syncs? Each collection has typically been an isolated island with few I/O obligations other than serialization. Since this is still early I'm not certain if this is intentional or not.
An event dispatcher would be valuable imho, as it basically glues the logic together and it's a well-known paradigm. OTOH you could also code the event dispatch by hand and get even more perf. E.g. say you have an update. You could then: a) have an event dispatcher which distributes the events to the various adapters, or b) just write some code to do the update, take the result, call the adapter, etc.
option a) is cleaner (config over code) and easier to understand for non-insiders; option b) however is imho typically faster (no dispatcher overhead ;-))
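For concreteness, option a) could look roughly like this with node's built-in EventEmitter (the adapter names are invented):

var EventEmitter = require('events').EventEmitter;
var dispatcher = new EventEmitter();

// adapters subscribe to the events they care about
dispatcher.on('update', function (record) {
  logAppendAdapter.append('update', record); // hypothetical adapter
});
dispatcher.on('update', function (record) {
  syncAdapter.push(record); // hypothetical adapter
});

// the collection only publishes and knows nothing about its consumers
collection.update = function (doc) {
  // ...apply the change to the in-memory data...
  dispatcher.emit('update', doc);
};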
I would only use cache if I really had to as it tends to complicate stuff.
Nearly forgot ;-).
with simple queries I meant:
The query engine would resolve more complex logic like "(&(name==joe)(age>25))". This is where the smarts sit that decide how such a query gets executed.
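For illustration, the kind of decision that engine would make for that filter (collection.data is the raw document array; byIndex is a hypothetical index lookup):

// the filter (&(name==joe)(age>25)) as a plain predicate
function matches(doc) {
  return doc.name === 'joe' && doc.age > 25;
}

// naive plan: full scan
var results = collection.data.filter(matches);

// smarter plan: resolve the most selective clause through an index first,
// then post-filter the much smaller candidate set
var candidates = collection.byIndex('name', 'joe'); // hypothetical
var results2 = candidates.filter(function (doc) { return doc.age > 25; });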
Btw: It might also be convenient to define a list of principles/tools to use: e.g.:
Principles:
Tools:
(I've just used my imagination, you might want to list something totally different here ;-))
I agree with all of it (including the choice of framework, I'm a gulp/karma/istanbul fan) except for (partly) ES6. I find the whole situation of node.js using unsupported versions of v8 and needing flags for ES6 support completely stupid. In that respect I'm happy io.js was forked. But aside from the politics of that situation:
The wisest thing at this point in time is probably an internal "policy", along the lines of:
The jury is still out on class, import and export (I like the require syntax, and if we're using browserify I'm not sure we need import/export) and on Symbol (support is meh). All thoughts on this are welcome! For tidiness though I'll open another issue #7 to make sure we can keep the conversation strictly on an architectural/design level in this thread. @seriousme thanks for the clarification on the query.
I agree ES6 support is spotty. Therefore I would try to use Babel to tackle that one. Anything that needs polyfills is probably not suitable for now.
Wrt low-level code: I would focus on getting a working prototype first and then see if additional optimization is worth it. My guess: no ;-).
A smart algo always outperforms any code optimization :-) E.g. try beating a hashmap lookup with asm.js ;-)
But then again you might surprise me :-)
@seriousme yeah on point about asm.js - I should have clarified I meant for optimizations not for a very nerdy form of masochism. Will comment on Babel in #7
Another bit of inspiration :-) http://www.slideshare.net/wiredtiger/
Which is the engine under MongoDB 3.0
Some notable points:
And since its open source, its possible to peek under the hood to figure out how they do it: https://github.com/wiredtiger/wiredtiger
First glance:
(and a whole lot of other stuff ;-))
The team that created this also created BerkeleyDB.
@seriousme This is quite cool, I'd never heard of bloom filters so I'm off reading about them! At some point we're going to have to decide how simple (or complex) we want things to be, but it's good to gather all ideas before typing a single character of code!
@techfort not going to participate in discussing details but dropping another link for your idea gathering process ;)
http://blog.confluent.io/2015/03/04/turning-the-database-inside-out-with-apache-samza/
As the "LogAppend style" was mentioned here already, I guess you guys already know about the link/concepts anyway but with a little chance there's still something of value.
Actually I wonder - and this can be considered a question to you - how tools such as Kafka and Samza fit into the picture. Would it make sense to combine them with Loki or probably just borrow ideas from them?
@ArnoBuschmann thanks for the suggestion, and feel free to participate - this is an open discussion for everybody. I admit I had always dismissed Kafka as a slower alternative to ZeroMQ, but now that I look at some docs I see this very interesting statement: "Apache Kafka is publish-subscribe messaging rethought as a distributed commit log." This certainly sounds like something that Loki could very much benefit from. In full Loki v2 philosophy it would be exceptional if there were a messaging adapter supporting various systems, but I know too little of Kafka at the moment, so I guess the first step is to educate myself on the subject.
The style of drawing in the article reminded me of another post on the same blog: http://blog.confluent.io/2015/01/29/making-sense-of-stream-processing/
A database is just an aggregated view of your event stream, which fits nicely with the idea of the collection as an aggregate of change events.
@seriousme yep, both presentations were made by Martin Kleppmann indeed.
In this video Kleppmann describes how Samza takes the Kafka output and acts as a stream processor to create such collections of change events, and also how they get merged/enriched:
https://www.youtube.com/watch?v=yO3SBU6vVKA
Am I right to guess that what Samza is doing, could be accomplished by Loki as well?
@ArnoBuschmann comparing Loki to Samza might be a bit ambitious, but the idea is indeed that, as with any database, change events lead to a persistent aggregate. Now, whether you copy that aggregate by reprocessing events or by another, more batch-oriented, db replication mechanism is conceptually the same. The difference is only in timing.
@ArnoBuschmann @seriousme I watched Kleppmann's presentation and I am very impressed with the concepts, so thanks for sharing. In my opinion the current state of Loki defines it as a client-side database (despite my ambitions). My ultimate goal would be for Loki to be used in client applications that keep a completely distributed database going. In fairness, on the server side, there is no point in trying to get a JS database to compete with MongoDB for performance, stability and maturity.
Theoretically this distributed architecture would allow the existence of a database without a single storage point (an extremely fragile scenario btw); it would also make it impossible for someone to "join" the network, because they would have no log to pick up to sync their local data.
So: we need a Log. But I wonder, should this not be a standalone server-side application, that may well be using Kafka/Samza, effectively a product that only needs to exist in a distributed network of Loki clients, and doesn't necessarily need to be written from scratch; it could just be a plain Hadoop/Samza type of application.
Having said all that, a couple of interesting ideas are emerging: Loki clients should be topic producers and consumers, with each collection's changes being a stream produced and sent to a stream-processing application, and each client receiving a processed stream (why processed? because if a single object in a collection changed 3 times since the last time you checked, you only need the latest version of that object, which could be an aggregate of several events on the stream).
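A sketch of that compaction, assuming an ordered stream of change events that each carry the object's id and its full latest state:

function compact(events) {
  var latest = {};
  events.forEach(function (ev) {
    latest[ev.id] = ev; // later events overwrite earlier ones; deletes survive as tombstones
  });
  return Object.keys(latest).map(function (id) {
    return latest[id];
  });
}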
So we could make Loki stream-format agnostic and implement different StreamProcessingAdapters (just to cater for different replication mechanisms, with the default being a plain stream of ordered changes to be applied to the local version of the db to sync with the rest of the network).
If anybody wants to jump aboard / create this Loki Stream Processing server component I'm all ears :) But it should probably be separate from LokiJS, what do you think @seriousme @ArnoBuschmann @obeliskos?
Having a memory-only DB might sound scary, but as long as you have enough copies of the data running it's just as safe as having a DB persisted to disk, and as a bonus you get "always on" as well :-) A number of in-memory DBs have this model where disk-based persistence is optional or even absent (e.g. https://ramcloud.atlassian.net/wiki/display/RAM/RAMCloud ).
I agree with your statement about competing server side with MongoDB for performance, stability and maturity.
The challenge with event-based replication is with consumers that start listening mid-flight. E.g. you already have an event stream running for months and suddenly you decide to add an extra consumer. Now there are 2 options: a) the new consumer is initialized by replaying all events since the epoch, or b) the consumer is initialized with a recent copy of the aggregate (= consolidated events) and then fed with a stream of events that occurred after the creation of the aggregate (the state of the aggregate could be transmitted as a stream of update events as well :-)).
If there is any significant percentage of update-on-update events then option b) will reduce the number of events to be processed. (btw: this is how most databases I know replicate, if they have continuous replication ;-))
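Option b) in pseudo-API form (all names hypothetical; a real implementation has to hand over from replay to live tail without dropping or duplicating events):

function initConsumer(stream, consumer) {
  var snapshot = stream.latestSnapshot(); // consolidated aggregate plus its log offset
  consumer.load(snapshot.state);
  stream.replayFrom(snapshot.offset, function (event) {
    consumer.apply(event); // only the events that occurred after the snapshot
  });
  stream.subscribe(function (event) {
    consumer.apply(event); // live tail from here on
  });
}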
The tricky part of the "processed stream" mentioned by @techfort above is that in a stream based world you don't come back to "check". You just swallow events.
So if Loki were an event database, then as soon as you subscribe to an event stream Loki should start to stream you the aggregate (like the DV queries the current state) and after that Loki should just continue to pass you the (filtered) events (unlike the DV, which updates its own aggregate). The consumer (which might be another Loki event database, locally or on the other side of the planet) would then use this event stream to do whatever it needs to do with it (update the DOM, update its internal aggregate, feed other consumers behind it, switch lights, make coffee, whatever ;-))
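In sketch form (the API is invented), the nice property being that the consumer code is identical for both phases:

loki.subscribe(filter, function (event) {
  // phase 1: Loki streams the current aggregate as a series of inserts
  // phase 2: Loki keeps streaming the filtered live changes
  switch (event.operation) {
    case 'insert': view.add(event.data); break;
    case 'update': view.replace(event.id, event.data); break;
    case 'delete': view.remove(event.id); break;
  }
});

Here view stands for whatever the consumer maintains: DOM, internal aggregate, coffee machine...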
@seriousme how about this (not saying I'm innovating anything here, it has probably all been done already, and done better): each event stored in the stream (on the server) would be

{ id, timestamp, data }

with data being some representation of the change, like the current changes API. Is this crazy?
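For concreteness, such an event might look like this (values invented, data shaped like the current changes API output):

{
  id: 42,                   // position in the stream
  timestamp: 1428393600000, // ms since epoch
  data: {
    name: 'users',          // collection the change belongs to
    operation: 'U',         // 'I' / 'U' / 'R', as in the changes API
    obj: { name: 'joe', age: 26, $loki: 7 }
  }
}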
Another crazy idea that I'm thinking of: Loki could work as an in-memory interface for an underlying MongoDB database. That way, you could have DynamicViews, the Changes API and all other Loki goodies (including in-memory speed) with the resilience and on-disk performance of MongoDB. I suppose the in-memory data would only be MRU data or similar (you can't realistically load a MongoDB in memory unless it's tiny, in which case you don't need MongoDB :D). Is this actually madness?
@techfort: the first 4 points sound logical to me :-) For the second part of your post: I can imagine Loki as a javascript-based "filtered replica" of a MongoDB, e.g.:
1) the browser app does a query on Loki
2) Loki passes on the query to MongoDB
3) Loki stores the result in a local Loki collection
4) Loki watches the MongoDB oplog for updates
5) as soon as updates appear, Loki filters them (like currently done with DV) and updates its local datastore
From there on, Loki could:
or
The mechanism could be setup in such a way that one could make this work for Mongo, Couch, Redis etc by supplying a relevant adapter.
This way Loki gets the role where it can shine: it's lightweight, easy to use and plays nicely with the big kids :-)
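A rough sketch of steps 1-5, using the real Loki collection API but eliding the MongoDB connection and the oplog cursor setup (the op codes 'i'/'u'/'d' are what the oplog actually uses):

var loki = require('lokijs');
var db = new loki('replica.db');
var users = db.addCollection('users');

// steps 1-3: run the query against MongoDB once and seed the local collection
mongoUsers.find({ active: true }).toArray(function (err, docs) {
  docs.forEach(function (doc) { users.insert(doc); });
});

// steps 4-5: tail the oplog and apply only the updates that match the filter
oplogStream.on('data', function (op) {
  if (op.ns !== 'mydb.users') return; // not our collection
  if (op.op === 'i') users.insert(op.o);
  // 'u' and 'd' would locate the local doc by _id and update/remove it
});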
Ok, I'm loving this last part :) 3 questions.
The discussion is really interesting but I'm not losing sight of what Loki is and does, so I want to make sure to have a minimal core that fully complies with the current spirit of LokiJS, and ship everything else as optional modules.
My PoV:
1) yes, however you could argue that for a 2.0 the DVs could be called through an adapter (making Loki even more lightweight/faster for those who do not use DVs ;-)), but that's a choice.
2) yes, as long as Loki offers a way to pick up change events (e.g. the changes API) and a way to process updates (the current create/update/delete methods) it would be perfectly possible to make this work.
3) that is up to the writer of the adapter :-) As long as Loki stores an Id and a Rev it's always possible to reconcile with any backend database. (btw: Id and Rev generation might need to be pluggable as well for that to work, as different DBs seem to have different algorithms for that.)
Btw: one could also argue that the whole backend comms should be part of the app using Loki, and Loki itself only facilitates the CRUD and query stuff. It all depends on the ambitions for Loki ;-)
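A sketch of why Id + Rev is enough to reconcile (helpers hypothetical; how revs compare is exactly the backend-specific part that makes pluggable generation necessary):

function reconcile(local, remote) {
  if (local.rev === remote.rev) return local;              // already in sync
  if (isAncestorOf(local.rev, remote.rev)) return remote;  // remote is strictly newer
  return resolveConflict(local, remote);                   // diverged: needs a strategy
}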
Regarding your last point: I believe providing Loki adapters to other architectural components is what will make people fire up a VM and give it a go. They'll see it works and go "holy _!" with _ being a variety of English 4-letter swear words :) Friendly API, lightweight, fast: those are the mantras.
@ArnoBuschmann thanks again for that link about Apache Samza. It is uncanny how LokiJS adopted so many of the patterns explained, and without prior knowledge of this. I believe Loki may well develop into a node.js-ecosystem equivalent, or at the very least a similar product, acting as a fast access layer of materialized views on top of a replicated node of a traditional db.
Hey guys, you were busy and I just read my way through. I like the design and I see so much potential for Loki to develop with the ideas discussed.
Developing from what is already there in well defined steps towards an enhanced system and keeping dependencies (databases, stream processors etc.) decoupled really is the way. Improve functionality by adding whatever system (Mongo, Kafka/Samza...) with adapters but be able to use Loki also without.
Concerning adapters I think two things would be beneficial:
My point is to implement an "easy to get started" strategy and help new users with proper documentation, tutorials and blogposts (these can be made by others, but having a place to link them helps). It's common that developers want to write code and "don't have time" for the documentation, especially as good documentation requires a lot of additional thought and work. It's important to have people on board who actually like caring about explaining things. @jrhicks already did a great job with his blogposts :)
@techfort Yes, it's uncanny but it's "in the air" I guess :) points at the multiple discovery hypothesis -> http://en.wikipedia.org/wiki/Multiple_discovery
this discussion is great. :smile_cat:
I believe Loki may well develop into a node.js-ecosystem equivalent, or at the very least a similar product, acting as a fast access layer of materialized views on top of a replicated node of a traditional db.
:+1:, even better if it integrates well with the level ecosystem.
And another source of inspiration: https://github.com/bevry/query-engine It does not seem to have indexes and is written in coffeescript, but the demos and the docs look quite OK.
@techfort Multiple discovery hypothesis, part two -> You opened this issue https://github.com/techfort/LokiJS/issues/109#issuecomment-88466572 with the words
Create a flux store that utilises LokiJS.
After we discussed design ideas for Loki 2 in this thread, today I remembered that Pete Hunt speaks in this video about what he calls "full stack flux" for React: https://www.youtube.com/watch?v=KtmjkCuV-EU&list=PLb0IAmt7-GS1cbw4qonlQztYV1TAW0sCr&index=8 And guess what? Everything is EXACTLY the same again as we discussed it here.
What Pete Hunt didn't actually mention is the availability of a replicated local DB, and that is exactly the niche for Loki to step into. Previously you told me that my idea of saying Loki can make isomorphism easy, so one could render no matter if on the server or the client, sounds futuristic, but the more I think about it, I'd say it should work elegantly like this:
Sounds like a huge step to me.
@ArnoBuschmann yep this sounds great, thanks for the link which i'm going to check out immediately.
I am thinking (but it's only a thought) to either force the user to declare the environment (at the moment there is a config option env which takes NODEJS | CORDOVA | BROWSER) or to do different builds for node.js-based environments (eg. node.js, NW.js, cordova) and the browser. And yes, from there on, it would either fetch or render.
Automatic environment detection gave us a few headaches and I'd rather have a more robust approach in v2, even if it means 1 more line of code for the developers.
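With the config option mentioned above, that one extra line would be something like:

var loki = require('lokijs');
var db = new loki('app.db', { env: 'BROWSER' }); // or 'NODEJS' | 'CORDOVA'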
@ahdinosaur could you elaborate a bit more on the leveldb ecosystem, how do you see Loki and level integrating?
This thread (and the preceding one) is genius~
My use case is simply a central db (lokijs on nodejs - there ya go @techfort) that I want to realtime-sync to the client (reactjs). I was just pondering the sync issue (and yeah, I read Kleppmann's stuff recently too), which made a lot of sense. Currently I have it crudely working by serializing the data into the page and then doing updates on channel events via websocket. Even this is not foolproof, as there is a small window before connection where you could miss messages. A log approach sounds tempting~
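A sketch of closing that window with a log offset (the seq field and the replay channel are invented):

var lastSeq = bootstrapData.seq; // serialized into the page together with the data

socket.on('connect', function () {
  socket.emit('replay', { since: lastSeq }); // server resends anything missed
});

socket.on('change', function (ev) {
  if (ev.seq <= lastSeq) return; // already applied (replay may overlap)
  applyChange(ev);
  lastSeq = ev.seq;
});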
Couple of random links
https://www.firebase.com - I presume everybody is aware of these guys. I haven't seen an open source clone of this yet, I guess lokiJS might be the first :)
https://news.ycombinator.com/item?id=9328006 - mesh.js (previously crudlet.js)
Has an adapter for lokiJS. Seems more like it's suited for a remote api / local data cache store scenario rather than for realtime sync, but this comment is interesting:
"It would be awesome if it persisted all the operations to a log. That way, when I attach a new endpoint I could get it "caught up"."
hehe
@hampsterx Nice, mesh.js looks interesting!
@techfort As the project name changed from crudlet to mesh, you might want to change the naming for the Loki adapter? I created a pull request for the readme, but be careful ;) and check it as this is the first Github pull request I ever did, tehe.
https://github.com/ArnoBuschmann/mesh-loki/compare/master...ArnoBuschmann-patch-1?quick_pull=1
hey @ArnoBuschmann thanks for that PR - everything looks good - however I'm not the owner of mesh-loki, mojo-js is :) that's to say, it's up to mojo-js to merge it. @hampsterx thanks for mentioning mesh and mesh-loki. I found mesh very early on (as in, I was stargazer number 25 or thereabouts) and asked Craig (@crcn) to create a mesh-loki adapter; the guy is so fast I'm not sure I had finished asking before it was done!
And as for your use-case: that's precisely what I'm trying to address, so any ideas on the subject are going to be more than welcome! So far, here's what is forming in my mind:
I don't like lists with more than 7 elements so any element from this on will be number 7.
Nice article on the subject can be found here: http://www.benstopford.com/2015/04/07/upside-down-databases-bridging-the-operational-and-analytic-worlds-with-streams/
TL;DR: Looking at the pros and cons of externalising caches, indexes, materialised views and asynchronous streams of state.
The most promising solution seems to be: a synchronous writeable view at the front; a range of different read-only views at the back, running asynchronously to one another; an event stream tying it all together with a single journal of state; side-effect-free functions that (re)generate different views from the stream; a spout for programs to listen and interact. All wrapped up in a single data platform, a single joined-up unit.
I am very intrigued by this approach, and it seems to me that a single product covering the entire stack is entirely feasible. Whether that's Loki or something much bigger and more ambitious is open to discussion, but as far as I'm concerned, I'm even up for that challenge. At least I'd get to use a strongly-typed language for that (with my preference being C++ :D)
I think your last summary explains it all; it is pretty much what we agreed upon so far, so a design has definitely emerged in my view. Again, picking up from where you left off, I'd re-iterate the importance of functional programming, not only in the sense of giving importance to higher-order functions, but also in stressing the side-effect-free philosophy.
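The side-effect-free philosophy in one line, so to speak: a view is just a pure fold over the event log (applyEvent being any (state, event) -> newState function):

function materialize(events) {
  return events.reduce(applyEvent, {});
}

// replaying the same log always yields the same view, so views can be
// regenerated, cached or thrown away at will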
Based on these, a Loki(JS) can be designed and developed now, with the details of the implementation of each module to be discussed in separate threads. Not that there is any rush, obviously. @seriousme @obeliskos what do you think?
Dropping in late here, but if ES6 features are deemed desirable but ill-supported, there's always coffeescript.
First try on a design:
The idea is that:
The Collections
The query engine
Emitted updates can be used by an adapter to:
Updates can be initiated by:
To avoid long loads during LogAppend restores, an adapter could LogAppend to disk, start a new log, do a save of the collection and remove the old log (on success). This will ensure maximum robustness of disk-based persistence.
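In sketch form (adapter API invented):

function checkpoint(collection, adapter) {
  adapter.startNewLog(); // new appends go to a fresh log from here on
  adapter.saveSnapshot(collection.serialize(), function (err) {
    if (!err) adapter.removeOldLog(); // drop the old log only after a successful save
  });
}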
Fire away ;-)