(WIP) Kappa5 - Githubissues

Frando commented 4 years ago

This PR pulls in the current state of my kappa5 branch. It has been talked about a bit already: It changes kappa-core to be dependency-free (it just connects sources to views) which should make it much easier to support flexible indexing flows and scenarios. And then it includes sources for hypercore and multifeed (and hyperdrives!).

This is not completely ready yet i think, but I wanted to put it up for review and discussion, and to agree on the best way on.

README with the new API

Open questions and missing features

State handling. In current kappa, the views need to provide storeState/fetchState handlers if they want to persist the indexer's state. I don't think this should stay with the view, as now a view can have many sources. There's two parts of state here: 1) the view version to trigger a rebuild on a version change 2) the indexing progress. For 2) a more complex (sparse) indexer/source would track the progress on its own (eg in bitfields), a simple source could still make use of a buffer to store its state.

So my current thinking is: Have the kappa track the state for view versions, and a buffer per flow (source instance-view combo), by default in-memory. And allow to supply storeState (key, state, cb), fetchState (key, cb) opts to the kappa-core. We could then also ship eg a simple implementation with tinybox, then only a random-access-storage instance would have to be passed into the kappa for persistence.
Naming: do people like the source term here? I was thinking if indexer would be better, but am not sure. Like, the createSource function does create a source for the kappa, which usually is a function that indexes a set of feeds or other datastructures. So creating a source for the kappa does usually not create datastructures, but only the function that indexes them.
Backwards compatibility: Currently there is the kappaClassic function in index.js that wires the new kappa-core together in an API-compatible way to the current kappa-core. I mostly did this for testing - it passes the cabal-core tests. However, this is based on current multifeed, which means hypercore 7. So actually, I'd propose to not have that, and have a backwards incompatible change.
What to include in kappa-core? Should kappa-core be just the kappa-core, or also include a set of useful sources? (the modules in /sources)

hackergrrl commented 4 years ago

This is super cool @Frando. I'm still thinking through everything, but here are some initial questions:

How important is supporting multiple sources flowing to a view? I'd be interested to hear about specific usecases you have in mind. If it's something that's not strictly needed I think it could simplify the code a fair bit.
This new setup assumes that sources will store their own state (what they've indexed / not indexed). This would have to happen in their pull() implementation. However, this is a bit tricksy, since the source would only be able to store state during the part of control flow where the view has not yet indexed the messages. This makes it hard for a source to update its state right after the view processes a batch. I think it'd be cool to see an example of a source that uses disk storage, to look at together & think through the implications of this.
Since we could imagine certain sources wanting to manage their own state in a special way (like a sparse indexer using a bitfield-db), maybe the pull() api can just be pull(next), and we assume it manages itself.
I like the idea of having a source manage both its own state and its version info. That way fewer storages need to be specified. Actually then the kappa-core instance wouldn't store any state! So you could do, say:

var kappa = require('kappa-core')
var hsource = require('kappa-source-hypercore')
var tinybox = require('tinybox')
var raf = require('random-access-file')
var bkdview = require('kappa-view-bkd')
var level = require('level')

var core = kappa()
var src = hsource(tinybox(raf))  // stores `version` and `state` in a random-access-* store
var view = bkdview(level('./foo'))  // store just the spatial database details
core.use('spatial', src, view)  // hooks up the source and view instances

core.api.spatial.query([-40,40,-80,80], (err, res) => { /*...*/ })

hackergrrl commented 4 years ago

btw, kappa-core is on hypercore-protocol@7 now! :guitar:

Frando commented 4 years ago

So, now in some more words. I agree to noffle's remarks!

multiple sources: I do need them or think they are something for which a proper abstraction would be good to have. Like multifeed-index actually is a set of hypercore sources. But as long as it can be cleanly done in a multisource module, its all fine
adding a callback to signal back to the source that the view has completed indexing is good yes
and also removing the state storage from kappa and moving it towards the sources. I might want to add a SimpleStatefulSource or such to not have to rewrite the same code (similar to kappa-view-level)

I started to update the Kappa5 based on these observations. Before I continue I think I'd like for us to agree on the end result so that its not too much work rewriting things again.

Currently, a most simple example would look like this:

https://gist.github.com/Frando/21bc9e796544692b51de7e85edd1983a

Things to note:

Each use call creates a Flow, which is the combination of source + view. This makes things explicit, which is good. Its up to the consumer to create a source many times if it needs the same source for many views (thats how it always was, just happening inside kappa-core). I think I like this, and this also opens the door to possibly optimize for "one source for many views" scenarios
Both views and sources can expose an api. The view's api is mounted on kappa.api, the source's api on kappa.api.source. Is this good, or should this be structured differently?
One thing I'm still not totally sure is how the source can talk to its flow to request that pull be called again. Right now I pass the flow object into to open method, where it can then be stored somewhere, so that when the source has incoming messages, it can call flow.update to signal that its pull method should be called. Before (in the current kappa5 branch), it was passed into the constructor (there, the createSource constructor is called by kappa-core, now a constructed source would be passed in by the consumer - which is nice because its the same as with views).

Frando commented 4 years ago

I started upating the API after the discussions.

See https://github.com/Frando/kappa-core/tree/kappa5-new for now. Most tests are updated and pass.

Frando commented 4 years ago

This is continued in #14.

kappa-db / kappa-core

(WIP) Kappa5 #13