ipfs / notes

IPFS Collaborative Notebook for Research
MIT License

IPFS database: pubsub, consistency and persistence #244

Open pgte opened 7 years ago

pgte commented 7 years ago

This topic has probably been touched on in various other notes, but I feel I need a specific place to address some of my concerns about building an eventually consistent key-value store on top of IPFS. I'm sure there is a huge body of knowledge out there on this subject, and I would like to gather some of it here. I also know some things about how other databases try to solve this, but I'm more interested in how to do it using the primitives that IPFS already provides.

Partitioning and scaling

Instead of a global store, approaches like OrbitDB split a database into "tables" or "partitions" (I'll just say "partition" from now on, as "table" reminds me too much of relational DBs). This model allows a database to scale, and confines the messages related to a partition to the nodes that are interested in it.

Assume that each operation is appended to a log, and that this log is saved onto IPFS. The hash of the latest head of the log is then broadcast over a pubsub channel that is specific to that partition.

Each node that is interested in a given partition subscribes to updates on that partition, and keeps receiving updates on the latest known head. When a node gets an update, it uses IPFS to retrieve the content of the head (and its parents) until it has all the operation-log data it needs to apply to the database. (Conflicts may arise, and each node must be able to resolve them deterministically, but I think that discussion belongs in the realm of CRDTs and related topics.)
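The "walk back from the head until you have everything" step could look something like this sketch, where `fetch` stands in for an IPFS block fetch (here just a Map lookup) and the Map contents are a made-up toy log:

```javascript
// On receiving a new head, walk parent links back until we reach entries we
// already know, then return the missing operations oldest-first so they can
// be applied to the local database in order.
function syncFrom(head, fetch, known) {
  const missing = [];
  let cursor = head;
  while (cursor !== null && !known.has(cursor)) {
    const entry = fetch(cursor); // would be an IPFS DAG get in a real system
    missing.push({ hash: cursor, entry });
    cursor = entry.parent;
  }
  return missing.reverse(); // oldest-first
}

// Toy remote log h1 <- h2 <- h3, where we already have h1 locally.
const remote = new Map([
  ['h1', { op: 'put a', parent: null }],
  ['h2', { op: 'put b', parent: 'h1' }],
  ['h3', { op: 'put c', parent: 'h2' }],
]);
const toApply = syncFrom('h3', h => remote.get(h), new Set(['h1']));
// toApply now holds the entries for h2 and h3, in that order.
```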

Unreliable message delivery

But, as we know, pubsub does not have reliable delivery, which means that messages can be lost. Poorly connected nodes and new nodes need a way to query the latest head of a given partition. This can be a problem, which can be circumvented in several ways; I enumerate a few:

  1. If there are enough operations on that partition, the pubsub messages pertaining to it will be frequent enough that all interested nodes will eventually get one.

  2. Instead of a single broadcast when there is a new operation, if the nodes that participate in a partition keep re-broadcasting the head (at least while they think the network has interest in that partition), all interested nodes will eventually get the latest head.

I think that, for some use cases, scenario 1 may be too weak, since it's bound to database activity.

Scenario 2 provides stronger consistency and persistence guarantees, but at the expense of network activity, which, if not carefully designed, increases linearly with the number of nodes (even though it is limited to the nodes that participate in that partition). Some mechanisms to soften this would be to:

a) broadcast when a new node is added to the topic
b) back off the frequency of broadcasts once activity stops
c) broadcast less frequently as the number of interested nodes increases
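To make (b) and (c) concrete, here is one possible shape for the broadcast scheduler; the function name and all constants are illustrative, not tuned values:

```javascript
// Compute the delay until the next head re-broadcast: back off exponentially
// as the partition goes quiet (b), and stretch the interval as more peers
// share the re-broadcast duty (c).
function nextBroadcastDelay(msSinceLastOp, peerCount, baseMs = 1000, maxMs = 60000) {
  // (b) double the interval for every full minute of inactivity
  const idleMinutes = Math.floor(msSinceLastOp / 60000);
  const backoff = baseMs * Math.pow(2, idleMinutes);
  // (c) with N interested peers, each peer broadcasts roughly 1/N as often
  const shared = backoff * Math.max(1, peerCount);
  return Math.min(shared, maxMs); // never go fully silent
}
```

The cap (`maxMs`) matters: without it, a long-idle partition would effectively stop announcing its head, and newly joining nodes would have nothing to latch onto.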


Is this a valid concern? If so, what are your thoughts or experience on these?

// pinging @diasdavid @haadcode @gritzko

daviddias commented 7 years ago

Thank you for writing these notes @pgte! Really happy to have you working on these parts of the stack.

All concerns are valid. The way I would like to approach this is to identify the requirements of the use cases we want to support, then understand which parts need to be built to create such a system, and what we can use today (or quickly ship) to mitigate some of the problems (i.e. unreliable messages), always with the distributed nature in mind (no patching by putting a centralised DB somewhere).

Another question is: what can be built without attaching new protocols to IPFS directly (i.e. libp2p.handle(<new proto>)), so that this system relies on standard IPFS primitives? As an example, finding the head could be done through IPNS, or simply by mounting a protocol on top of libp2p that offers an RPC call asking neighbour nodes what the head is; that way, when a node joins, it can always fetch the head.
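For the head-discovery RPC idea, the interesting part is what a joining node does with the replies. Gathering them would be a libp2p request/response round; the merge step could be a pure function like this sketch (hypothetical, not an existing API), which keeps every distinct answer, since concurrent writers can legitimately produce several heads that the log-merge layer must later reconcile:

```javascript
// Given head replies from neighbours (null for peers that had none),
// return the distinct heads in a deterministic order, so every node
// processes them the same way.
function mergeHeadReplies(replies) {
  const heads = new Set();
  for (const h of replies) {
    if (h !== null && h !== undefined) heads.add(h);
  }
  return [...heads].sort(); // deterministic order across nodes
}
```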

Note that this also overlaps with the work happening at ipfs-cluster.

I believe what we need is:

pgte commented 7 years ago

Use cases

By "IPFS database" I mean: a universal key-value store (supporting the LevelDown interface), where each value is opaque and atomic; homogeneous, fully distributed and eventually consistent; supporting multiple concurrent writers; with storage backed by IPFS.
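The smallest possible caricature of that store, with put/get/del expressed as operations appended to a log and state derived by replay (synchronous and in-memory here for brevity; the real thing would be callback-based per LevelDown and IPFS-backed):

```javascript
// A key-value store whose only source of truth is an operation log.
// Reads replay the log; the last matching operation for a key wins.
class LogStore {
  constructor() {
    this.log = []; // would be the hash-linked IPFS log in a real system
  }
  put(key, value) {
    this.log.push({ type: 'put', key, value });
  }
  del(key) {
    this.log.push({ type: 'del', key });
  }
  get(key) {
    let value; // undefined means "not present"
    for (const op of this.log) {
      if (op.key !== key) continue;
      value = op.type === 'put' ? op.value : undefined;
    }
    return value;
  }
}
```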

DHT/IPNS

I don't know much about IPNS, but I've seen performance concerns about it not being usable for head propagation (source) - please correct me if I'm wrong. (If this is the case, I assume we'd have to fall back to pubsub or a custom protocol on top of libp2p.)

Perhaps IPNS could be used for a complete reboot, but I have this concern (not sure if it's valid): how could concurrency be handled in IPNS or the DHT? Could we build some way of managing multiple writers reliably and deterministically? (Perhaps by using it to store a vector clock instead of just the head?)
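To make the vector-clock idea concrete, a sketch of the two operations it would need (writer ids and the object representation are illustrative): each record carries a map of writer id to counter, comparing two clocks tells us whether one update strictly follows another or whether they are concurrent, and merging takes the per-writer maximum.

```javascript
// Compare two vector clocks: 'after' / 'before' / 'equal' / 'concurrent'.
function compareClocks(a, b) {
  let aAhead = false, bAhead = false;
  for (const id of new Set([...Object.keys(a), ...Object.keys(b)])) {
    const av = a[id] || 0, bv = b[id] || 0;
    if (av > bv) aAhead = true;
    if (bv > av) bAhead = true;
  }
  if (aAhead && bAhead) return 'concurrent'; // needs a deterministic tie-break
  if (aAhead) return 'after';
  if (bAhead) return 'before';
  return 'equal';
}

// Merge two clocks by taking the per-writer maximum.
function mergeClocks(a, b) {
  const out = {};
  for (const id of new Set([...Object.keys(a), ...Object.keys(b)])) {
    out[id] = Math.max(a[id] || 0, b[id] || 0);
  }
  return out;
}
```

The 'concurrent' case is exactly where a deterministic resolution rule (the CRDT discussion above) would have to kick in.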

libp2p or custom protocol

A pubsub broadcast is surely simpler than a custom protocol, but I believe a custom protocol could be much better optimised for traffic, latency and convergence. I would like to explore this a bit more. I'm fairly sure there is already some protocol out there that supports clusters with high churn. (Perhaps this is where it overlaps with some concerns of ipfs-cluster? Or not..)

Anyway, as you said, I'm more interested in whether and how we could overlay this on top of IPFS as it is today (at least as a starting point), which would bring us back to pubsub.