Open nicola opened 8 years ago
Excellent work putting all this together @nicola! Will take time to read this through more carefully.
I might be jumping the gun here, and this might not be the right place to discuss particular details, but a couple of things caught my attention. I've seen references to this in prior discussion as well as in the use cases: pubsub vs. message queues. I think the distinction needs to be made clear.
Define the need for snapshot: say that one peer has missed an entire year of updates; they can either download each individual update or get a final snapshot (maybe too complex, but could be done with IPLD transforms)
That sounds a lot like a log, is that the case? Does the publisher need to know about your "history" to decide which messages to send to you? Does the subscriber need to know about their previous history with a topic?
Define the properties of this final layer (retrying subscription, ensure order, keeping a state of messages received & so on)
ensure order and keeping a state of messages received sound like message queue behaviour. Or am I talking about the wrong layer here? Do we see order and history to be desired properties of the pubsub design?
@nicola what are your thoughts on this? can you clarify and get me up to date?
Do we see order and history to be desired properties of the pubsub design?
They are part of the desired "authenticated streams" pub/sub design. Not the bare-bones byte sequence broadcast. So "maybe", depending on how these layers shake out.
The point that many brought up during the workshop and around is that establishing the authentication invariant makes several pub/sub mechanisms much easier to think about and implement.
(I am not sure if this is the best place to comment or some other issue, but I looked around and it seems this is the most design-oriented issue.)
I have two things I would like to add here. One is from my experience with Meteor pub/sub, their migration to the Apollo stack, and various data-flow systems I have worked on in the past. I think the general question with pub/sub is: do you send over it only a notification that something changed, letting the subscriber fetch the new content if they want, or do you push the whole content immediately? Lately I think the first approach is better, because it also lets the subscriber ignore some updates or batch them together; with the second approach you cannot really do that, because the data has already arrived at your place. One way to frame this is as rate-limiting control: what if there are more changes than you (or a peer on your path) can handle? What is the backpressure story here?
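To make the trade-off concrete, here is a minimal sketch of the "notify, then fetch" approach (all names are invented for illustration, not an existing API). Because the subscriber keeps only the latest pending hash and fetches at its own pace, a burst of notifications naturally coalesces into fewer fetches, which is one answer to the backpressure question:

```typescript
type Hash = string;

// Sketch: the publisher sends only a content hash; the subscriber decides
// what to fetch, so it can skip intermediate updates under load.
class NotifyThenFetchSubscriber {
  private pending: Hash | null = null;
  private fetching = false;

  constructor(
    private fetch: (h: Hash) => Promise<string>, // placeholder content fetch
    private onData: (data: string) => void
  ) {}

  // Called for every notification; only the latest hash is kept, so a flood
  // of updates degrades to a single fetch of the newest content.
  notify(hash: Hash): void {
    this.pending = hash;
    void this.drain();
  }

  private async drain(): Promise<void> {
    if (this.fetching) return;
    this.fetching = true;
    while (this.pending !== null) {
      const h = this.pending;
      this.pending = null; // anything notified during the fetch replaces this
      this.onData(await this.fetch(h));
    }
    this.fetching = false;
  }
}
```

With the push-the-whole-content approach, the skipping in `notify` is impossible: the bytes of every intermediate update have already crossed the wire.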
The other is a use case we are developing. It is blockchain on top of IPFS and current approach is:
/ipns/<id>/transactions
for the list of pending transactions as seen by the peer with ID id; we currently simulate subscription by regularly polling the IPNS endpoint for each peer.
This last point is where I think it would be great to have pub/sub, so that we could say: please inform us of any change to that IPNS entry as it happens. For us it would be OK if pub/sub required a direct connection to the peer who owns the IPNS entry, because we only care about peers who are online at that moment. But it would be great if IPFS did that automatically: if you subscribe to an IPNS entry, IPFS tries to optimize it with a direct connection.
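The polling workaround described above can be sketched as a generic resolve-and-compare loop (a sketch; `resolve` stands in for an IPNS resolution call and is not a real client method, and the loop is bounded here only to keep the example finite):

```typescript
// Simulating subscription to an IPNS name by polling. A real pub/sub
// subscription would replace this entire loop with a push notification.
async function pollIpns(
  resolve: (name: string) => Promise<string>, // returns the current target hash
  name: string,
  intervalMs: number,
  onChange: (hash: string) => void,
  rounds: number // bounded for the sketch; a real poller runs until unsubscribed
): Promise<void> {
  let last: string | undefined;
  for (let i = 0; i < rounds; i++) {
    const current = await resolve(name);
    if (current !== last) {
      last = current;
      onChange(current); // e.g. fetch the new /ipns/<id>/transactions list
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}
```

The obvious cost is that the polling interval bounds the update latency and wastes round trips when nothing changed, which is exactly what a subscription primitive would avoid.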
Speaking of stack layers. I believe that static data, Merkle structures and pub/sub can be all unified as a stream of IPLD frames.
By the way. From what I understand, IPLD does not have header+content frames. It is either-or. Am I right? Are JSON IPLD frames supposed to reference blob IPLD frames? For example, git objects have both (a header and content).
In case our primitive is a header+content frame, a partially ordered stream of frames is the waist of the stack. Pub/sub is just one use case.
Then, the core protocol is 3 messages (send/sub/unsub), no matter whether we fetch a static file or subscribe to a topic. An intermediary node may not care at all whether it relays a video file or a real-time document-editing session. Hash checks may work slightly differently, of course, but the general action sequence is exactly the same:
(where 1-4 repeat in a loop)
A position in a stream can be expressed by one or many hashes (because of the partial order), or alternatively by an offset. (Offsets get complicated with partially ordered streams, but they are well-defined anyway.)
Obviously, implementations may differ. DHT, HTTP, anything.
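The three-message core and the hash-set notion of position could be typed roughly like this (a sketch; all field and type names are invented, not part of any spec):

```typescript
type Hash = string;

// A header+content frame: the header links to preceding frames, which is
// what makes the stream partially ordered (a DAG, as with Merkle structures).
interface Frame {
  header: Hash[];      // hashes of preceding frames
  content: Uint8Array; // opaque payload
}

// The proposed 3-message core: the same shape whether the stream is a static
// file being fetched or a live topic being followed.
type Message =
  | { type: "sub"; stream: Hash; position: Hash[] } // subscribe from a position
  | { type: "unsub"; stream: Hash }
  | { type: "send"; stream: Hash; frame: Frame };

// A position is the "frontier" of frames already seen; a frame is new
// relative to a position if its hash is not in that frontier.
function isNew(position: Hash[], frameHash: Hash): boolean {
  return !position.includes(frameHash);
}
```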
Stream snapshotting is definitely a different layer of the stack, because it depends on data semantics.
Stream snapshotting may be a different layer of the stack, but retrieving past state seems like a crucial part of many applications. I'm thinking, for example, of the "Facebook like" example: when a page that implements this opens, it wants to retrieve the current (usually short) list, and then subscribe to receive notifications of updates.
Note this also brings up another level of scaling to think about - not just 1000's of browsers watching one stream but also very large numbers of small streams.
Edit: moved to its own issue: https://github.com/libp2p/research-pubsub/issues/14
The latest pubsub work is in libp2p/specs#73 -- gossipsub, the successor to floodsub
The aim is to collect use cases and requirements for a minimum viable pubsub v1.
Use cases
Requirements
PubSub layers
(layer 3) Developers API
This layer is the final interface that should be used by application developers (à la Orbit)
.sub(id, [set of keys in the network])
.pub(id, content)
.on('message', (content) =>)
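That developer-facing interface could be typed as below, with a toy in-memory implementation just to make it concrete (a sketch: the interface shape follows the three calls above, but the class, topic names, and key-set handling are all illustrative, not an existing implementation):

```typescript
// The layer-3 API as sketched above: sub/pub plus a message event.
interface PubSub {
  sub(id: string, keys?: string[]): void; // keys: the "set of keys in the network"
  pub(id: string, content: string): void;
  on(event: "message", handler: (content: string) => void): void;
}

// Toy single-process implementation; a real one sits on top of the
// data-model and routing layers described below.
class LocalPubSub implements PubSub {
  private handlers: ((content: string) => void)[] = [];
  private topics = new Set<string>();

  sub(id: string): void {
    this.topics.add(id);
  }
  pub(id: string, content: string): void {
    // Only deliver for topics this node subscribed to.
    if (this.topics.has(id)) this.handlers.forEach((h) => h(content));
  }
  on(_event: "message", handler: (content: string) => void): void {
    this.handlers.push(handler);
  }
}
```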
snapshot
- say that one peer has missed an entire year of updates; they can either download each individual update or get a final snapshot (maybe too complex, but could be done with IPLD transforms)
(layer 2) Data model
This layer is the data model that describes a sequence of events
(layer 1 and 0) Pubsub Routing (better name needed)
This consists of peer discovery and event propagation (same as content-distribution).
This is the layer that routes an event from the publisher (who has only a partial view of the subscribers; otherwise this would just be broadcast!) to the subscribers. There are different strategies for distributing events. If this layer changes, there should be no change in the Data Model or the Developers API.
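As a strawman for "event propagation with a partial view", plain flooding (the strategy floodsub uses) can be sketched as follows; the `Peer` shape and function names are invented for illustration:

```typescript
// Strawman flood routing: each peer relays an event to every neighbour it
// knows about, deduplicating by event id so the flood terminates on cycles.
interface Peer {
  id: string;
  neighbours: Peer[];   // the partial view: only some peers are known
  seen: Set<string>;    // event ids already relayed
  received: string[];   // payloads delivered to this peer
}

function flood(peer: Peer, eventId: string, payload: string): void {
  if (peer.seen.has(eventId)) return; // already relayed: stop here
  peer.seen.add(eventId);
  peer.received.push(payload);
  for (const n of peer.neighbours) flood(n, eventId, payload); // relay onward
}
```

This illustrates the point above: the publisher only needs its neighbours, and subscribers become relays themselves; the cost is redundant traffic, which is what structured strategies try to reduce.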
Structured vs Unstructured networks
There are different strategies to do this:
Questions to answer:
The role of the peers
The author of the content is a publisher; however, they have a partial view of the subscribers, so some of those subscribers must be distributing the content themselves (becoming brokers; and since the content is authenticated, they could be considered publishers themselves too, just publishing someone else's content).
Pubsub router interface
This should define a low-level interface where subscribers and publishers talk individually to other peers: exchanging peers, exchanging interests, basically a generic pubsub protocol. This can be helpful if we implement different pubsub routers.
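One possible shape for such a router interface, with a trivial single-node stub only to make it concrete (a sketch, not a spec; every name here is invented, and a real router would implement these calls over libp2p streams):

```typescript
type PeerId = string;
type Topic = string;

// Generic router interface so different strategies (flooding, gossip,
// tree-based) can be swapped behind the same calls.
interface PubSubRouter {
  // peer management
  addPeer(peer: PeerId): void;
  removePeer(peer: PeerId): void;
  // interest exchange: which topics each side cares about
  announceInterest(topics: Topic[]): void;
  onInterest(handler: (peer: PeerId, topics: Topic[]) => void): void;
  // event propagation
  publish(topic: Topic, data: string): void;
  onEvent(handler: (topic: Topic, data: string) => void): void;
}

class StubRouter implements PubSubRouter {
  peers = new Set<PeerId>();
  private eventHandlers: ((t: Topic, d: string) => void)[] = [];
  private interestHandlers: ((p: PeerId, t: Topic[]) => void)[] = [];

  addPeer(p: PeerId): void { this.peers.add(p); }
  removePeer(p: PeerId): void { this.peers.delete(p); }
  announceInterest(_topics: Topic[]): void { /* would notify connected peers */ }
  onInterest(h: (p: PeerId, t: Topic[]) => void): void { this.interestHandlers.push(h); }
  publish(topic: Topic, data: string): void { this.eventHandlers.forEach((h) => h(topic, data)); }
  onEvent(h: (t: Topic, d: string) => void): void { this.eventHandlers.push(h); }
}
```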
Testing the network
Note
cc @jbenet, @diasdavid, @haadcode, @dignifiedquire