libp2p / research-pubsub

Research on PubSub algorithms for libp2p
https://github.com/libp2p/specs/tree/master/pubsub
MIT License
75 stars 6 forks source link

Designing PubSub #9

Open nicola opened 8 years ago

nicola commented 8 years ago

Note: This issue summarizes the work done on PubSub by the IPFS community (https://github.com/libp2p/pubsub/issues/1 (8 Aug), Q3 team week, Q3 workshop https://github.com/ipfs/2016-Q3-Workshop/issues/17) and a brain dump from some of my personal notes that may not be representative of everyone's view and understanding of the problem space.

Background

  • Developers want to be able to have dynamic content on their applications
  • Peers should be able to subscribe to particular content and when other peers publish the matching query, the subscribers should receive an update
  • ipfs/notes#pubsub Extensive list of pubsub conversations

    Literature review

  • [ ] find a way to collaboratively extract value from lit-reviews
  • Academic
  • [x]([XL pubsub paper]%28http://dl.acm.org/citation.cfm?id=2543583%29) - get background on distributed pub/sub
  • [x] https://github.com/libp2p/pubsub/issues/4 - collect a list of relevant papers and find a place and time to discuss them
  • Industrial (understand what went wrong/great and why)
  • [ ] collect a list of cases (e.g. https://github.com/libp2p/pubsub/issues/3)

    Designing PubSub v1

The aim is to collect use cases and requirements for a minimum viable pubsub v1.

Use cases

Note: The following is my personal view of the problem space (from our conversations and use cases), please chip in/argue by commenting, I will update accordingly

Terminology

  • Publishers: they are the authors of the new content that need to be published
  • Brokers: they can be subscribers, they receive messages from publishers and send them to subscribers
  • Subscribers: simply subscribe to messages
  • [ ] https://github.com/libp2p/pubsub/issues/5 - Add new terms to ipfs/glossary

    Problem space

  • [ ] What type of content can users subscribe to?
  • Permissioned vs Permissionless pubsub
  • tl;dr Users must subscribe to the content where the pubkeys of the publishers are known. If not, they are basically blindly subscribing to spam.
  • Permissioned:
  • (in theory) subscribers subscribe to a known set of publishers (each publisher is known by a public key)
  • (in practice) subscribers subscribe either to an IPRS or an IPNS (known set of pubkeys), the publishers of the content publishes to the network that route event to the subscribers.
  • (verdict) a lot of work has been done in this direction!
  • note: this does not mean that the numbers of subscribers must be fixed, new users can join/leave a permissioned network
  • Permissionless:
  • (in theory) subscribers subscribe to an identifier (which can also be a query), however, in order to avoid complete spam (since we don't know the pubkeys everyone can publish at that identifiers) we need to add some way to reduce spam, e.g. proof-of-work type of approach
  • (in practice) subscribers could subscribe to a genesis block of a blockchain (say bitcoin) and receive updates from the majority
  • (verdict) this is mostly an exciting gray area, anti-spam is hard and this would just turn pubsub into consensus/atomic broadcast problem

    Layers of abstraction

    (layer 3) Developers API .pub, .sub and events message, error

This layer is the final interface that should be used by application developers (a-la orbit)

This is layer is the data model that describes a sequence of events

This consists of peer discovery and event propagation (same as content-distribution).

This is the layer that routes an event from publisher (that has only a partial view of the subscribers - otherwise this would just this broadcast!) to the subscribers. There are different strategies for distributing events. If this layer change, there should be no change in the Data Model or Developers API.

Structured vs Unstructured networks

There are different strategies to do this:

Questions to answer:

The author of the content is a publisher, however they have a partial view of the subscribers, so some of those subscribers must be distributing the content themselves (becoming brokers themselves - since the content is authenticated, they could be considered publisher themselves too, they are just publishing someelse content).

This should define a low level interface where subscribers and publishers talk individually to other peers, for example exchanging peers, exchanging interests, basically a generic pubsub protocol. This can be helpful in case we are implementing different pubsub routers.


Note

  1. I haven't covered many things that we left pending or that I don't remember right now, please chip in - nothing of the above is cast in stone :)
  2. We have to create and link the issues related to most of the TODO items, so that we can have separate conversations outside of this issue

cc @jbenet, @diasdavid, @haadcode, @dignifiedquire

haadcode commented 8 years ago

Excellent work putting all this together @nicola! Will take time to read this through more carefully.

I might be jumping the gun here, and this might not be the right place to discuss particular details, but couple of things that caught my attention. And I've seen references to this in prior discussion as well as in the use cases: pubsub vs. message queues. I think the destinction needs to be made clear.

Define the need for snapshot - say that one peer has missed an entire year of updates, they can either download each individual update or get a final snapshot (maybe too complex but could be done with IPLD transforms)

That sounds a lot like a log, is that the case? Does the publisher need to know about your "history" to decide which messages to send to you? Does the subscriber need to know about their previous history with a topic?

Define the properties of this final layer (retrying subscription, ensure order, keeping a state of messages received & so on)

ensure order and keeping a state of messages received sound like message queue behaviour. Or am I talking about the wrong layer here? Do we see order and history to be desired properties of the pubsub design?

@nicola what are your thoughts on this? can you clarify and get me up to date?

jbenet commented 8 years ago

Do we see order and history to be desired properties of the pubsub design?

They are part of the desired "authenticated streams" pub/sub design. Not the bare-bones byte sequence broadcast. So "maybe", depending on how these layers shake out.

The point that many brought up during the workshop and around is that establishing the authentication invariant makes several pub/sub mechanisms much easier to think about and implement.

mitar commented 8 years ago

(I am not sure if this is the best place to comment or some other issue, but I looked around and it seems this is the most design-oriented issue.)

I have two things I would like to add here. One is from my experience with Meteor pub/sub, their migration to Apollo stack, and my experience with various data-flow systems I have worked on in the past. I think in general the question with pub/sub is: do you want to send over it only information that something changed and then the other subscriber can fetch it if they want, or are you sending the whole stuff immediately. Lately, I think the first approach is better because it allows you also to ignore some updates or batch them together, while on the second case you cannot really do that because that is already at your place. In some way, one question to ask is about rate limiting control, what if there are too many changes you (or peer on your path) cannot handle. What is backpressure story here?

The other is a use case we are developing. It is blockchain on top of IPFS and current approach is:

This last point I think is something where it would be great to have a pub/sub. So that we could say: please inform us of any change to that IPNS entry as it is happening. For us it would be OK that for pub/sub to work we would have to directly connect to that peer who owns the IPNS entry, because we care only about peers who are online at that moment. But it would be great if maybe that would be done automatically by IPFS. So that if you subscribe to some IPNS entry, IPFS automatically tries to optimize that with a direct connection.

gritzko commented 8 years ago

Speaking of stack layers. I believe that static data, Merkle structures and pub/sub can be all unified as a stream of IPLD frames.

By the way. From what I understand, IPLD does not have header+content frames. It is either-or. Am I right? Are JSON IPLD frames supposed to reference blob IPLD frames? For example, git objects have both (a header and content).

In case our primitive is a header+content frame, a partially ordered stream of frames is the waist of the stack. Pub/sub is just one use case.

Then, the core protocol is 3 messages (send/sub/unsub). No matter, whether we fetch a static file or subscribe to a topic. An intermediary node may not care at all whether it relays a video file or a real-time document editing session. Hash checks may work slightly differently of course, but the general action sequence is exactly the same:

  1. subscribe
  2. receive
  3. verify
  4. cache?
  5. relay?
  6. unsubscribe.

(where 1-4 repeat in a loop)

A position in a stream can be expressed by one-or-many hashes (cause of partial order). Alternatively, by an offset. (That gets complicated with partially ordered streams, but it is well-defined anyway.)

Obviously, implementations may differ. DHT, HTTP, anything.

Stream snapshotting is definitely a different layer of the stack, cause it depends on data semantics.

mitra42 commented 7 years ago

Stream snapshotting may be a different layer of the stack, but retrieving past state seems like a crucial part in many applications. I'm thinking for example of the "Facebook like" example. When a page that implements this opens, it wants to retrieve the current (usually short) list, and then subscribe to receive notifications. of updates.

Note this also brings up another level of scaling to think about - not just 1000's of browsers watching one stream but also very large numbers of small streams.

MichaelMure commented 7 years ago

Edit: moved to it's own issue: https://github.com/libp2p/research-pubsub/issues/14

ghost commented 6 years ago

The latest pubsub work is in libp2p/specs#73 -- gossipsub, the successor to floodsub