pub/sub - publish / subscribe

jbenet commented 8 years ago

We've known for some time we need to layer a pub/sub system on IPFS. We should try to reuse other things. The least work, the better. But it should be a simple protocol, easy to implement, well-layered, and meshing well with the rest of IPFS abstractions.

Requirements

very, very fast
flexible (maybe different topology-forming algorithms)
multiple modalities (single publisher, multiple publishers, etc)
support both encrypted and unencrypted streams (encrypted again, this is above the regular libp2p encryption -- and specific to the pub/sub group)
support privately encrypted channels (ie user supplied keys)
layers over IPRS to do discovery

We need to:

[ ] do a survey of relevant {literature, protocols, and implementations}.
[ ] decide on a protocol
[ ] build it into libp2p.

I likely won't have time to do a proper survey until late Nov or Dec. If you'd like to speed this along, post links to great {papers, systems} here for me.

Relevant to research:

https://github.com/ipfs/ipfs/issues/73 and #ipfs IRC logs.
XMPP and matrix.org
MQTT and other messaging queues.
all the multicast research
all the pub/sub research

davidar commented 8 years ago

Cc #42

bharrisau commented 8 years ago

It might be easier to start with a basic [slow] implementation before doing the high performance multicast P2P thing. For example, this paper (https://www.cs.utexas.edu/~yzhang/papers/mad-info09.pdf) has two modes depending on how active the group is.

The basic implementation would be as simple as:

Pick the IPNS address you want to subscribe to changes at
Append signed message with nodeID and TTL/expiry to DHT at IPNS address
Owner of IPNS address checks all subscriptions and sends a notifyIpnsUpdate message to each node

Encrypted streams with group-specific encryption, multicast and multiple publishers can then be deferred to the more advanced implementation.
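
A minimal in-memory sketch of the three basic steps above. Everything here is hypothetical: ToyDHT stands in for the real DHT, signatures are omitted, and send is a placeholder for whatever P2P message would carry notifyIpnsUpdate.

```python
from dataclasses import dataclass, field


@dataclass
class Subscription:
    node_id: str       # subscriber's node ID
    expires_at: float  # absolute expiry time, derived from the TTL


@dataclass
class ToyDHT:
    """Stand-in for the DHT: maps an IPNS address to its subscription records."""
    records: dict = field(default_factory=dict)

    def append(self, ipns_addr, sub):
        self.records.setdefault(ipns_addr, []).append(sub)

    def live_subscribers(self, ipns_addr, now):
        # Expired subscriptions are skipped.
        return [s.node_id for s in self.records.get(ipns_addr, []) if s.expires_at > now]


def subscribe(dht, ipns_addr, node_id, ttl, now):
    # Step 2: a real record would also carry a signature made by node_id.
    dht.append(ipns_addr, Subscription(node_id, now + ttl))


def publish_update(dht, ipns_addr, new_value, now, send):
    # Step 3: the owner reads all live subscriptions and notifies each node.
    for node_id in dht.live_subscribers(ipns_addr, now):
        send(node_id, ("notifyIpnsUpdate", ipns_addr, new_value))


dht = ToyDHT()
subscribe(dht, "/ipns/owner", "nodeA", ttl=60, now=0)
subscribe(dht, "/ipns/owner", "nodeB", ttl=10, now=0)  # expires before the update

sent = []
publish_update(dht, "/ipns/owner", "/ipfs/new-hash", now=30,
               send=lambda node, msg: sent.append((node, msg)))
print(sent)
```

Only nodeA is notified here, because nodeB's subscription has expired by the time the owner publishes.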

bharrisau commented 8 years ago

I guess the basic could be made even simpler by using the backlink ideas (ipfs/ipfs#31) instead of a new interface to the routing. You then only need a new P2P message to notify a node of changes.

spikebike commented 8 years ago

I stumbled upon this "Decentralized Reddit using a DHT to store content and a blockchain to rank it" https://news.ycombinator.com/item?id=10391996

The first comment seems particularly relevant: liamzebedee 21 hours ago Re: the hosting of topics/subreddits in the DHT, I've done quite a lot of research [1] into a very innovative yet not well known P2P publish-subscribe network design [2] from some Norwegian computer scientists that removes the role of hosting for nodes not interested in a topic, even designing a decentralised microblogging platform on top of it [3].

[1] http://liamz.co/wp-content/uploads/2015/03/Computer-Science-Extended-Essay_Liam-Edwards-Playne.pdf [2] http://acropolis.cs.vu.nl/~spyros/www/papers/PolderCast.pdf [3] BitWeav http://liamz.co/wp-content/uploads/2015/03/whitepaper.pdf

I'm reading the papers, but it seems like some very interesting discussion for pubsub. In particular Poldercast's finding a subset of nodes with a particular interest and only those nodes host that interest reminds me of IPFS's policy of not downloading anything unless you ask for it.

I'll read the papers before I make any specific pub/sub recommendations.

jbenet commented 8 years ago

@bharrisau agreed on starting simple to get something working and moving to more efficient constructions later.

@spikebike [1] link is broken.

davidar commented 8 years ago

@jbenet it should be

[1] http://liamz.co/wp-content/uploads/2015/03/Computer-Science-Extended-Essay_Liam-Edwards-Playne.pdf

spikebike commented 8 years ago

@davidar @jbenet thanks, right, fixed.

rabble commented 8 years ago

The XMPP stuff proved really hard in comparison to doing federation with PubSubHubbub.

jbshirk commented 8 years ago

Maybe not new to folks here, but new to me: https://hackpad.com/Probabilistic-data-structures-7UPPH2soDvw

jbshirk commented 8 years ago

Now I see that discussion about the use of Bloom filters is already underway: https://github.com/ipfs/ipfs/issues/31#issuecomment-55875124

jbenet commented 8 years ago

See also http://www.w3.org/TR/push-api/ (HT @nicola)

KrishnaPG commented 8 years ago

Came across this (IPFS) while looking for pubsub + rpc over webtorrent. Out of all the existing pubsub and rpc protocols, the one that comes closest to practical use is WAMP / Autobahn. It facilitates both pubsub and rpc over the web (meaning browser-to-browser rpc + pub/sub) and is light-weight enough to run on IOT devices (raspberry-pi etc.).

It's easy to set up and highly performant. However, the major problem with WAMP is that it is 'routed' / 'brokered'. If there is a way WAMP can be integrated with webRTC/bittorrent (webTorrent) for P2P, it would pave the way for next generation IOT apps.

For PubSUB - here is a sample of the functionality we are trying to achieve with IOT:

Imagine large file served through bittorrent (means, many chunks of the file are served from multiple sources)
Now, imagine these chunks are all updated by different sources / sensors independently (like rows in a databases)
Whenever a chunk is updated, all the readers connected to that chunk should be updated with that new data. Just like BitTorrent, except the connection between the file chunk and reader 'stays alive' (like comet / long-poll of http).

Similarly for RPC - here is one sample functionality that we are looking to achieve:

Imagine large scientific data file (once again spread across hosts as chunks served by, say bittorrent), each chunk containing large set of records
We should be able to run a function over all the records, but the data should not be moved across machines. Rather copies of the function should get executed as an RPC call over each host (with only part of the data local to that host), similar to HDFS + map-reduce.

Is it possible to achieve the above kind of functionality with IPFS? If not yet, I would strongly suggest considering integrating WAMP as the pubsub+rpc part of the protocol (rather than reinventing the wheel). Autobahn implementations of WAMP come with clients for many languages, and performance metrics are also very good.

fazo96 commented 8 years ago

@KrishnaPG as soon as pub/sub is implemented, you should be able to do all that! You can use IPNS to expose each node's chunks. You can aggregate the data by having a list of the nodes on each node. For RPC calls, you can't do that using IPFS, but you can have the nodes download the code to execute from somewhere using IPNS, publish their results, then you'd have to aggregate them.

You can't really do the RPC stuff you described but your usecase would work really well using only IPFS for all your networking (assuming it has pub/sub implemented) you just need to implement that functionality in a different way.

Also keep in mind that IPFS has global deduplication :+1: But pub/sub is not implemented yet.

KrishnaPG commented 8 years ago

Thank you @fazo96 .

Yes, you are right - Doing RPC (moving the code to the data and executing it locally on each chunk) may require an additional layer on top of IPFS (which involves additional functionality that involves job queues, schedulers, retry mechanism and result aggregators).

If we look at it, fundamentally RPC is kind of opposite to the basic file-read or pub/sub (in terms of data flow direction).

For example, in a simple file-read/copy, data goes from the disk/chunk --> the client/reader. Whereas, in RPC the data (the code to be executed) has to go from the client --> the disk/chunk and get executed, and the results should either go back to the client (if the size is small) or get stored as additional files/chunks on the local machine (if the results data size is large).

This requirement to be able to create additional chunks/files on the host locally may need support from the base IPFS, though.

On the other-hand, there is another radical way to look at this.

That is, treating every operation (including file_read, copy, delete etc.) as RPC call, and allowing transformations over basic operations. For example, consider this

                Fn()
client ------------------> chunk

In a normal read operation, the Fn = get_me_chunk(), with get_me_chunk being the usual built-in file-read Op.

And when we need to execute, say do_something on the chunk it would become Fn = do_something(get_me_chunk()) with both functions getting executed locally on that chunk. The client would send the do_something code to the chunk and get back results as usual (or gets back the details of additional chunks created and stored as results).

Fn here can be thought of being similar to HTTP Verb (GET/PUT etc.). The verb can take pre-defined functions (the usual CRUD ops), and also custom-defined ops (where the function code is passed along as the request body).
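
A toy sketch of this "everything is an RPC verb" model, under loud assumptions: ChunkHost and call are made-up names, and a Python callable stands in for code that would really be shipped in the request body (and would need sandboxing).

```python
class ChunkHost:
    """Toy host that holds one chunk and runs requested operations next to it."""

    def __init__(self, chunk):
        self.chunk = chunk
        # Built-in operations, analogous to the predefined verbs.
        self.builtins = {"get_me_chunk": lambda data: data}

    def call(self, verb="get_me_chunk", custom_fn=None):
        data = self.builtins[verb](self.chunk)
        # A custom function runs on the host, so only its result travels back.
        return custom_fn(data) if custom_fn is not None else data


host = ChunkHost(b"sensor-1,20\nsensor-2,22\n")

# Fn = get_me_chunk(): a plain read returns the chunk itself.
plain_read = host.call()

# Fn = do_something(get_me_chunk()): here do_something counts records.
line_count = host.call(custom_fn=lambda data: data.count(b"\n"))
print(plain_read, line_count)
```

The point of the shape is that a read and a custom computation go through the same entry point, differing only in the function applied next to the data.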

This model treats RPC as first-class citizen (where all the regular operations, such as file CRUD operations and notifications are implemented on top of RPC). Not sure how difficult/easy it would be to do this with present architecture, though.

For the pubsub, wondering how easy/complex it would be to reuse/integrate the Autobahn. If it can be done, then RPC comes for free on top of it (Demos).

jbenet commented 8 years ago

AFAICT, autobahn requires websockets, that's too much, we need something more basic that can run over any transport. we also need a pub/sub that can scale to millions of subscribers -- this isn't going to cut it: http://wamp-proto.org/why/#unified_routing -- we need a protocol that creates spanning trees/dags with fan out on its own, using measured latencies/bw in the network, etc. basically, a serious protocol from the multicast literature.

jbenet commented 8 years ago

@KrishnaPG you should read more into how IPFS works and how the protocols work. suggest also looking at:

KrishnaPG commented 8 years ago

Thanks @jbenet I was looking for the spec info, your pointers are helpful.

As for WAMP, yes - it started out as websocket based initially. Now, it is decoupled and works with any message based transport (http://wamp-proto.org/faq/#is_websocket_necessary_for_wamp)

However, my intention in pointing to WAMP was not to use it as is, but rather to adapt its pubsub+RPC part of the spec while removing the broker part (replacing it with whatever routing the ipfs uses, dht etc..)

WAMP is a perfect protocol for IOT requirements, but the strong dependency on router/broker is a deal-breaker.

jbenet commented 8 years ago

@KrishnaPG ah ok, good they generalized it

fsantanna commented 8 years ago

Hi, I see a lot of discussion on how to implement pub/sub, but not really about the semantics of pub/sub for IPFS (is it too obvious?). How will pub/sub be exposed to users (the API)? The idea is to have something as simple as

machine-1$ ipfs sub <topic>
<hash-1>  # receive when this is published
<hash-2>
...

machine-2$ ipfs pub <topic> <hash-1>

machine-3$ ipfs pub <topic> <hash-2>

Or am I missing something here?

fazo96 commented 8 years ago

@fsantanna

I think the bare minimum pub/sub should expose this kind of interface:

ipfs.sub(target_node_id, handler)

handler would be called every time IPFS finds a new hash published by target_node_id. Of course a more complicated API could be built, this is the bare minimum but still very useful implementation.

CLI version:

$ ipfs sub $target_node_id
/ipfs/...
/ipfs/...
# A new line is emitted every time a new record is found
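
To make the sub(target_node_id, handler) shape concrete, here is an in-memory sketch. ToyPubSub, the pub method, and the node IDs and paths are all invented for illustration; nothing here touches a real IPFS node.

```python
from collections import defaultdict


class ToyPubSub:
    def __init__(self):
        self.handlers = defaultdict(list)  # publisher node ID -> handlers

    def sub(self, target_node_id, handler):
        self.handlers[target_node_id].append(handler)

    def pub(self, node_id, ipfs_path):
        # Stands in for "IPFS finds a new hash published by node_id".
        for handler in self.handlers[node_id]:
            handler(ipfs_path)


ipfs = ToyPubSub()
received = []
ipfs.sub("QmPublisher", received.append)

ipfs.pub("QmPublisher", "/ipfs/hash-1")
ipfs.pub("QmOther", "/ipfs/hash-2")      # nobody subscribed to this node
ipfs.pub("QmPublisher", "/ipfs/hash-3")
print(received)
```

Note that the subscription key is the publisher's node ID rather than a topic, so the subscriber must already know who publishes.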

davidar commented 8 years ago

Re RPC: we've discussed this briefly before, but I'm not sure what the actual plans are

fsantanna commented 8 years ago

@fazo96 In this case, the subscriber has to know about (and subscribe to) every single potential content provider of his interest. Shouldn't pubsub decouple publishers from subscribers?

fsantanna commented 8 years ago

Hi, I couldn't resist and made a proof-of-concept implementation of pub/sub. :)

https://github.com/fsantanna/go-ipfs/blob/ipps/IPPS.md

I describe the desired API, show some examples (with screencasts), and discuss the naive implementation.

Thanks, Francisco

mitar commented 8 years ago

What about using https://en.wikipedia.org/wiki/PubSubHubbub ?

mitar commented 8 years ago

One more question, what about "promises" or "futures"? It would be great to be able to get some "bucket" (ID) in advance and once value is available user can add it under that bucket, resolving the promise/future. Of course the issue here is that in plain IPFS hash is based on the content, but content is not yet known at that moment.

Then pub/sub could be seen as a series of such promises. Where one value would tell you the next bucket on which to wait for next value. And so on. (Not sure how performant this would be, but it could be a nice minimal API.)
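
A small sketch of pub/sub as a chain of promises, assuming some mutable store where a bucket ID can be handed out before its content exists. BucketStore, resolve and follow are hypothetical names, and random IDs stand in for whatever naming scheme would actually be used.

```python
import uuid


class BucketStore:
    """Toy store: bucket ID -> (value, next bucket ID), filled in later."""

    def __init__(self):
        self.buckets = {}

    def new_bucket(self):
        bucket_id = uuid.uuid4().hex   # ID handed out before the content exists
        self.buckets[bucket_id] = None
        return bucket_id

    def resolve(self, bucket_id, value):
        # Resolving a bucket also allocates the bucket for the next value.
        next_id = self.new_bucket()
        self.buckets[bucket_id] = (value, next_id)
        return next_id

    def read(self, bucket_id):
        return self.buckets[bucket_id]   # None while the promise is pending


def follow(store, bucket_id):
    """Collect all values resolved so far; return them and the bucket to wait on next."""
    values = []
    while store.read(bucket_id) is not None:
        value, bucket_id = store.read(bucket_id)
        values.append(value)
    return values, bucket_id


store = BucketStore()
head = store.new_bucket()
tail = store.resolve(head, "first")
tail = store.resolve(tail, "second")

values, pending = follow(store, head)
print(values, pending == tail)
```

A subscriber only ever needs the head bucket: each resolved value tells it where to wait next.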

sneakin commented 8 years ago

I was thinking about p2p pubsub yesterday, not directly tied to IPFS, but the idea may apply since my toy system is a Kademlia network of content addressed storage servers. Joe Armstrong's gittorrent schtick has infested my mind as has IPFS & Tahoe-LAFS.

My idea for a solution was to have subscribers store matchers (on key/values, xquery, bytecode) on the network that need refreshing. These matchers are distributed like a Kademlia network distributes data: to the nodes that are nearest to some origin key. The matcher's keys for distribution would be derived from the keys the matcher checks.

Now subscribers have matchers located near where the data is expected to arrive, and these matchers know where to send any matches. So once a message is distributed according to its keys (really thinking email style key/values + body, or a standard json like body linked to a blob on the network), it hits a node that has a matcher that matches, sending the message to the subscriber/matcher owner.

For one to many I was thinking of matching on the publisher's key used to sign the message. That would be authenticated too. Authenticated many to many may be tricky. Unauthenticated spam channels would be arbitrary key/values: X-Geolocation: ABQ. Combine with an authenticated Sender spam may be limited.
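
A rough sketch of the matcher idea, assuming matchers are simple "header field equals value" tests and that both matchers and messages are routed to the node closest (by XOR distance) to the hash of "field:value". The class and function names, the key derivation, and the use of SHA-256 are all assumptions for illustration; matcher refresh and signatures are left out.

```python
import hashlib


def key_of(text):
    # 256-bit integer key; SHA-256 is an arbitrary choice for this sketch.
    return int.from_bytes(hashlib.sha256(text.encode()).digest(), "big")


def closest_node(node_ids, target):
    # Kademlia's notion of closeness: smallest XOR distance to the target key.
    return min(node_ids, key=lambda n: n ^ target)


class Network:
    def __init__(self, names):
        self.nodes = {key_of(n): [] for n in names}   # node ID -> stored matchers

    def store_matcher(self, field, value, subscriber):
        target = key_of(f"{field}:{value}")
        self.nodes[closest_node(self.nodes, target)].append((field, value, subscriber))

    def publish(self, headers, body):
        deliveries = []
        for field, value in headers.items():
            # The message is sent to the same node a matching matcher would live on.
            node = closest_node(self.nodes, key_of(f"{field}:{value}"))
            for m_field, m_value, subscriber in self.nodes[node]:
                if (m_field, m_value) == (field, value):
                    deliveries.append((subscriber, body))
        return deliveries


net = Network(["n1", "n2", "n3"])
net.store_matcher("X-Geolocation", "ABQ", "subscriber-1")

hit = net.publish({"X-Geolocation": "ABQ", "Subject": "hi"}, "hello from ABQ")
miss = net.publish({"X-Geolocation": "NYC"}, "hello from NYC")
print(hit, miss)
```

Because matcher and message derive the same key, they meet on the same node without the publisher knowing who subscribed.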

And this could probably morph into a real-time map/reduce network. But there's my drive-by help.

mitar commented 8 years ago

BTW, how does pub/sub relate to streaming? Will IPFS support streaming (is there a ticket for it)? So that one could stream movies or never-ending streams of data in real-time?

eboto commented 8 years ago

Had very similar thoughts to @mitar over the weekend. One nice property is that you can implement one-way pubsub trivially without changing anything about the current IPFS client or network, and without introducing new p2p message types of any kind. Here's a writeup on the topic:

Proposed low-perf IPFS/IPNS pubsub

fsantanna commented 8 years ago

Thanks @eboto .

After going through your document, one-way pubsub means:

Not directly possible to discover new content, since you need a channel identifier before hand.
Only the key owner can publish to that channel, since you need IPNS.
Publishing is centralized, since you need a precomputed forward link attached to the current post.

Is this correct?

In any case, not requiring any changes to IPFS is a huge benefit! It is a good drop in replacement for RSS.

hackergrrl commented 8 years ago

@eboto: nice work and write-up!

One thing I wasn't clear on was how this would differ in performance vs simply periodically polling the user's IPNS directly? If Alice shared /ipns/alicepubkey/blog/1 widely, I could poll for that until it resolves. Once it does, I could poll for /ipns/alicepubkey/blog/2 periodically, and so on. (Maybe she also publishes /ipns/alicepubkey/blog/index too, which lets me get speedy random access to any of her posts.)
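
The polling alternative can be sketched as a cursor over numbered paths. This is only an illustration: poll_once is a hypothetical helper, and a dictionary lookup stands in for IPNS resolution, which in reality is slow and may fail for other reasons than "not published yet".

```python
def poll_once(resolve, base, next_index):
    """Resolve consecutive posts starting at next_index; stop at the first gap."""
    found = []
    while True:
        target = resolve(f"{base}/{next_index}")
        if target is None:          # not published yet: retry on the next poll
            return found, next_index
        found.append(target)
        next_index += 1


published = {"/ipns/alicepubkey/blog/1": "/ipfs/post-1"}
resolve = published.get             # stand-in for IPNS resolution

first, cursor = poll_once(resolve, "/ipns/alicepubkey/blog", 1)

# Alice publishes her second post between two polls.
published["/ipns/alicepubkey/blog/2"] = "/ipfs/post-2"
second, cursor = poll_once(resolve, "/ipns/alicepubkey/blog", cursor)
print(first, second, cursor)
```

A real client would call poll_once on a timer; the cost being discussed is that every subscriber repeats these name resolutions whether or not anything changed.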

eboto commented 8 years ago

@fsantanna @noffle thanks for reading the post!

@noffle You may be right. I was working on a few assumptions -- are they incorrect?

I had assumed pushing preferable to polling. I thought enough pollers would challenge whatever system is responsible for name resolution.
I had assumed the wantlist provides a push solution, as I understand that the wantlist gets satisfied by a peer publishing that it has acquired a particular block, which I interpreted as a push.

Is there anywhere I can go to read more about how the router and wantlist behave? I have read the IPFS paper already. Telling me to just go read the code is also OK =)

@fsantanna, I hope I understand your points well. These are I think some workarounds:

HOW TO DISCOVER NEW CONTENT

It's true that with this scheme you do need to receive exactly one publication event to discover a content stream. But that link could be provided in various ways that facilitate discovery. e.g. Alice embeds Bob's first publish event as an ipfs:// link to the PayloadTree in her webpage. By clicking, and therefore wanting that link, her viewers can now subscribe to Bob too.

I think this is exactly how content gets discovered currently: you go to reddit and that links you to the content you're actually interested in.

MULTIPLE PUBLISHERS AND DISTRIBUTED PUBLICATION

You could simulate multiple-owner publication (and distributed publication) by creating a small network of peers that re-publish each other's work.

e.g. Imagine Alice and Bob are co-authoring their marriage vows vows.txt, which they want to share with Mama and Papa.

Alice and Bob subscribe to each other with the scheme in my previous post.
Mama subscribes to Alice. Papa subscribes to Bob.
Alice publishes an event with payload I, Alice, take you, Bob, as my husband. to /ipns/alice/vows/1
Bob and Mama receive the event, updating their vows.txt. Papa still doesn't have the event.
Bob immediately publishes an event with the same payload, but on his own stream: /ipns/bob/vows/1
Papa dereferences /ipns/bob/vows/1/data, and receives the data from either Alice, Bob, or Mama (each of which have already pinned a copy of the payload). He updates his vows.txt and puts Bob's next update (/ipns/bob/vows/1/next) on his wantlist, continuing his subscription to Bob.
Bob, not to be left out, publishes an event with payload I, Bob, take you, Alice, as my wife.
Papa and Alice receive the event, updating their vows.txt.
Mama receives the event via Alice the same way that Papa received the first via Bob.
At the end, all 4 vows.txt copies say I, Alice, take you, Bob, as my husband. I, Bob, take you, Alice, as my wife.

This solution sucks a bit because clients that subscribe to republishers pay the IPNS delay cost for each index of graph distance they subscribe to away from the original source. (e.g. Papa in the example above had to wait for Alice's IPNS to update, then for Bob's to update, before he could receive the payload from Bob)

You'd also need some scheme for concurrent update management and preventing infinite republications.
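
One possible way to stop infinite republication in the vows example is to have every peer remember which payloads it has already applied and republish only new ones. The sketch below assumes deduplication by payload hash, which is my assumption and not part of the proposal; it ignores IPNS delays and does not address concurrent updates.

```python
import hashlib
from collections import deque


class Peer:
    def __init__(self, name):
        self.name = name
        self.subscribers = []   # peers subscribed to this peer's stream
        self.seen = set()       # payload hashes already applied
        self.vows = []          # local copy of vows.txt, one entry per payload

    def receive(self, payload):
        digest = hashlib.sha256(payload.encode()).hexdigest()
        if digest in self.seen:
            return False        # duplicate: neither apply nor republish
        self.seen.add(digest)
        self.vows.append(payload)
        return True


def publish(origin, payload):
    # Each peer that newly accepts the payload republishes it on its own stream.
    origin.receive(payload)
    queue = deque([origin])
    while queue:
        peer = queue.popleft()
        for sub in peer.subscribers:
            if sub.receive(payload):
                queue.append(sub)


alice, bob, mama, papa = (Peer(n) for n in ("alice", "bob", "mama", "papa"))
alice.subscribers = [bob, mama]   # Bob and Mama subscribe to Alice
bob.subscribers = [alice, papa]   # Alice and Papa subscribe to Bob

publish(alice, "I, Alice, take you, Bob, as my husband.")
publish(bob, "I, Bob, take you, Alice, as my wife.")
print(papa.vows)
```

Even though Alice and Bob subscribe to each other, the loop terminates because a payload that was already seen is never republished.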

DISTRIBUTED PUBLICATION

You mentioned that publication is centralized in this model. That is true. However that may not be a problem considering these points:

Knowledge has to come from somewhere, so initial publication of some new datum is likely to come from a single peer.
If I understand the wantlist correctly (and it's very likely that I don't) I think that although publishing is centralized, distribution is distributed. A peer that is distant from the publisher in the network should be able to receive both the resolved publication event and that event's payload through the network.

amstocker commented 8 years ago

Why not integrate the Matrix protocol into IPFS? It could function as a pub/sub layer and would allow an IPFS node to interop with many other existing projects.

fazo96 commented 8 years ago

@amstocker looks like a client/server protocol, I don't think it's the right fit for IPFS Pub/sub because some nodes will need to talk directly and a direct TCP connection won't be possible and UDP packets won't arrive (for example due to NAT, or restrictive firewalls).

This could be solved by relaying through other nodes, and it doesn't look like the Matrix protocol was built with that in mind. Matrix also uses an account system, while IPFS uses asymmetric encryption for authentication.

We can learn from it for sure, but it doesn't look like a fit.

ara4n commented 8 years ago

@fazo96 Matrix is indeed split into server<->server and client<->server API, but that doesn't stop the client running in the same device/app as the server, at which point we have more p2p semantics rather than being so-called "client/server". The server<->server bits of it are currently HTTP-discovered-by-SRV-records, but we have some longer term plans to look at a purely p2p soln for server<->server using something like libp2p (https://github.com/matrix-org/GSoC/blob/master/IDEAS.md#peer-to-peer-matrix).

The account thing isn't a big obstacle as we support mapping any types of identifiers (public keys, email addresses, MSISDNs, whatever) onto matrix IDs, which are currently @username:domain style identifiers. Again, in future we're looking at switching to a fully decentralised DNS-less ID system - probably public keys again - as per https://github.com/matrix-org/GSoC/blob/master/IDEAS.md#decentralised-accounts.

Either way: the main thing that Matrix brings to the party is a set of semantics for eventually-consistent distribution of real-time comms data - specifically, a timeline of messages (expressed as a DAG, which may bifurcate and re-form), and a simple set of key-value data. One could layer this on top of IPFS semantics, or keep it in its own layer (as we're currently doing). Whatever happens, Matrix is called Matrix because we're obsessed with building bridges to everything - so it should be trivial to build bridges between an IPFS pubsub ecosystem and Matrix's decentralised communication DAGs :).

If you have any questions, drop by https://vector.im/develop/#/room/#matrix-dev:matrix.org.

fazo96 commented 8 years ago

@ara4n thanks for the detailed reply! I didn't know the Matrix protocol was so wide in scope.

mildred commented 8 years ago

@eboto have you started working on your pubsub proposal ?

reacting to @fsantanna :

Not directly possible to discover new content, since you need a channel identifier before hand.

Only the key owner can publish to that channel, since you need IPNS.

Is there anything preventing the secret key not to be that secret, but shared among those allowed to publish changes ? It could even go public.

Publishing is centralized, since you need a precomputed forward link attached to the current post.

Knowing the secret key, the forward link payload could be known as well.

elimisteve commented 7 years ago

Is there anything preventing the secret key not to be that secret, but shared among those allowed to publish changes ?

@mildred I'm wondering the same thing. I want to create shared folders of files that specific people can modify but that anyone in the world can read.

ec1oud commented 7 years ago

So is this likely to be the way forward? https://github.com/libp2p/go-floodsub

daviddias commented 6 years ago

To everyone following this thread, check out the latest Tutorial published by @pgte on how to use PubSub to create Real-Time apps over IPFS

Full discussion here: https://github.com/libp2p/research-pubsub/issues/18

daviddias commented 6 years ago

Folks here might be interested in this thread https://github.com/ipfs/ipfs/issues/244

RangerMauve commented 6 years ago

Is there a description of how the floodsub implementation works?

jachiang commented 6 years ago

+1

jachiang commented 6 years ago

How many hops does Floodsub currently propagate from sending node? Thanks!

Stebalien commented 6 years ago

Is there a description of how the floodsub implementation works?

Peers advertise the topics they're subscribed to to all connected peers.
When receiving a message on a subscribed topic for the first time, a peer will forward it to all connected peers (except the sender) that are subscribed to that topic.

If this sounds really dumb, it is. That's why it's still "experimental" (there's a lot of room for optimization).

How many hops does Floodsub currently propagate from sending node? Thanks!

Forever (cycles are prevented by remembering recently forwarded messages).
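
The behaviour described above can be simulated in a few lines. This is a toy model, not the go-floodsub code: the Peer class, message IDs, and direct method calls in place of network streams are all assumptions, and topics are only advertised to peers that are already connected at subscribe time.

```python
class Peer:
    def __init__(self, name):
        self.name = name
        self.topics = set()        # topics this peer subscribes to
        self.connections = []      # directly connected peers
        self.peer_topics = {}      # neighbour name -> topics it advertised
        self.seen = set()          # IDs of messages already handled
        self.inbox = []

    def subscribe(self, topic):
        self.topics.add(topic)
        for peer in self.connections:   # advertise to all connected peers
            peer.peer_topics.setdefault(self.name, set()).add(topic)

    def publish(self, msg_id, topic, data):
        self.seen.add(msg_id)
        self._forward(msg_id, topic, data, sender=None)

    def receive(self, msg_id, topic, data, sender):
        if msg_id in self.seen or topic not in self.topics:
            return                  # already handled, or not subscribed
        self.seen.add(msg_id)       # remembering the ID is what breaks cycles
        self.inbox.append(data)
        self._forward(msg_id, topic, data, sender)

    def _forward(self, msg_id, topic, data, sender):
        for peer in self.connections:
            if peer is not sender and topic in self.peer_topics.get(peer.name, ()):
                peer.receive(msg_id, topic, data, sender=self)


def connect(a, b):
    a.connections.append(b)
    b.connections.append(a)


a, b, c, d, e = (Peer(n) for n in "abcde")
for x, y in [(a, b), (b, c), (c, a), (c, d), (d, e)]:
    connect(x, y)
for peer in (a, b, c, e):           # d sits between c and e but is not subscribed
    peer.subscribe("t")

a.publish("msg-1", "t", "hello")
print(b.inbox, c.inbox, d.inbox, e.inbox)
```

In this topology b and c each get the message exactly once despite the a-b-c cycle, while e gets nothing because its only path runs through d, which is not subscribed.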

RangerMauve commented 6 years ago

So peers that aren't subscribed to a topic won't propagate it?

Stebalien commented 6 years ago

@RangerMauve yes. It would be nice to decouple subscribing and broadcasting (to allow passive rebroadcast nodes for certain topics) but you can currently just emulate that by subscribing and throwing the results away (with little overhead).

We don't want all nodes to just rebroadcast everything they receive. That would be a perfect DoS vector.

RangerMauve commented 6 years ago

@Stebalien Is there anything in place for discovering peers that are interested in a given topic?

jachiang commented 6 years ago

@Stebalien Q1) +1 for @RangerMauve's question on discovering peers' topic subscriptions

Q2) Also, if two peers subscribing to the same node are connected via a third node (not subscribing to topic), would that third node not bridge the two subscribing nodes by rebroadcasting? I tried this with 3 separate IPFS nodes, by bootstrapping the first two nodes located behind firewalls with only the address of the third node, and the two nodes received topic messages. So I assumed it was rebroadcast by the non-subscribed man in the middle ... did I miss something here?

Stebalien commented 6 years ago

@RangerMauve Sort of... I'm not sure about js-ipfs but the go-ipfs ipfs pubsub command has an optional 'discovery' option that uses the DHT. Unfortunately, this is kind of a hack. To register "interest" in a topic, one uses the provider system to provide a block with the content floodsub:$topic_name and peers looking for other peers interested in the topic search for peers providing that block.
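
A sketch of that discovery trick, with heavy caveats: ToyProviderIndex is a stand-in for the DHT provider system, and a plain SHA-256 hex digest stands in for the real identifier of the block whose content is floodsub:$topic_name. Only the block content pattern comes from the description above; everything else is hypothetical.

```python
import hashlib


def rendezvous_key(topic):
    # Content of the block that marks interest in a topic.
    block = f"floodsub:{topic}".encode()
    # Assumption: a SHA-256 hex digest stands in for the block's real identifier.
    return hashlib.sha256(block).hexdigest()


class ToyProviderIndex:
    """Stand-in for the DHT provider system: key -> set of provider peer IDs."""

    def __init__(self):
        self.providers = {}

    def provide(self, key, peer_id):
        self.providers.setdefault(key, set()).add(peer_id)

    def find_providers(self, key):
        return sorted(self.providers.get(key, set()))


index = ToyProviderIndex()
index.provide(rendezvous_key("chat"), "peerB")   # peerB registers interest in "chat"
index.provide(rendezvous_key("chat"), "peerA")

chat_peers = index.find_providers(rendezvous_key("chat"))
news_peers = index.find_providers(rendezvous_key("news"))
print(chat_peers, news_peers)
```

Every peer derives the same key from the topic name alone, so the provider lookup doubles as a rendezvous point.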

@jachiang

would that third node not bridge the two subscribing nodes by rebroadcasting

Are you sure the other two nodes weren't directly connected? IPFS tries pretty hard to connect to new peers and can bypass some NATs via hole punching. I'd take a look at the output of ipfs swarm peers on both of those peers.

RangerMauve commented 6 years ago

@Stebalien So is the floodsub:$topic_name pattern the way to go for bypassing the DHT not allowing arbitrary get/put? :P

Thank you for the information! Would a new pubsub implementation be welcome if it conforms to the same public interface?

ipfs / notes

pub/sub - publish / subscribe #64