libp2p / research-pubsub

Research on PubSub algorithms for libp2p
https://github.com/libp2p/specs/tree/master/pubsub

Filtering beyond topics #13

Open Beanow opened 7 years ago

Beanow commented 7 years ago

Checking out #3, I recalled some experiences with Bitcoin's relay network.

Bitcoin, similar to Ethereum and other blockchain-based systems, has a TCP-based relay for sending data between peers.

The crux of it is the `inv` message, which, unlike the various `getX` commands, can be sent unsolicited and announces to peers which objects a node has become aware of.

| Field Size | Description | Data type | Comments |
| --- | --- | --- | --- |
| 4 | type | uint32_t | Identifies the object type linked to this inventory |
| 32 | hash | char[32] | Hash of the object |
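For illustration, here is a rough Go sketch of that inventory vector. The struct and constant names are mine, not Bitcoin's actual wire-format code, though the type values mirror the protocol's MSG_TX/MSG_BLOCK assignments.

```go
// Object types from the table above; 1 and 2 are the values
// Bitcoin's protocol assigns to transactions and blocks.
const (
	MsgTx    uint32 = 1 // hash is a transaction id
	MsgBlock uint32 = 2 // hash is a block header hash
)

// InvVect is one inventory entry, mirroring the table above.
type InvVect struct {
	Type uint32   // identifies the object type linked to this inventory
	Hash [32]byte // hash of the object
}

// An inv message is simply a list of such vectors, sent unsolicited.
type MsgInv struct {
	Inventory []InvVect
}
```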

The `inv` messages can be seen as the pubsub of several high-traffic topics, while the `getX` commands can be seen as IPFS's normal wantlist-based operation.

Filters

Spamming with `inv` soon became too much for thin clients, so bloom filters were introduced as a tradeoff between bandwidth savings and privacy (the filter reveals to some degree which addresses a node is interested in). See `filterload`, `filteradd`, `filterclear`, `merkleblock` and BIP 0037.
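To make the mechanism concrete, here is a minimal Go sketch of such a filter. BIP 37 specifies murmur3 with tweaked seeds; salted FNV-1a stands in for it here, and all names are illustrative.

```go
import "hash/fnv"

// BloomFilter sketches the filter a thin client loads onto a peer
// via filterload and extends via filteradd.
type BloomFilter struct {
	bits []byte // the bit field sent over the wire
	k    uint32 // number of hash functions
}

// index derives the i-th bit position for an element.
func (f *BloomFilter) index(data []byte, i uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte{byte(i)}) // per-function salt (BIP 37 uses seeded murmur3)
	h.Write(data)
	return h.Sum32() % uint32(len(f.bits)*8)
}

// Add inserts an element, e.g. an address the client cares about (filteradd).
func (f *BloomFilter) Add(data []byte) {
	for i := uint32(0); i < f.k; i++ {
		idx := f.index(data, i)
		f.bits[idx/8] |= 1 << (idx % 8)
	}
}

// MayContain is what the relaying peer checks before forwarding an inv entry.
// False positives are possible (that's the privacy knob); false negatives are not.
func (f *BloomFilter) MayContain(data []byte) bool {
	for i := uint32(0); i < f.k; i++ {
		idx := f.index(data, i)
		if f.bits[idx/8]&(1<<(idx%8)) == 0 {
			return false
		}
	}
	return true
}
```

The false-positive rate is what makes the bandwidth/privacy tradeoff tunable: a sparser filter leaks less about the client's interests but wastes more bandwidth on irrelevant matches.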

Similarly, a filter for minimum transaction fees is available, primarily to stop mempool attacks. See BIP 133.
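That message's payload is tiny; a sketch (the struct and field names are mine):

```go
// MsgFeeFilter sketches BIP 133's feefilter payload: peers should not
// relay transactions paying less than this fee rate to the sender.
type MsgFeeFilter struct {
	FeeRate uint64 // satoshis per 1000 bytes
}
```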

I think the takeaway here is: for high-traffic topics, asking your peers to filter their messages further may be required.

OrbitDB, for example, currently maps databases and topics 1:1. If your topic is `npm-package-metadata`, you've got yourself a shitstorm of messages.

Problems with that

The problem with filters like these is that they are typically applied to the data rather than to the object hashes, and IPFS clients are agnostic to the data. So I believe this is an interesting problem to think about.

Obviously, sharding strategies might be applied to the topics, but you still get flooded by whatever is associated with a shard. Take NPM again: if you had 450k packages that each have an entry that gets updated every so often, and suppose we shard across an arbitrary 512 topics, then with a perfect hashing/routing algorithm we would have about 879 packages per shard. Much better, but far from ideal for various use-cases.
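To make the arithmetic concrete, here is a sketch of such a naive sharding scheme in Go; the topic naming and hash choice are mine, purely for illustration:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numShards = 512 // the arbitrary shard count from above

// shardTopic maps a package name onto one of numShards topics.
// With ~450k packages this averages out to ~879 packages per shard.
func shardTopic(pkg string) string {
	h := fnv.New32a()
	h.Write([]byte(pkg))
	return fmt.Sprintf("npm-package-metadata/%d", h.Sum32()%numShards)
}

func main() {
	for _, pkg := range []string{"left-pad", "react", "lodash"} {
		fmt.Println(pkg, "->", shardTopic(pkg))
	}
}
```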

Say we have 10 specific packages we're interested in for some continuous integration setup. With bad luck, those are all in different shards, so you have to download and sift through the updates for 10 × 879 ≈ 8,790 packages. And the other packages may be updated far more frequently than the ones we're interested in. The only way to know whether we can discard hashes we get from pubsub is to obtain the associated data, parse it, and then filter based on the data. In other words, we cannot discard them at all.

So even if the IPFS client itself may not be able to apply the filters to the data rather than to the hashes, having a well-understood protocol for sending filter requests to your peers could be extremely useful for any kind of high-traffic topic.
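To sketch what such a protocol might look like (entirely hypothetical; none of these message or field names exist in libp2p today), a subscriber could load a filter per topic, and the relaying peer would consult it before forwarding:

```go
import "hash/fnv"

// FilterRequest is a hypothetical control message: "for this topic, only
// forward me messages whose key matches this filter". The bits could be
// a bloom filter like the one sketched above.
type FilterRequest struct {
	TopicID string
	Bits    []byte // serialized filter
	K       uint32 // number of hash functions used to build it
}

// PeerFilters holds the loaded filters, keyed by peer ID, then topic ID.
type PeerFilters map[string]map[string]FilterRequest

// ShouldForward is what a relaying peer evaluates per message, per peer.
// msgKey is whatever the topic agrees to filter on, e.g. a package name.
func (pf PeerFilters) ShouldForward(peer, topic string, msgKey []byte) bool {
	req, ok := pf[peer][topic]
	if !ok {
		return true // no filter loaded: forward everything (today's behaviour)
	}
	for i := uint32(0); i < req.K; i++ {
		h := fnv.New32a()
		h.Write([]byte{byte(i)})
		h.Write(msgKey)
		idx := h.Sum32() % uint32(len(req.Bits)*8)
		if req.Bits[idx/8]&(1<<(idx%8)) == 0 {
			return false // definitely not of interest: drop before sending
		}
	}
	return true // may be of interest (false positives possible)
}
```

The key design point, as with Bitcoin's `filterload`, is that filtering happens at the peer link rather than in the topic structure, so different subscribers to the same topic can ask for different slices of the traffic.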

You might say: make 450k topics, one per package, and subscribe to those 10 packages. But that would force people who want to host an NPM registry to subscribe to all 450k topics. That would itself create terrible overhead between peers, as well as add lots of churn in the merging of the database's oplogs.

Using topics alone forces you to make one solution for all use-cases of the underlying data: a you-can't-please-everyone situation. So I suggest looking into peer-level controls like these filters.

mitar commented 7 years ago

See my ideas about this here: https://github.com/ipfs/go-ipfs/issues/3741