ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

IPFS pin additions #6006

Open markg85 opened 5 years ago

markg85 commented 5 years ago

Hi,

I'm not quite sure if I should post this here (js-ipfs) or at https://github.com/ipfs/ipfs as an API update. Please correct me if it doesn't belong here.

Yesterday I wrote a Push notification API proposal. Again, for that one I'm also not sure whether I posted it in the right location. In that proposal I'm suggesting a new pin option: ipfs pin --till-delivered

Here I'd like to extend that to: ipfs pin --till-delivered --target \<CID> --expire-in \<future date>

Use case: when you have files that should be downloaded only once, or shared with one specific target. File sharing services sometimes have an option like that, making a file available at location X for time Y for someone to download. The same logic can be used for push notifications: the notification needs to stay pinned until it has been received.

To elaborate: IPFS probably needs some mechanism that would allow a request like "Hey, I'm \<cid>, is there a new pin intended for me?". That would allow the target to handle the object. The "--expire-in" flag merely allows some automatic pin cleanup for when something has been received and is then no longer required to be available. It can still be available, that's fine; it just doesn't have to be anymore.

I'm looking forward to hearing your thoughts.

Cheers, Mark

MikeFair commented 5 years ago

Some thoughts:


It's not clear to me how you would validate the recipient unless it was another pubKey identifier and not just some random CID. When I connect to a node, I can claim to be any CID I wish.

If the message CID has been stored on multiple nodes, not all of which are aware of the pin; and the recipient gets the message from one of those caching nodes, how does the pin service get informed the message was delivered?


It seems like you'd want a special/new kind of PubSub channel/message type for this.

The way I see this is you have points (SRC) (of which there could be several) that are holding a message for point (DST) (a singular destination); in between are a whole lot of peer nodes that can form a path between SRC and DST.

I'm assuming SRC is online most of the time and DST is online only intermittently. I'm assuming PubSub will be used to inform DST of a message presence at SRC.

In my mind's eye SRC subscribes to a PubSub DST topic ID. Say the topic is called "/IPMQ/[DST]"

When DST comes online it subscribes to the "/IPMQ/[DST]/MSGS" topic.

The DST identity then signs and publishes an "I'm here" message to the "/IPMQ/[DST]" topic using its private key. It signs the message so all the subscribers can validate that DST is actually the one that sent it.

All the SRC nodes subscribed receive the message and validate that DST sent it. The SRC nodes then broadcast the CID of their messages to the "/IPMQ/[DST]/MSGS" topic.

The DST node receives these announcements and begins requesting the CIDs of the messages. It puts together the list of all CIDs successfully retrieved then signs and publishes an "I've received: [list,of,CIDS]" message to the "/IPMQ/[DST]" topic all the SRC nodes are subscribed to.

The SRC nodes validate the message came from DST and scan the list for the CIDs they were responsible for. For every CID on the list that SRC was responsible for, it marks the CID as complete/delivered and unpins it. If there are no more messages for DST, the SRC node can unsub from the "/IPMQ/[DST]" topic.

If SRC is offline, then when it comes online, it subscribes to the "/IPMQ/[DST]" topic and broadcasts the list of MSGIDs to the "/IPMQ/[DST]/MSGS" topic. It was not online when DST announced its presence and does not know if it is there. SRC also does this as soon as the MSG is first requested to be delivered.

If the SRC is offline at the time DST gets MSG, it won't see the "I've received:" message. When it comes online, it will inform DST of the MSG a second time. It is up to DST to understand this is an announcement for an already received message and publish the CID in a new "I've Received:" message to inform the announcing SRC the message has been received.
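If it helps make the flow concrete, here is a rough, self-contained Go sketch of that handshake, with a toy in-memory bus standing in for libp2p PubSub. The /IPMQ topic names and message strings are assumptions from this thread, not an existing API:

```go
package main

import "fmt"

// Bus is a toy in-memory pub/sub standing in for libp2p PubSub.
type Bus struct {
	subs map[string][]chan string // topic -> subscriber channels
}

func NewBus() *Bus { return &Bus{subs: make(map[string][]chan string)} }

func (b *Bus) Subscribe(topic string) chan string {
	ch := make(chan string, 16) // buffered so this single-threaded demo never blocks
	b.subs[topic] = append(b.subs[topic], ch)
	return ch
}

func (b *Bus) Publish(topic, msg string) {
	for _, ch := range b.subs[topic] {
		ch <- msg
	}
}

// deliver walks through the handshake and returns how many messages
// SRC still holds pinned afterwards (0 on success).
func deliver() int {
	bus := NewBus()
	dst := "QmDst" // the recipient's identity (hypothetical)

	// SRC holds one message for DST and listens on the control topic.
	pending := map[string]bool{"QmMsgCID": true}
	ctrl := bus.Subscribe("/IPMQ/" + dst)

	// DST comes online: subscribe to its MSGS topic and announce presence
	// (the announcement would be signed in a real implementation).
	msgs := bus.Subscribe("/IPMQ/" + dst + "/MSGS")
	bus.Publish("/IPMQ/"+dst, "HELLO "+dst)

	// SRC sees the hello and announces the CIDs it is holding.
	if hello := <-ctrl; hello == "HELLO "+dst {
		for cid := range pending {
			bus.Publish("/IPMQ/"+dst+"/MSGS", "HAVE "+cid)
		}
	}

	// DST fetches each announced CID (fetch elided) and acknowledges receipt.
	cid := (<-msgs)[len("HAVE "):]
	bus.Publish("/IPMQ/"+dst, "RECEIVED "+cid)

	// SRC validates the ack and unpins the delivered message.
	if ack := <-ctrl; ack == "RECEIVED "+cid {
		delete(pending, cid)
	}
	return len(pending)
}

func main() {
	fmt.Println("pending after delivery:", deliver()) // prints 0 on success
}
```

In a real deployment the HELLO and RECEIVED messages would carry signatures validated against DST's public key, as described above.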


Mike

markg85 commented 5 years ago

Hi @MikeFair

Thank you for the interesting and complete writeup :) My proposal was more of an "I like this and I think this is one way it could be done". If PubSub is better suited, then that should obviously be used.

Regardless of the method, I do think some form of persistent messages (that don't have to be delivered immediately, but as soon as you come online) is a missing feature that would enable quite a lot of great new functionality. For static sites, for example, it allows just a tad more dynamic behavior without the need for a server or becoming centralized again. It would allow a notification service to be written that is truly decentralized! And probably a whole slew of other interesting applications.

As for how I would validate the CID (which you already described): that would indeed be by means of a public/private keypair. The user (SRC) has a private key and encrypts the message with the public key of the target (DST). That message (the encrypted one) is then stored on IPFS as a file and hashed in the IPFS hashing scheme. And this is indeed where I would be stuck, as from this point on there would be no way for other nodes to know what the DST of a CID is. You indeed need a mapping somewhere of DST -> CID. Yeah, I think your idea works better :)
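For what it's worth, the conventional construction here is to sign with the sender's private key and encrypt to the recipient's public key. A standalone Go sketch of that round trip, using only standard-library primitives (ed25519 signing plus an ephemeral X25519 exchange); nothing here is an existing IPFS API:

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/ecdh"
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// roundTrip signs msg with SRC's private key, encrypts it to DST's
// public key, then has DST decrypt and verify.
// Returns (decryptedOK, signatureOK).
func roundTrip(msg []byte) (bool, bool) {
	// SRC's signing keypair; DST's encryption keypair.
	srcPub, srcPriv, _ := ed25519.GenerateKey(rand.Reader)
	dstPriv, _ := ecdh.X25519().GenerateKey(rand.Reader)

	// SRC signs with its own private key...
	sig := ed25519.Sign(srcPriv, msg)

	// ...and encrypts to DST's public key via an ephemeral ECDH exchange.
	ephPriv, _ := ecdh.X25519().GenerateKey(rand.Reader)
	shared, _ := ephPriv.ECDH(dstPriv.PublicKey())
	key := sha256.Sum256(shared)
	block, _ := aes.NewCipher(key[:])
	gcm, _ := cipher.NewGCM(block)
	nonce := make([]byte, gcm.NonceSize())
	rand.Read(nonce)
	ciphertext := gcm.Seal(nil, nonce, msg, nil)

	// DST derives the same key from the ephemeral public key and decrypts.
	shared2, _ := dstPriv.ECDH(ephPriv.PublicKey())
	key2 := sha256.Sum256(shared2)
	block2, _ := aes.NewCipher(key2[:])
	gcm2, _ := cipher.NewGCM(block2)
	plaintext, err := gcm2.Open(nil, nonce, ciphertext, nil)

	// DST then checks SRC's signature over the recovered plaintext.
	return err == nil && bytes.Equal(plaintext, msg),
		ed25519.Verify(srcPub, plaintext, sig)
}

func main() {
	decOK, sigOK := roundTrip([]byte("hello DST"))
	fmt.Println("decrypted ok:", decOK, "signature ok:", sigOK)
}
```

The ciphertext CID still reveals nothing about who DST is, which is exactly the DST -> CID mapping gap noted above.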

MikeFair commented 5 years ago

Hi @markg85,

I'm glad you were able to work through the shortcomings of implementing the idea as described in IPFS. There are several problems with the way it was described. (1) There is no concept of being online/offline, or logging in/logging off. The statement "when the target comes online" doesn't really have meaning: the IPFS infrastructure is always online, peers are always exchanging information, and there is no notion at all of "coming online". The target peer can either be reached right now, or it can't.

(2) IPFS doesn't store "messages" it stores "Persistent Content"; if that Content happens to be a message so be it. Think of IPFS as like a network of library buildings and the CIDs as the books; they just sit there. You have to ask for the one you want, and if it's not in your local library you have to ask the other libraries if they have it (they can forward your request to other libraries they know about to help you find what you're looking for too).

What you described is the equivalent of saying "Next time I walk into any library building I want all the books registered for me from every library building on the planet to fly into my hands". That's just not how library buildings and books think/work; the books have no idea of events happening in the libraries, the libraries have no idea what books are being added/removed, especially in other buildings. And neither the books nor the buildings have any concept whatsoever regarding what the heck "you" are.

The PubSub system is something that can send something like "event signals" (I'm here, I'm gone, I've received, I've sent, I'm looking for, I'm waiting, etc.). It's like the broadcast speaker system wired into the whole network of library buildings. So when you walk into the building you announce "Elvis has entered the building" on a channel/topic where something is listening. By default, nothing is listening to anything; the buildings don't have ears (they just have speakers, so things that can listen could receive the broadcasts).

In this metaphor, the library buildings are the IPFS daemons, and the identifiers for the books are CIDs. It would be wrong to tell every library on the planet that someone reserved a book for you and make them track that information. What would be better is if each library kept its own local registry for any books it has for you, and then that library started listening for an announcement that says you arrived somewhere. It can then announce back "Here are the book ids on hold for you".

What's really important here is that the library doesn't care where you get the book from. You don't have to ask that specific library for a copy of the book. You can get the book from any library, then you tell the library making the announcement that you received the book to get it to stop telling you.

(3) IPFS should be treated as a network of paths that link nodes and data together; not assuming nodes always directly connect to each other to deliver data. What you seem to be missing in the original proposal is that nodes have zero knowledge of other nodes except through discovery and explicit assignment. When they hunt for data they ask their local peers for information.

I think, unintentionally, you didn't quite comprehend that you are asking IPFS to understand events it currently has no awareness of, and certainly not across multiple nodes. Nodes do not know when your local node starts or stops, other nodes do not know if or when you receive data, and they have no idea who is requesting what (they only know what information their neighbor asked for, not why it asked). If you had a message for me, and instead of getting that message from you I received it from your neighbor, you couldn't know; you'd have to be told. PubSub is the mechanism by which you can be told I got the data, even if I didn't get it from you.

So this is a rather long-winded way of saying: use PubSub to coordinate the registration and fulfillment of messages, and IPFS CIDs to store and deliver the messages themselves. :-)

markg85 commented 5 years ago

Hi @MikeFair thanks again for the awesome and clear writeup!

I do see one problem in your analogy that you might just not have mentioned to keep the reply somewhat reasonable in length ;)

In your description "someone" needs to listen for my messages when I'm not there. In your analogy that "someone" is all the nodes. I.e., each library is then listening for messages aimed at DST to keep a list of them. When DST comes online and subscribes to the same channel, it will be flooded by all the nodes that kept its messages. Basically a (pun intended) DDOS! Like you already said a few posts back.

It would be ideal if, say, the MSG for DST were only stored on a bounded number of nodes (say no more than 10) at any time. But then you'd also get into the hassle of tracking who has a MSG for DST, limiting it to a maximum of 10, and re-mirroring it in case one goes offline. Not maintaining this information saves the bookkeeping hassle, but introduces the likelihood of DDOS'ing DST when it comes online.
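Bounding a message to the n "closest" nodes is essentially how Kademlia-style DHTs pick providers already: every node can independently compute the same holder set by XOR distance, with no bookkeeping. A toy Go sketch of that selection (all names hypothetical):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// xorLess reports whether digest a is XOR-closer to target than b is.
func xorLess(a, b, target [32]byte) bool {
	for i := range target {
		da, db := a[i]^target[i], b[i]^target[i]
		if da != db {
			return da < db
		}
	}
	return false
}

// closest returns the n node IDs whose hashes are XOR-closest to the
// message CID's hash. Every node computes the same answer locally, so
// no one has to track who is holding the message.
func closest(cid string, nodes []string, n int) []string {
	target := sha256.Sum256([]byte(cid))
	sorted := append([]string(nil), nodes...) // don't mutate the caller's slice
	sort.Slice(sorted, func(i, j int) bool {
		return xorLess(sha256.Sum256([]byte(sorted[i])),
			sha256.Sum256([]byte(sorted[j])), target)
	})
	if n > len(sorted) {
		n = len(sorted)
	}
	return sorted[:n]
}

func main() {
	nodes := []string{"nodeA", "nodeB", "nodeC", "nodeD", "nodeE"}
	fmt.Println("holders:", closest("QmMsgForDst", nodes, 3))
}
```

When a holder goes offline, the next-closest live node simply falls into the set, which is the "re-mirroring" step mentioned above.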

Last thing. How would DST be identified? Earlier I said with public-private key encryption. Here I obviously intended that to be hassle-free for the user, as the user already gets a private key when starting IPFS. If each user had their own IPFS client this whole question would be moot. But that's just not the case: some portion of the users might be running the client, and some other portion might be accessing IPFS through gateways. That poses an issue, as this logic only works if each DST is effectively a single IPFS client. If a gateway is used, the logic falls apart. The only way I see now to resolve that would be if the user, also on the gateway, proves their identity. Ideally you do not want a user to generate their own public and private key because that is just a hassle and raises the bar quite substantially to even use whatever IPFS app is made with it. It also brings along the issue of moving the keys around for the user if accessing IPFS on different devices.

So many potential issues here!

MikeFair commented 5 years ago

In your description "someone" needs to listen for my messages when I'm not there. In your analogy that "someone" is all the nodes. I.e., each library is then listening for messages aimed at DST to keep a list of them.

Actually, not all of them; in my description it was only one per message: the one the sender registered the message with.

Or like you suggested, some algorithm like the IPFS cluster technique to ensure data is pinned on some number of active/live nodes within the group.

My main point here is that this idea is ill-suited for integration into IPFS itself. It ought to be a set of service agents that use IPFS to do their work.

I was saying that this ought to be third-party client software that manages the registrations and delivery; software that uses IPFS, not a builtin of the ipfs daemon.

These clients could subscribe/listen on the "/IPMQ" channel for "I'd like to register a message" events (in all honesty, I think the topic should be some service CID instead of a human-readable label, but that makes things harder to describe). Using other PubSub channels, they can coordinate how they collectively handle saving the registration request.

As there would potentially be many, many of these active registry clients all over the Internet, for every message some algorithm would determine how the registrations were distributed and tracked amongst them. This would likely become something like a second DHT these agents maintain amongst themselves, but it could be anything that works, really.
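One hypothetical way those registry agents could deterministically divide registrations among themselves, without any central coordinator, is rendezvous (highest-random-weight) hashing. A sketch, with all names illustrative:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// owners returns the n registry agents responsible for a given message
// ID. Each agent scores itself by hashing (agent, msgID) and the highest
// scores win; every agent computes the same ranking locally.
func owners(msgID string, registries []string, n int) []string {
	type scored struct {
		reg   string
		score [32]byte
	}
	s := make([]scored, 0, len(registries))
	for _, r := range registries {
		s = append(s, scored{r, sha256.Sum256([]byte(r + "|" + msgID))})
	}
	sort.Slice(s, func(i, j int) bool {
		// Compare 32-byte scores big-endian; highest score first.
		for k := range s[i].score {
			if s[i].score[k] != s[j].score[k] {
				return s[i].score[k] > s[j].score[k]
			}
		}
		return false
	})
	if n > len(s) {
		n = len(s)
	}
	out := make([]string, n)
	for i := range out {
		out[i] = s[i].reg
	}
	return out
}

func main() {
	regs := []string{"regA", "regB", "regC", "regD"}
	fmt.Println("registrars:", owners("QmSomeMsg", regs, 2))
}
```

A nice property of this scheme is that when one registry disappears, only its own registrations move to the next-highest scorer; everyone else's assignments stay put.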

When DST comes online and subscribes to the same channel, it will be flooded by all the nodes that kept its messages. Basically a (pun intended) DDOS! Like you already said a few posts back.

This is always a concern in P2P systems. DOS attacks can flood a resource with messages it is supposed to process, overwhelm storage resources by requesting to store lots of useless garbage, or bog down processing speed by adding lots of useless little entries that make things harder to find. Basically, any time entity A can request that entity B do work, and it costs entity B more energy/resources to process/handle/serve the request than it costs entity A to make it, you have a D/DOS exposure risk, and you have to figure out how to prevent all the entity As of the world from DOS-attacking the entity Bs.

In this case, storing a message is much more expensive than requesting a message be stored. So there are two attack risks I immediately see.

1) You can flood DST with messages, either by registering millions of them or through an ill-conceived announcement system. The reason the current PubSub system has been nicknamed "FloodSub" is that it will deliver the same message multiple times depending on the routing paths involved.

2) A malicious actor could register millions of messages where DST is a completely fictitious entity and the messages will simply never get delivered, clogging up the registry system.
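The duplicate-delivery issue in 1) is conventionally handled with a seen-message cache in front of the application. A minimal sketch (illustrative only, not an existing kubo API):

```go
package main

import "fmt"

// Deduper remembers recently seen message IDs so FloodSub-style
// duplicate deliveries can be dropped before reaching the application.
type Deduper struct {
	seen    map[string]bool
	order   []string // FIFO eviction order
	maxSize int
}

func NewDeduper(max int) *Deduper {
	return &Deduper{seen: make(map[string]bool), maxSize: max}
}

// FirstSeen reports whether id is new, recording it and evicting the
// oldest remembered ID once the cache is full.
func (d *Deduper) FirstSeen(id string) bool {
	if d.seen[id] {
		return false
	}
	if len(d.order) == d.maxSize {
		oldest := d.order[0]
		d.order = d.order[1:]
		delete(d.seen, oldest)
	}
	d.seen[id] = true
	d.order = append(d.order, id)
	return true
}

func main() {
	d := NewDeduper(1024)
	delivered := 0
	for _, id := range []string{"msg1", "msg1", "msg2", "msg1"} {
		if d.FirstSeen(id) {
			delivered++ // hand to the application exactly once per ID
		}
	}
	fmt.Println("delivered:", delivered) // prints "delivered: 2"
}
```

This only blunts duplicate routing, of course; it does nothing against attack 2), where every message ID really is unique.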

Last thing. How would DST be identified? Earlier I said with public-private key encryption. Here I obviously intended that to be hassle-free for the user, as the user already gets a private key when starting IPFS.

I actually fervently disagree. The IPFS daemon has a keypair to identify itself on the p2p network. This is not the same thing as a userid on IPFS.

It's completely reasonable to suggest that every user gets their own keypair; it's just a completely separate keypair from the ipfs daemon id. IOW, you have to completely abandon any correlation between the DST id and running code. The software will process msgs for many users, and <10% of them will keep their own daemon with them everywhere they go. Think of email: there isn't one email server per user. There are many userids per email server, and all the servers talk to each other to coordinate who has, and is storing, the messages for each user.

Ideally you do not want a user to generate their own public and private key because that is just a hassle and raises the bar quite substantially to even use whatever IPFS app is made with it. It also brings along the issue of moving the keys around for the user if accessing IPFS on different devices.

The user will have to identify themselves somehow, and if they aren't able to manage their own identity they will have to rely on some centralized service to do it for them.

The entire spirit behind "Decentralized IDentities" (DIDs) is specifically to create self-sovereign individual identities.

You could, in theory, use the idea of connecting using a VPN back to some "trusted network" to manage your presence in the network. This way you are trusting the POP wherever you are to help you establish a VPN connection and are not entrusting it with any secrets.

So many potential issues here!

Yep, which is exactly why I was steering you away from the concept that this idea belongs inside the IPFS daemon/protocol itself. It's better conceived as a software service layer that is using IPFS instead. That way it can't "gum up the works" of IPFS itself with things the IPFS design was never intended to solve. :-)

Mike

markg85 commented 5 years ago

My main point here is that this idea is ill-suited for integration into IPFS itself. It ought to be a set of service agents that use IPFS to do their work.

I was saying that this ought to be third-party client software that manages the registrations and delivery; software that uses IPFS, not a builtin of the ipfs daemon.

and

Yep, which is exactly why I was steering you away from the concept that this idea belongs inside the IPFS daemon/protocol itself. It's better conceived as a software service layer that is using IPFS instead. That way it can't "gum up the works" of IPFS itself with things the IPFS design was never intended to solve. :-)

I heavily disagree with that. My whole point is to have a decentralized persistent notification system. With what you suggest it becomes centralized. Even though the backend would be decentralized, the entry points are not.

It is far better suited to be within IPFS in my opinion.

MikeFair commented 5 years ago

It's better conceived as a software service layer that is using IPFS instead. That way it can't "gum up the works" of IPFS itself with things the IPFS design was never intended to solve. :-)

I heavily disagree with that. My whole point is to have a decentralized persistent notification system. With what you suggest it becomes centralized. Even though the backend would be decentralized, the entry points are not.

It is far better suited to be within IPFS in my opinion.

You should look up "OrbitDB"; if you created a scheme where each recipient had a predictable database id, then it pretty much should do exactly what you want.

https://medium.com/coinmonks/orbitdb-a-peer-to-peer-database-for-the-decentralized-web-30bac1d056fe

As for the idea that it belongs inside IPFS, I'll try one more time to explain both why this is a "higher layer" application to run on top of IPFS, and why the idea, as conceived, is a DDOS/SPAM attack engine; then I'll leave it alone:

(A) I don't think you're taking into account how the described system can be abused when you force daemons to store and deliver data from arbitrary and random sources to unsuspecting third parties. IPFS daemons currently don't do this; they are about helping you locate the data you are looking for, not ensuring the data you are looking for is available.

As described, what you would have built is a DDOS/SPAM attack network. Simply generate a bazillion messages with different CIDs to spread them throughout the network and wait for the intended target to log in. Spin up a bunch of web browsers using js-ipfs and have them constantly generate and register more random messages too. This doesn't even need to be a real time attack because the messages get stored up in the queue waiting to be unleashed on the unsuspecting target. The longer they are away, the worse the storm gets when they next log on.

It also effectively anonymizes the attack because the IPFS peers are executing the attack for you and the sources of the messages are completely arbitrary/random. You must not enable IPFS peers to be unwittingly used to store arbitrary data, uploaded by arbitrary people, to be actively delivered to arbitrary targets.

Further, if you simply wanted to bog down a section of the IPFS network (because you don't like some CID content it's hosting), you could just generate millions of messages that should all be registered on that same section of nodes. You also fire up another bazillion browser based javascript nodes to help. To get the nodes in that section to help you with the attack, you ensure that all the targets of these messages are all nodes in that same section of the network, so those nodes are constantly overloaded sending themselves a bunch of useless messages they never asked for, wanted, could use, and can't really ignore.

It's strange to me that you really can't yet see how dangerous it is to enable millions of people to upload content (messages) into a network, that the network administrators will pay to store indefinitely for them, to then be delivered to millions of other people, without the recipient target's request/consent. That's the very definition of an anonymous attack bot network.

Sing it with me "Spam, Spam, Spam, Spam... Spam, Spam, Spam, Spam... It's Spam! Wonderful Spam! ..." :-D

Currently it's really hard to get IPFS to send Spam or execute a DDOS attack against others. You can clog up a PubSub channel if you know the channel topic. But there's currently no way for you to use IPFS to force others to accept your data, and you cannot force them to host/store your data for you. Others have to know what they are asking for from you, and others have to choose to help you (e.g. pinning/caching) if you're not there.

It's a best-effort, self-focused, distributed system. And that's a safety feature, not a bug.

(B) IPFS doesn't store data or files on behalf of others with any kind of guarantees for exactly this reason.

Each user is responsible for taking care of their own persistent content storage and this is a good thing.

IPFS does not upload and pin your data for you. You have to ensure a daemon is online to serve out your published data. The network doesn't do this for you, and this is the "price" you pay for being able to distribute data via IPFS. It will happily "deliver CIDs" for you, but it only helps store content after requesters prove they actually want it, and it doesn't actually promise/guarantee it will help you; it is simply likely to help you, if its local caching space allows, but it makes no promises.

IPFS is centralized to you in this "who provides the long-term persistent storage for you" sense...

This registered messages idea is no different; you ought to be responsible for ensuring the messages you are asking to be stored/published for others are being registered/tracked/delivered.

Technically, a message record is very close to an IPNS record, and you might be interested to know that IPNS records currently use the PubSub system to announce to other interested parties when they have been updated/changed. (I'd like to point out those recipients have explicitly registered to receive them.)

In every case throughout the IPFS system, you only receive information/messages that you've asked for, and no node is ever expected to store any data beyond the routing information of what node you can go get the data from. An IPFS node is not expected to proactively cache or store any information for any other node for any period of time (this extends to include persistent messages for other targets).

(C) This doesn't make the solution I propose centralized. Anyone and everyone can run a message registry node, and registry nodes are free to collaborate on sharing information with each other via the PubSub system (just like IPNS records do) or libp2p like OrbitDB does.

There is nothing that prevents running a message registration node as a plugin to an IPFS daemon to ensure there are lots of message registration nodes out there. But registered messages simply isn't the same problem as the content addressed routing and distribution problem IPFS daemons are solving.

Responding to events like a user coming online requires some kind of pseudo-centralized coordination. What if your computer at home was logged in, then you log in again on your smartphone, and again on your machine at work? What is the registered message system supposed to do with that?

While it seems like a simple concept, user based message delivery isn't simple. Essentially this idea is inventing email over IPFS (think about it).


As described, this registered message idea creates the concept of a registered user, something IPFS currently doesn't have and is quite hard to manage in a truly decentralized way (IPFS doesn't have users).

It creates the idea of nodes needing to know when other nodes are online or not, something IPFS currently doesn't do.

It requires nodes to store/track data on behalf of others regardless of whether or not that data is even valid or interesting to the intended recipients, something IPFS doesn't and shouldn't do.

It enables malicious actors to force delivery of messages the recipients haven't asked for and may not even want, something IPFS should never do.


The current design of IPFS is a very elegant "PULL only" model where the users have to ask for what they want, or in the case of PubSub, actually be online at the time to receive the message. Nodes are transient (no guarantee when they will or won't be online), only responsible for themselves (have no dependency on other nodes), and make local resource consumption choices suited to their own needs (they decide what stays in their cache and what gets flushed).

OrbitDB however creates the concept of a persistent online database and it uses the features of IPFS to do it; it doesn't require changes to the IPFS content routing and message delivery system, it uses them.

If every user had a message database that people could write to (like the event log database), then when the user came back online it could read the entries added to the database since they had left.

Logging on to multiple devices is no longer a problem because the OrbitDB database is persistent. The user can be responsible for ensuring their own message database is online and provide the persistent storage for it (even via third party services)...

IPFS has ideas about charging people for their persistent storage consumption, and the "bitswap" algorithm is about balancing out how much you use from the network with how much you help it.

Higher-level services can create/use the concept of users and be more proactive about following events. They can also deal more proactively with network/feature abuses...

Hope that helps!

Mike