Networking difficulties while pinning data

aschmahmann commented 4 years ago

Problem:

The current API has the client inform the pinning service of the CID of the data to pin. While this may be convenient if the data is already in the network, it has downsides if the client is the only one with the data, including:

- If the client node is unreachable (e.g. behind a symmetric NAT) and is hoping to use a pinning service to make its content publicly accessible, it will not be able to get its data to the pinning service, since the pinning service cannot reach it.
- Even if the client is reachable, the pinning service still needs to wait for a DHT provide to complete before it can start retrieving the data. This may not be a huge problem, but it is definitely annoying.
  - As a bonus problem, if the client node dies in the middle of uploading a large amount of data, then the pinning service will have to wait for a large number of CIDs to be provided, not just the CID of the pin object root.

Comparing with Other Solutions

1. Instead of just sending the pin object CID, send the entire pin object
- Pros:
  - Pretty easy to implement
- Cons:
  - Does not allow for reduced bandwidth usage in the event that some part of the DAG is already stored by the pinning service
  - Does not allow for resuming cancelled uploads (very possible during the upload of large data)
2. Have the client nodes peer with some upload nodes from the pinning service before they send the query so that they will get pinged by Bitswap and not be dependent on a DHT lookup
- Pros:
  - Requires minimal additional code in go-ipfs (js-ipfs doesn't have peering implemented yet)
- Cons:
  - Requires adding both an HTTP endpoint and a libp2p upload endpoint
  - libp2p upload endpoints cannot AFAIK make use of CA certificates, which means needing to have a consistent set of peerIDs that are used by the upload endpoints, or relying on DNSLink, which isn't signed
  - AFAIK we can't really load balance (inbound) pinning requests (aside from having multiple target nodes and just choosing one of them)
  - Some brittleness/complexity related to when connections break (as they sometimes do)
    - what happens if the peering connection is temporarily broken when the HTTP request goes out?
    - what happens if the peering connection breaks during the upload?
    - for these cases, when the connection is re-established will they still be in the session, and when/how will they be re-added?
3. Use the proposed HTTP API, but do so over libp2p
- i.e. instead of sending a standard HTTPS request to pinning.service, form a libp2p connection to /pinning/service and send the HTTP requests over that connection
- Pros:
  - We seem to have libraries for doing this already in Go that are actually pretty small (https://github.com/libp2p/go-libp2p-http which relies on https://github.com/libp2p/go-libp2p-gostream)
  - Makes it simpler for us to switch to a custom libp2p protocol in the future since we can just figure out which protocols it speaks (e.g. custom, or just http)
  - Only needs a libp2p endpoint, not also a standard HTTP endpoint
  - Gives us client side auth for free, if we want to use it, since we can just check the peerID on the client side of the connection
- Cons:
  - Adds another library dependency to the protocol (may not be available in all languages)
  - Similar brittleness/complexity related to when connections break
    - A little less, since it's guaranteed that the libp2p connection exists at the time the HTTP request is issued
  - libp2p endpoint CA issues as in 2
  - No load balancing on Puts (as in 2) or Gets

Any of these solutions seems viable, and I'm interested to hear if there are any other proposals out there that I've missed. However, I'm pretty sure we need to do at least one of these things or we're going to have really serious problems with users failing to upload data to pinning services.

It seems like people are not fans of option 1, which leaves us with 2 and 3. I'm not sure if they're really that different from each other, although I'm currently leaning towards option 3 as it's much less hacky and gives us some other nice benefits.

Thoughts?

lidel commented 4 years ago

While I agree "remote pinning over libp2p" is the most elegant thing, and we will most likely have something like that in the future, I don't believe "http over libp2p" is feasible for the mvp at hand:

- Dependence on the libp2p stack limits the API's usefulness to software that can bundle libp2p and run a p2p node. For many clients, that level of complexity will be a reason for not supporting this API.
- As you noted, we only have relevant libraries for Go, and go-ipfs still marks the entire feature as experimental. That introduces unnecessary challenges for services that would like to implement this in other languages in time for the Filecoin launch.
- It also does not solve the problem for web-based clients, unless the app is running a js-libp2p node and we re-implement HTTP over libp2p in JS. Same concern about sabotaging adoption and devexp.

Vanilla HTTP API is a hard requirement for the time being. Without it, we won't see community/partner adoption.

Q: Can we solve the problem with HTTP alone?

I'd like us to look into ways we can improve content routing while keeping the HTTP API. I believe we implemented (1) in #14 already (the entire Pin object is now sent to the pinning service).

Q: would simple peerid/multiaddr hints be enough?

What if:

- client sends its own peerid/multiaddrs in Pin.meta[provider] so the Pinning Service can try connecting to a known provider in parallel to asking the DHT
- pinning service returns peerid/multiaddrs in PinStatus.meta[receiver] so the client can try connecting to a designated receiver node?

Sending and acting on these hints would be optional, but pinning from apps with known peerid such as go-ipfs could leverage those hints to ensure peering is in place and data transfer starts immediately.
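
To make the two hints concrete, here is a rough sketch of what the payloads might look like. Only `Pin.meta[provider]` and `PinStatus.meta[receiver]` come from the proposal above; the `cid` field name, the choice to store each hint as a JSON array, the helper names, and the CID and multiaddr values are placeholders assumed purely for illustration.

```python
import json


def build_pin_request(cid, provider_addrs=None):
    """Build a Pin object whose meta optionally carries provider hints.

    The "cid" key and the array encoding of the hint are assumptions.
    """
    pin = {"cid": cid, "meta": {}}
    if provider_addrs:
        # Optional hint: where the pinning service can try to reach the client.
        pin["meta"]["provider"] = list(provider_addrs)
    return pin


def receiver_hints(pin_status):
    """Return the receiver multiaddrs from a PinStatus, or [] if none were sent."""
    meta = pin_status.get("meta") or {}
    return list(meta.get("receiver") or [])


if __name__ == "__main__":
    # Placeholder values, not real identifiers.
    request = build_pin_request(
        "bafy-example-cid",
        ["/ip4/203.0.113.7/tcp/4001/p2p/QmClientPeerIdPlaceholder"],
    )
    print(json.dumps(request, indent=2))
```

Both hints stay optional in this sketch: a client that omits `provider_addrs` sends an empty `meta`, and a status without `receiver` simply yields an empty list.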

@aschmahmann Would this be good enough for support in go-ipfs?

aschmahmann commented 4 years ago

> I believe we implemented (1) in #14 already

Not quite since we don't send the entire DAG, just the top level pointers.

> client sends own peerid/multiaddrs in Pin.meta[provider] so Pinning Service can try connecting to a known provider in parallel to asking DHT

Doesn't really help if the client is undialable, which is the case I'm concerned about.

> pinning service returns peerid/multiaddrs in PinStatus.meta[receiver] so client can try connecting to a designated receiver node?

This will work and is basically an easier to deal with version of option 2 👍 (e.g. the pinning services don't have to keep their PeerIDs long term). Given that the vanilla HTTP API is mandatory for the time being, this seems like our only real option and should work reasonably well.
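
A minimal sketch of the client side of that flow, assuming the receiver hint arrives as a list of multiaddr strings under `meta["receiver"]`. The `dial` callable is a hypothetical stand-in for whatever the client uses to open an outbound connection (it is not a real go-ipfs or libp2p API) and is assumed to raise `ConnectionError` on failure. The point is that an undialable client only ever dials out, so NAT on its side does not matter.

```python
def connect_to_receivers(pin_status, dial):
    """Dial every receiver hint and return the addresses that connected.

    `dial` is a caller-supplied function (hypothetical) that takes one
    multiaddr string and raises ConnectionError if it cannot connect.
    """
    hints = (pin_status.get("meta") or {}).get("receiver") or []
    connected = []
    for addr in hints:
        try:
            dial(addr)  # outbound connection, so the client need not be dialable
            connected.append(addr)
        except ConnectionError:
            # A dead hint is not fatal; the service can still fall back to the DHT.
            continue
    return connected
```

Because the hints come back with each response, the service is free to hand out different receiver nodes over time.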

lidel commented 4 years ago

Great. We need a mini spec for those hints. Would an array of string multiaddrs be enough?

aschmahmann commented 4 years ago

Yep, that should be fine
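
For illustration, a hint encoded as an array of string multiaddrs could be checked roughly like this. This is only a superficial shape check under stated assumptions, not a real multiaddr parser: the function names are made up, and the peer ID lookup assumes the address ends in a `/p2p/<peerid>` (or legacy `/ipfs/<peerid>`) component.

```python
def parse_hints(value):
    """Return value if it is a list of multiaddr-looking strings, else raise ValueError."""
    if not isinstance(value, list):
        raise ValueError("hints must be an array")
    for addr in value:
        # String-form multiaddrs start with "/", e.g. /ip4/203.0.113.7/tcp/4001
        if not isinstance(addr, str) or not addr.startswith("/"):
            raise ValueError(f"not a multiaddr string: {addr!r}")
    return value


def peer_id_of(addr):
    """Return the trailing peer ID of a multiaddr string, or None if there is none."""
    parts = addr.split("/")
    if len(parts) >= 3 and parts[-2] in ("p2p", "ipfs"):
        return parts[-1]
    return None
```

A real implementation would hand each string to a proper multiaddr library; the sketch only shows that plain strings are enough to carry both the address and the peer ID.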

lidel commented 4 years ago

ipfs / pinning-services-api-spec