libp2p / specs

Technical specifications for the libp2p networking stack
https://libp2p.io

libp2p + HTTP #477

Closed MarcoPolo closed 1 year ago

MarcoPolo commented 1 year ago

We've discussed this a lot, and I think it's a good idea for the points listed in the spec. I've written the first draft here along with a PoC implementation in go-libp2p: https://github.com/libp2p/go-libp2p/pull/1874.

This unlocks a lot of cool use cases. For example with this you can put a dumb HTTP cache in front of your libp2p node and easily scale the amount of requests you can handle. You can integrate directly with existing CDN infrastructure (what if the provider record said something like /dns4/my-bucket.r2.dev/tcp/443/tls/http and you seamlessly fetched it with your existing libp2p request/response protocol).
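As a non-normative sketch (the exact multiaddr-to-URL mapping below is my assumption, not something the spec has pinned down), a client could translate such an address into a plain HTTPS request with nothing but go-multiaddr and the standard library:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"

	ma "github.com/multiformats/go-multiaddr"
)

// multiaddrToURL maps a /dns4/<host>/tcp/<port>/tls/http multiaddr to an
// https:// base URL. This mapping is a guess at what the spec will say,
// not the normative rule.
func multiaddrToURL(addr ma.Multiaddr) (string, error) {
	host, err := addr.ValueForProtocol(ma.P_DNS4)
	if err != nil {
		return "", err
	}
	port, err := addr.ValueForProtocol(ma.P_TCP)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("https://%s:%s", host, port), nil
}

func main() {
	addr, err := ma.NewMultiaddr("/dns4/my-bucket.r2.dev/tcp/443/tls/http")
	if err != nil {
		log.Fatal(err)
	}
	base, err := multiaddrToURL(addr)
	if err != nil {
		log.Fatal(err)
	}
	// Fetch the blob with a plain HTTPS GET, just like curl would.
	resp, err := http.Get(base + "/path/to/blob")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, len(body))
}
```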

Note: This came out of discussions that happened around IPFS Camp. This work is currently not on the roadmap, so it will be prioritized below roadmap items unless we (as a community) decide otherwise.

MarcoPolo commented 1 year ago

I also wanted to make sure you had seen this document from @aschmahmann .

Yup! But this isn't public, so I didn't link to it.

aschmahmann commented 1 year ago

Thanks for the link ping @BigLep, and thanks @MarcoPolo for pushing this forward. As anyone who looks at that document will see, it describes some of the options, alternatives, and tradeoffs here that are probably worth fleshing out.

Some of the notable ones are:

  1. What to do about parts of HTTP that are problematic in very popular environments (e.g. browsers). You can see this problem in action in https://github.com/ipfs/specs/pull/332 for more context. Maybe the answer is to just ignore the problem, but that could cause its own issues as well; e.g. if I really wanted HTTP Trailers support in browsers I could do HTTP-over-WebTransport instead of using the browser-native HTTP, but I'd need to let the libp2p implementation know that.
  2. What to do about PeerIDs (client and server)? In a lot of places where users may be considering HTTP usage, they won't have control over the certificate generation to add custom libp2p components.

aschmahmann commented 1 year ago

@MarcoPolo try here. Notion has a crazy system where even if data is already public it needs to have a special link.

MarcoPolo commented 1 year ago
  1. What to do about parts of HTTP that are problematic in very popular environments (e.g. browsers).

I don't think this is a libp2p problem. This is a general problem with HTTP support in browsers. I would consider this out of scope for this spec. If it's really needed, maybe the API should allow users to specify only_over_streams=true, but that seems like a hack, and it would be better if how HTTP gets transmitted stayed separate from the HTTP protocol itself.

  1. What to do about PeerIDs (client and server)? In a lot of places where users may be considering HTTP usage, they won't have control over the certificate generation to add custom libp2p components.

I've thought a lot about this but purposely didn't include it in this spec to keep this one small and add extensions later. Now I'm thinking I should add a discussion section about this as future work. What I'm thinking is:

  1. If you can have a custom certificate with the libp2p extension, that's ideal, since we already have a spec for that and it just works today. In practice this means you can't use auto-generated certs from services like GitHub Pages, S3, etc. But you can run an nginx cache in front of some static content. And, if you do get a signed cert with this extension, then you can even use this on CloudFront in front of S3. The point being, we should support this use case because it already works.
  2. For the other cases where you don't have control over the certificate generation OR you can't even get access to the certificate (browsers), we can authenticate in a different way. One idea is to add a Noise handshake, just like we do in WebTransport. This would allow browser clients to authenticate the peer ID of the server, and servers to authenticate the peer ID of the client, without a custom TLS certificate. This does require an endpoint that can execute the handshake, so it wouldn't work in the GitHub Pages example, but it would work in the CloudFront+S3 example, where you would redirect this endpoint to a Lambda that does the handshake.

    a. Another idea here is that the URL contains a subdomain that is a signature of the rest of the domain (the stuff to the right of the subdomain) signed by the peer ID. For example, aaabbbccc.marcopolo.io would be my URL, where aaabbbccc is the signature of marcopolo.io signed by my peer ID's private key (sketched after this list). If clients see that the TLS cert is signed by a valid root CA and the signature is correct for a given peer ID, then they can assume that the owner of the peer ID key is the same as the owner of the domain name. This enables the GitHub Pages use case, and the client can still authenticate the peer. The server will not be able to authenticate the peer in this case, but this case is only for static content servers, so they don't care about authenticating a peer anyway. These are slightly different security properties from what we normally get, but I don't think they're worse than the standard chain-of-trust model on the web today.

  3. Step 2 still requires some way to define Request Response protocols, and that's the main goal of this PR.
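To make idea 2a above concrete, here is a hypothetical sketch. The base32 encoding and signing exactly the parent domain are my assumptions, not anything specified; note too that an encoded 64-byte ed25519 signature is longer than the 63-character DNS label limit, so a real scheme would have to split or compress it:

```go
package main

import (
	"crypto/ed25519"
	"encoding/base32"
	"fmt"
	"strings"
)

// DNS labels are case-insensitive, so use unpadded base32 and lowercase it.
var enc = base32.StdEncoding.WithPadding(base32.NoPadding)

// makeSignedSubdomain signs the parent domain with the peer's private key
// and returns "<sig>.<domain>".
func makeSignedSubdomain(priv ed25519.PrivateKey, domain string) string {
	sig := ed25519.Sign(priv, []byte(domain))
	return strings.ToLower(enc.EncodeToString(sig)) + "." + domain
}

// verifySignedSubdomain checks that the left-most label of host is a valid
// signature over everything to its right, for the given peer public key.
func verifySignedSubdomain(pub ed25519.PublicKey, host string) bool {
	label, domain, ok := strings.Cut(host, ".")
	if !ok {
		return false
	}
	sig, err := enc.DecodeString(strings.ToUpper(label))
	if err != nil {
		return false
	}
	return ed25519.Verify(pub, []byte(domain), sig)
}

func main() {
	pub, priv, _ := ed25519.GenerateKey(nil)
	host := makeSignedSubdomain(priv, "marcopolo.io")
	fmt.Println(host)
	fmt.Println("valid:", verifySignedSubdomain(pub, host))
}
```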

There's definitely more to expand on above, but I hope that's a good overview of what this spec enables, other use-cases we could enable with future specs, and why this is the first step.

hsanjuan commented 1 year ago

I'm a bit confused about this doc. I think it may make sense to people who have discussed it in person, but I'm missing something.

What loses me is the part about automatically choosing whether to send your HTTP request normally or to tunnel it via an existing stream. I can't make sense of that. If the server is an HTTP server, why would a client ever bother tunneling? If the server is a libp2p server, why would it bother exposing over HTTP? If the server needs to serve both regular HTTP clients and libp2p-enabled ones, why would it need to involve libp2p for the plain-HTTP part? Client+server certificates to identify peers over HTTP, plus Noise handshakes? How do load balancing, caching, and all the HTTP goodies work then?

So I'm pretty lost. I think it might be good to have some sort of HTTP transport for libp2p (and before that, request/response semantics, I guess), but this is somewhat unrelated to what go-libp2p-http already does. And I'm not even 100% sure it makes sense, given WebSockets is HTTP already, etc.

At least I would like to hear ideas beyond the echo example.

MarcoPolo commented 1 year ago

@hsanjuan thanks for taking a look and for the great questions. I'll incorporate more details in the doc, but let me give a couple of high-level answers:

I think maybe part of the confusion is that when folks say "HTTP" they usually mean "the HTTP protocol on top of a TCP (+ TLS) connection". In this spec, when I say "HTTP" I mean only the HTTP request/response protocol. That protocol can run either on top of a libp2p stream (what go-libp2p-http does) or on top of a plain old TCP+TLS connection (standard HTTPS traffic).

There is talk about having request/response constructs in libp2p that could easily use http as transport (Adin)

This is one of the main points of this spec. Instead of coming up with a request/response protocol that we can then easily put on top of HTTP, we should just use HTTP. There's no benefit of reinventing the wheel here.

There is talk about tunneling http under libp2p (what go-libp2p-http does)

Yes. I think this is good, and this spec is to standardize what go-libp2p-http does.
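For reference, here is roughly how go-libp2p-http is used today, following the pattern from its README; the libp2p://<peerID> URL is a placeholder, and the client must already know how to reach that peer:

```go
package main

import (
	"io"
	"log"
	"net/http"

	"github.com/libp2p/go-libp2p"
	p2phttp "github.com/libp2p/go-libp2p-http"
)

func main() {
	// A regular libp2p host acting as the HTTP client.
	clientHost, err := libp2p.New()
	if err != nil {
		log.Fatal(err)
	}

	// Register a RoundTripper that dials libp2p:// URLs over libp2p streams.
	tr := &http.Transport{}
	tr.RegisterProtocol("libp2p", p2phttp.NewTransport(clientHost))
	client := &http.Client{Transport: tr}

	// "<peerID>" is a placeholder; the client must already know how to
	// reach this peer (peerstore entry, DHT lookup, etc.).
	res, err := client.Get("libp2p://<peerID>/path/to/content")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	body, _ := io.ReadAll(res.Body)
	log.Println(res.Status, string(body))
}
```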

If the server is an HTTP server, why would a client ever bother tunneling?

It wouldn't. If you connect to a node whose only multiaddr is /dns4/example.com/tcp/443/tls/http and start a request/response protocol, there is going to be no tunneling. libp2p will make an HTTPS request just like curl or Go's *http.Client would (it will include the client certificate as per the libp2p TLS spec, so the server, if it wanted to, could authenticate the peer).
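As a rough sketch of that client side: generating a certificate that carries the libp2p x509 extension is out of scope here, so assume cert.pem and key.pem already exist, and the URL path is a placeholder. The rest is ordinary net/http with a client certificate attached:

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

func main() {
	// cert.pem/key.pem are assumed to exist and (ideally) carry the
	// libp2p x509 extension; generating them is out of scope here.
	cert, err := tls.LoadX509KeyPair("cert.pem", "key.pem")
	if err != nil {
		log.Fatal(err)
	}
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				Certificates: []tls.Certificate{cert},
			},
		},
	}
	// A plain HTTPS request; the server may inspect the client
	// certificate to authenticate the peer, or simply ignore it.
	resp, err := client.Get("https://example.com/my-protocol/1.0.0")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println(resp.Status)
}
```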

If the server is a libp2p server, why would it bother exposing over HTTP?

It doesn't have to expose an HTTPS server (HTTP on plain TCP+TLS). It could support HTTP only on top of libp2p streams (just like go-libp2p-http). The reason to expose HTTP on top of libp2p streams is that we would be using HTTP as our request/response protocol.

If the server needs to serve both regular HTTP clients and libp2p-enabled ones, why would it need to involve libp2p for the plain-HTTP part?

It doesn't have to, but it may be convenient since the logic is already there.

Here's an example: Assume I have a simple protocol called multihash-bucket. When someone makes an HTTP GET request to /multihash-bucket/1.0.0/<multihash>, the server returns a 404 if we don't have data that hashes to that multihash, or the matching blob of data with status code 200. I can define my handlers such that this protocol works on both libp2p streams and a plain TCP+TLS connection, both using HTTP, without me doing anything extra besides using libp2p (sketched below). Later, I want to add some HTTP caching, so I put an nginx cache in front of my server that caches my HTTP responses from the plain TCP+TLS connection, and that just works. Nginx of course can't authenticate the peer, but that's fine, since I don't care about authenticating the client in this case. The client can still authenticate the server, because nginx would use the same TLS certificate (or one with the libp2p x509 extension signed by the same peer id).
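A hedged sketch of what that could look like in Go: the multihash-bucket handler, in-memory blob store, and port are illustrative, and the libp2p-stream side uses the go-libp2p-gostream pattern that go-libp2p-http builds on:

```go
package main

import (
	"log"
	"net/http"
	"strings"

	"github.com/libp2p/go-libp2p"
	gostream "github.com/libp2p/go-libp2p-gostream"
	"github.com/libp2p/go-libp2p/core/protocol"
)

// blobs maps a multihash (string form) to the data that hashes to it.
var blobs = map[string][]byte{}

const prefix = "/multihash-bucket/1.0.0/"

func handler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc(prefix, func(w http.ResponseWriter, r *http.Request) {
		mh := strings.TrimPrefix(r.URL.Path, prefix)
		data, ok := blobs[mh]
		if !ok {
			http.NotFound(w, r) // 404: no data hashing to this multihash
			return
		}
		w.Write(data) // 200 with the blob
	})
	return mux
}

func main() {
	h, err := libp2p.New()
	if err != nil {
		log.Fatal(err)
	}
	// Serve the handler over libp2p streams...
	l, err := gostream.Listen(h, protocol.ID("/multihash-bucket/1.0.0"))
	if err != nil {
		log.Fatal(err)
	}
	go http.Serve(l, handler())
	// ...and the very same handler over a plain TCP socket
	// (terminate TLS here or in an nginx cache in front).
	log.Fatal(http.ListenAndServe(":8080", handler()))
}
```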

Maybe later, I want to stop serving data blobs over libp2p streams because I can't cache those as well. So I change my multihash-bucket protocol to support redirects to a different multiaddr owned by the same peer. Then I can redirect my libp2p traffic to utilize my nginx cache. If I want to add an HTTP load balancer, this works the same as the cache. The load balancer just needs to present the TLS certificate with the libp2p extension to the client. The load balancer could forward the client certificate to the backend server if the backend server needs to authenticate the peer.

Now why did I even bother with the libp2p part in the first place? Wouldn't it have been better to just use an HTTP server and skip libp2p entirely?

You could do this, but you lose out on a couple benefits:

  1. A protocol like this could become one with first-class support, like GossipSub, Bitswap, or Kademlia, so you would get it for free with libp2p.
  2. The client wants to use this code as well and they aren't a server. They may also want to get blobs from other peers that also aren't servers. They wouldn't be able to do this with a plain HTTPS connection. libp2p is a peer-to-peer library after all!
  3. You could publish to the DHT or indexers, and clients can parse this and know how to reach you because we've standardized what the HTTP transport and multiaddr look like and how it works (this spec).

So I'm pretty lost. I think it might be good to have some sort of HTTP transport for libp2p (and before that, request/response semantics, I guess)

This spec is partly to define that request/response semantics part (just use HTTP). Then you get the HTTP transport (which will only work for request/response protocols) for free-ish.

And I'm not even 100% sure it makes sense, given WebSockets is HTTP already, etc.

This doesn't conflict with websockets. Sometimes you want a request/response protocol rather than a stream based one.


Here are some interesting use cases for libp2p + HTTP (longer term):

  1. An IPFS client will pin data by saving it to S3 and use an AWS Lambda to republish it to the DHT every 24 hours (replace with your favorite cloud provider's offering).

  2. A filecoin storage provider delivers content to users by leveraging various CDNs. For example, if some piece of content is popular they could leverage whatever CDN provider is cheapest at the time and tell the indexer the content is now there. This lets the storage provider conserve their bandwidth and still deliver the content to users.

  3. Some new data transfer protocol that uses request/response is load balanced across N servers using a standard HTTP load balancer.

  4. Some new data transfer protocol that uses request/response saves 2 round trips by being able to use HTTP over libp2p streams (compared to a standard HTTPS connection) if we happen to already be connected to this peer (via gossipsub, kademlia, or mdns for example).


This spec is the first part of this long-term goal. Right now we don't have a standardized request/response abstraction in libp2p. Folks have been using go-libp2p-http (including myself, when I worked on the indexer), and I think that's great. The spec defines HTTP as the request/response protocol for libp2p. Then it defines an HTTPS transport which only works for request/response protocols, and, hey, that's just HTTP, so it looks just like a normal HTTPS request/response.

The next step is to figure out how to support servers that can't have a custom TLS certificate with the libp2p x509 extension. That's where the Noise handshake could come in. But that's off-topic for the core of this spec (though I'm happy to discuss it).

Phew! that was more than I intended to write. I hope that helps clarify things.

hsanjuan commented 1 year ago

Thanks for taking the time to explain!

go-libp2p-http has been using URLs like libp2p://<peerID>/path/to/content. Is this to be kept, so that if I have a <peerID> with an address like /dns4/example.com/tcp/443/tls/http it will speak plain HTTP to it? This requires clients to understand those URLs, though; it also assumes a single identity (if we load balance, would the LB present a single identity or multiple?).

Should a client instead use https://example.com/path/to/content, resolve example.com via DNSLink to find peer IDs, and then act accordingly? (I.e. transparently upgrade to the libp2p stack, while keeping URLs in a form that anyone understands.)

I think the case of having nginx as middleware needs to be polished. What type of SSL termination can we do (it sounds like not full termination?), and what identity does nginx present if the client needs to verify identity while nginx is load-balancing across multiple servers? To take care of caching etc., nginx should be operating at layer 7, but that may mean nginx needs to proxy between two encrypted sides (one to the client, one to the server(s)), which will carry a performance penalty. Also, instead of nginx in particular, which is very configurable, we should think about the AWS Application Load Balancer. How can we make an ALB work with this?

Finally, I would like to bring up proxy-protocol support (https://github.com/libp2p/go-libp2p/issues/1065). I was reminded of it because this idea overlaps in its intention to help operators, enable options for load balancing, and make libp2p cross-compatible with standard tooling, but it is not really what we asked for from the infra side long ago. I'm not quite sure that an HTTP transport is a good answer for what we need, to be honest, even though I like it and see its merits. In that sense it is a bit sad that our proxy-protocol support request has effectively been abandoned. Should we be lobbying for it more? (We know proxy protocol is horrible etc., but still.)

MarcoPolo commented 1 year ago

Closing this in favor of focusing on #508