libp2p / specs

Technical specifications for the libp2p networking stack
https://libp2p.io

Decentralized NAT traversal using nodes in the network #307

Open robertkiel opened 3 years ago

robertkiel commented 3 years ago

As invited by @vasco-santos in https://github.com/libp2p/js-libp2p/issues/870, I'm creating a more detailed overview of HoprConnect, an alternate transport module for js-libp2p handling churn and NAT traversal.

Disclaimer: parts of the documentation are taken from our own documentation and therefore slightly HOPR-flavoured.

Rationale

HoprConnect was created in the context of HOPR because js-libp2p-tcp and js-libp2p-webrtc-star either did not support automatic NAT traversal or required external resources such as (external) STUN or (external) TURN servers, and the final NAT traversal required some tweaks on the client software. The idea was to encapsulate most of the logic required to tunnel through consumer routers in a transport module, and to work on higher-level mechanisms such as packet mixing.

Desired properties

Addressing

HoprConnect uses two kinds of addresses:

Socket interfaces

HoprConnect binds to a TCPv4 and a UDPv4 socket. The two ports can be, and are intended to be, the same.

UDPv4 is used exclusively for answering STUN requests, which means that every node using HoprConnect is also a potential STUN server.

TCPv4 is used for everything else.

IPv6 is foreseen but not yet implemented.

Connection setup

Assume that A intends to talk to B and that A knows a few direct addresses of B as well as some indirect addresses, also known as relay addresses.

A first tries to contact B using the direct addresses, which can fail if B is behind a NAT router. If this works, the connection is kept.

Otherwise the node tries to connect to one of the given relays by using the indirect addresses. Once the connection to the relay is established, the node asks the relay to establish a connection to the final destination, B. The relay tries to contact the requested node and answers with OK if successful or FAIL_COULD_NOT_REACH_COUNTERPARTY if not accessible. If the destination could not be reached by the relay, the node tries a different relay and if there is none, the connection attempt is aborted.

Once the relayed connection is established, the node starts exchanging payload data with the destination. At the same time, both nodes, A and B, initiate a WebRTC connection and check whether they can connect directly. If a direct connection is possible, the relayed connection is transparently replaced by a direct connection.

HOPR-Connect architecture
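The connection setup described above could be sketched roughly as follows. This is a minimal illustration, not HoprConnect's actual code: the `Dialer` interface, `RelayAnswer`, and the `connect` function are hypothetical names introduced here, and trying addresses strictly one after another is an assumption. Only the two status values `OK` and `FAIL_COULD_NOT_REACH_COUNTERPARTY` come from the description.

```typescript
// Status values the relay answers with, as named in the description above.
type RelayStatus = "OK" | "FAIL_COULD_NOT_REACH_COUNTERPARTY";

// Hypothetical shape of a relay's answer.
interface RelayAnswer<C> {
  status: RelayStatus;
  connection?: C;
}

// Hypothetical dialing primitives; a real transport would wrap sockets here.
interface Dialer<C> {
  /** Resolves with a connection, or rejects (e.g. the peer is behind a NAT). */
  dialDirect(address: string): Promise<C>;
  /** Connects to a relay and asks it to reach `destination`. */
  dialViaRelay(relayAddress: string, destination: string): Promise<RelayAnswer<C>>;
}

interface DialResult<C> {
  kind: "direct" | "relayed";
  connection: C;
}

async function connect<C>(
  dialer: Dialer<C>,
  destination: string,
  directAddresses: string[],
  relayAddresses: string[]
): Promise<DialResult<C>> {
  // Step 1: try the direct addresses first and keep the first one that works.
  for (const address of directAddresses) {
    try {
      return { kind: "direct", connection: await dialer.dialDirect(address) };
    } catch {
      // Not reachable on this address, try the next one.
    }
  }
  // Step 2: fall back to the relays, one after another.
  for (const relay of relayAddresses) {
    try {
      const answer = await dialer.dialViaRelay(relay, destination);
      if (answer.status === "OK" && answer.connection !== undefined) {
        return { kind: "relayed", connection: answer.connection };
      }
      // FAIL_COULD_NOT_REACH_COUNTERPARTY: try a different relay.
    } catch {
      // The relay itself was unreachable, try a different relay.
    }
  }
  // Step 3: no relay left, abort the connection attempt.
  throw new Error(`could not reach ${destination} directly or via any relay`);
}
```

A caller would treat a `relayed` result as the starting point for the WebRTC upgrade attempt, while a `direct` result needs no further work.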

Reconnects

Reconnects between direct connections such as TCP and WebRTC instances are handled automatically and mostly transparently by the operating system and WebRTC.

For relayed connections, this needs to be handled explicitly because nodes do not get that kind of feedback from the other nodes automatically. More precisely, the node on one end of the relay stays unaware of what happens on the other end unless that information is actively forwarded.

HoprConnect implements this behavior by giving feedback to the sender of a message on whether it has been successfully forwarded. If the message cannot be forwarded, the connection is paused until the node reconnects. Note that the relay does not cache messages; it just tells the sender to stop sending and rejects the message.

The connection stays “half-open” until the node on the other side reconnects and thereby overwrites the existing connection. Once that happens, the relay injects a RECONNECT message into the message stream, notifying the other party about the necessity to restart the encryption layer.
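The relay-side behavior described above (forwarding feedback, no caching, and the injected RECONNECT message) might look like the following sketch. All names here (`RelayedLink`, `Frame`, `attach`, `detach`, `forward`) are hypothetical and only illustrate the idea. The RECONNECT message name is taken from the description, while its encoding as an object with a `kind` field is an assumption.

```typescript
type Party = "A" | "B";

// Frames the relay hands to a party: payload from the counterparty,
// or a RECONNECT notice injected by the relay itself.
type Frame = { kind: "PAYLOAD"; data: string } | { kind: "RECONNECT" };
type Sink = (frame: Frame) => void;

// Feedback given to the sender of each message.
type ForwardResult = "FORWARDED" | "PAUSED";

function other(party: Party): Party {
  return party === "A" ? "B" : "A";
}

class RelayedLink {
  private readonly sinks = new Map<Party, Sink>();
  private readonly seen = new Set<Party>();

  /** Registers a party's connection; a repeated attach counts as a reconnect. */
  attach(party: Party, sink: Sink): void {
    const reconnecting = this.seen.has(party);
    this.seen.add(party);
    this.sinks.set(party, sink); // overwrites the existing connection
    if (reconnecting) {
      // Tell the counterparty that the encryption layer must be restarted.
      this.sinks.get(other(party))?.({ kind: "RECONNECT" });
    }
  }

  /** Marks a party as gone; the link stays "half-open". */
  detach(party: Party): void {
    this.sinks.delete(party);
  }

  /** Forwards a message and reports back whether it could be delivered. */
  forward(from: Party, data: string): ForwardResult {
    const sink = this.sinks.get(other(from));
    if (sink === undefined) {
      return "PAUSED"; // not cached: the message is rejected
    }
    sink({ kind: "PAYLOAD", data });
    return "FORWARDED";
  }
}
```

On a `PAUSED` result the sender would stop sending until it sees traffic from the counterparty again; how that pause is signalled on the wire is not covered by this sketch.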

Once the relayed connection is established, both nodes do exactly the same as when establishing a "normal" connection: they start a WebRTC instance at both ends of the connection, check whether they can connect directly, and transparently switch to a direct WebRTC connection if that is possible.
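One way to picture the transparent switch from a relayed to a direct connection is a thin wrapper that the upper layers keep holding while the channel underneath is swapped. The sketch below is an assumption about how such a wrapper could be shaped; `Channel` and `UpgradableConnection` are hypothetical names, and real code would also have to deal with data still in flight on the relayed path.

```typescript
// Hypothetical minimal channel abstraction shared by relayed and direct paths.
interface Channel {
  send(data: string): void;
  close(): void;
}

class UpgradableConnection {
  private current: Channel;
  private direct = false;

  constructor(relayed: Channel) {
    this.current = relayed;
  }

  /** Upper layers always send through the wrapper, never the raw channel. */
  send(data: string): void {
    this.current.send(data);
  }

  /** Swaps in the direct channel and releases the relayed one. */
  upgrade(directChannel: Channel): void {
    const previous = this.current;
    this.current = directChannel;
    this.direct = true;
    previous.close();
  }

  get isDirect(): boolean {
    return this.direct;
  }
}
```

If the WebRTC attempt fails, `upgrade` is simply never called and the relayed channel stays in use.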


Bootstrapping

Once a node is started, it first tries to detect its own public IPv4 address by using any node in the network to answer its STUN request.
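For readers unfamiliar with STUN, the public-address detection boils down to sending a Binding request and reading the XOR-MAPPED-ADDRESS attribute from the response. The sketch below follows the message layout of RFC 5389 for IPv4 only. It is not HoprConnect's code (which, per the thread, relies on a library), the function names are made up for illustration, and the UDP send and receive part is left out.

```typescript
// Constants from RFC 5389.
const MAGIC_COOKIE = 0x2112a442;
const BINDING_REQUEST = 0x0001;
const BINDING_SUCCESS = 0x0101;
const ATTR_XOR_MAPPED_ADDRESS = 0x0020;

/** Builds a 20-byte Binding request without attributes. */
function buildBindingRequest(transactionId: Uint8Array): Uint8Array {
  if (transactionId.length !== 12) {
    throw new Error("STUN transaction IDs are 12 bytes long");
  }
  const msg = new Uint8Array(20);
  const view = new DataView(msg.buffer);
  view.setUint16(0, BINDING_REQUEST); // message type
  view.setUint16(2, 0); // attribute length: none
  view.setUint32(4, MAGIC_COOKIE);
  msg.set(transactionId, 8);
  return msg;
}

interface PublicAddress {
  ip: string;
  port: number;
}

/** Returns the IPv4 address from XOR-MAPPED-ADDRESS, or undefined if absent. */
function parseBindingResponse(msg: Uint8Array): PublicAddress | undefined {
  if (msg.length < 20) return undefined;
  const view = new DataView(msg.buffer, msg.byteOffset, msg.byteLength);
  if (view.getUint16(0) !== BINDING_SUCCESS) return undefined;
  if (view.getUint32(4) !== MAGIC_COOKIE) return undefined;
  const end = Math.min(msg.length, 20 + view.getUint16(2));
  let offset = 20;
  while (offset + 4 <= end) {
    const type = view.getUint16(offset);
    const length = view.getUint16(offset + 2);
    const value = offset + 4;
    const isIPv4 = value + 8 <= end && view.getUint8(value + 1) === 0x01;
    if (type === ATTR_XOR_MAPPED_ADDRESS && length >= 8 && isIPv4) {
      // Port and address are XORed with the magic cookie.
      const port = view.getUint16(value + 2) ^ (MAGIC_COOKIE >>> 16);
      const addr = (view.getUint32(value + 4) ^ MAGIC_COOKIE) >>> 0;
      const ip = [addr >>> 24, (addr >>> 16) & 0xff, (addr >>> 8) & 0xff, addr & 0xff].join(".");
      return { ip, port };
    }
    // Attribute values are padded to a multiple of 4 bytes.
    offset = value + length + ((4 - (length % 4)) % 4);
  }
  return undefined;
}
```

The transaction ID should be random and should be compared against bytes 8 to 19 of the response before the result is trusted; that check is omitted here to keep the sketch short.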

The following is WIP

Afterwards, it tries to connect to known relay nodes and announce to other nodes behind which nodes it is available.

WIP End

Comparison with other NAT traversal techniques

Potential browser-to-browser extension

The relay code is kept fairly agnostic about where a connection comes from, which means that it can easily accept an HTTP(S) or even a WebSocket (Secure) stream and feed this stream into another stream on the other side of the relay. The missing part here is a browser implementation that establishes a relayed HTTP or WebSocket stream with one of the relay nodes and then transparently replaces it with a direct WebRTC connection if this is possible; otherwise it should keep the relayed connection.

mxinden commented 3 years ago

@robertkiel I am sorry for the delay here. I will follow up later today or tomorrow.

mxinden commented 3 years ago

Thanks for bearing with us and thanks for the detailed post above.

First off, providing a (decentralized) way for nodes behind NATs and firewalls to connect, cross platforms (browser, Node, Golang, Rust, ...) is something we are very much interested in and also working on today. Thus I am happy to see your proposal.

Questions

I have a couple of follow up questions:

Project Flare

Project Flare will allow non-browser to non-browser NAT hole punching on TCP and QUIC. It will use circuit relay v2 to relay the coordination protocol. The task of STUN is done via AutoNAT.

WebRTC

Project Flare as it is designed today won't work in browsers. In a browser one cannot control the TCP or UDP (for QUIC) sockets directly, and thus cannot require port reuse. One also cannot connect directly to endpoints that are not SSL-protected. Requiring all non-browser nodes to offer SSL is hard, to say the least, and even then browser-to-browser won't work.

As far as I can tell, the only way forward to fully support browsers is through WebRTC. WebRTC support for non-JS implementations is in progress, e.g. see go-libp2p-webrtc-direct. In addition, there is a spec proposal for WebRTC signaling.

Proposal for future steps

To deduplicate work and to avoid fragmenting the ecosystem, I think it is very much worth the effort to synchronize any future work. Off the top of my head I see three things:

  1. Settle on a common WebRTC specification, see https://github.com/libp2p/specs/issues/220.

  2. Instead of a custom relay protocol, we should collaborate on circuit relay v2 (specification yet to be written). See also this discussion, which also proposes a shared signaling protocol across transports (WebRTC, TCP, ...).

  3. Merge the AutoNAT and STUN effort, i.e. have AutoNAT support a subpart of the STUN specification to be used by nodes using WebRTC (browser).

robertkiel commented 3 years ago

Hi @mxinden ,

thanks for your reply!

Why use WebRTC?

First of all, it already exists which is a big benefit because hole-punching is already solved and maintained by the Chrom(ium) team. Same for the encryption system DTLS and the RTP implementation. Also the WebRTC signalling seems to follow some specification and the detection whether a direct connection is possible works quite well.

The only remaining issue was to feed the WebRTC instances with the right messages and to transparently handle the TURN fallback in case we cannot connect directly ("WebRTC signalling fails"). This turned out to be quite tricky, especially when considering a decent degree of churn (nodes joining and leaving the network with the same or different IP addresses).

Another interesting point that became clear during the development is that WebRTC can be used from a browser to establish direct connections, hence there exists a potential way to have direct browser-to-browser connection after exchanging signalling messages over a different channel.

Custom relay implementation

HoprConnect indeed uses a custom relay connection. It turned out to be a bit inflexible to use js-libp2p's relay connection as it is too much baked into js-libp2p and therefore a bit tricky to control in order to handle fallbacks and connection upgrades such as relayed connection -> direct WebRTC connection.

On the other hand, handling fallbacks and reconnects and WebRTC signalling messages made it necessary to inject certain status messages and prefixes to properly multiplex messages. But I'm sure that we can merge both efforts.

Project Flare

Sounds very interesting. The only downside is that neither Node.js nor (all modern) browsers support QUIC directly, so HoprConnect is using plain old TCP connections to exchange messages. Nevertheless, Node.js seems to be bringing QUIC support soon; currently it is available behind a compile-time flag.

I also noticed that you are developing a custom NAT hole-punching solution, which I personally find quite challenging since NAT implementations seem to be quite inhomogeneous, which makes testing very hard.

Browser-to-Browser

Connections between two browsers indeed don't work without signalling over a relayed connection to exchange hole-punching information. The way that I see is to use non-browser instances that listen for HTTP(S) streams, and to contact them from the browser using POST requests or WS(S) data connections to exchange data.

AutoNAT and STUN

The reason for embedding STUN is that WebRTC only supports standard STUN and thus requires a STUN server. HoprConnect realizes this by using a library that binds to a UDP socket and answers STUN requests, so STUN is not really part of the protocol.

mxinden commented 3 years ago

HoprConnect indeed uses a custom relay connection. It turned out to be a bit inflexible to use js-libp2p's relay connection as it is too much baked into js-libp2p and therefore a bit tricky to control in order to handle fallbacks and connection upgrades such as relayed connection -> direct WebRTC connection.

I cannot comment on the feasibility of using a shared relay implementation in JS, though I strongly believe that we should at least strive for on-the-wire compatibility both between JS implementations and all others (e.g. Golang, Rust, ...). We will likely have a first specification draft of circuit relay v2 in the upcoming weeks. I would very much appreciate your input on the draft to make sure it suits your implementation as well.

I also noticed that you are developing a custom NAT hole-punching solution, which I personally find quite challenging since NAT implementations seem to be quite inhomogeneous, which makes testing very hard.

Correct. Though the test results we have today are very promising both via QUIC and TCP.

robertkiel commented 3 years ago

Let me summarize a bit:

| Component | HoprConnect | Project Flare |
| --- | --- | --- |
| Node-to-Node base communication | TCP | TCP or QUIC |
| NAT capability detection | STUN | AutoNAT + UPnP |
| Relay protocol | custom | to be specified, see https://github.com/libp2p/go-libp2p-circuit/pull/125/ |
| Hole-punching information exchange protocol | WebRTC + JSON | DCUtr |
| Hole-punching | WebRTC | ? |
| Connection fallback / upgrade handling | custom | ? |
| Encryption layer | DTLS over UDP | QUIC or TCP-Noise-Yamux |

During the implementation of HoprConnect, I noticed that the relay / fallback / upgrade logic can be largely independent of how NAT traversal is done in the end. The same holds for the node-to-node communication and capability detection.

I'd therefore suggest the following:

vyzo commented 3 years ago

This matrix is wildly incorrect; autonat is only used to detect whether you are behind a NAT/firewall or not. It does not do NAT capability detection, hole punching, or holepunching coordination. We have a separate protocol for the coordination, called DCUtr -- see https://github.com/libp2p/specs/pull/173

robertkiel commented 3 years ago

This matrix is wildly incorrect; autonat is only used to detect whether you are behind a NAT/firewall or not. It does not do NAT capability detection, hole punching, or holepunching coordination. We have a separate protocol for the coordination, called DCUtr -- see #173

Good to know. I'm not that much into the libp2p ecosystem. Just updated the table accordingly. Could you name the other mistakes?

@vyzo maybe a DM could reduce some misunderstandings?

mxinden commented 3 years ago

@robertkiel I posted an extended version of your table above in https://github.com/libp2p/specs/issues/312. I would appreciate your input, especially with regard to HoprConnect.

I will draft a long-term vision sometime soon. That should help us deduplicate efforts and improve interoperability.