AutoNAT should correlate dialback results with actual incoming connections

raulk commented 5 years ago

Right now it's pretty trivial to lie to an AutoNAT client by reporting incorrect dialback results. We should register a Notifee and track incoming connections when a dialback is requested, so we can correlate an OK result with an actual observed incoming connection. This makes it more difficult for enemies to confuse us.

raulk commented 5 years ago

Risk of not doing this: an attacker could lead us to believe we are public when we aren't, therefore jeopardising inbound connectivity.

vyzo commented 5 years ago

this is rather complicated...

raulk commented 5 years ago

@vyzo care to elaborate?

raulk commented 5 years ago

My understanding is that this is easy to achieve. You register a Notifee that ignores all events unless you're undergoing AutoNAT determination. At that point, you track all incoming connections, and when the peer has responded positively or negatively, you check to see if their answer is coherent with what you observed:

- If they respond negatively, you should have received no inbound connection from that peer ID.
- If they respond positively, you should have.

vyzo commented 5 years ago

It's not that simple. The AutoNATService peer uses a background host to dial back, so the peer ID is unknown. The best we can do is correlate the IP address, but that's error prone and very ugly to program. I am inclined to mark this as WONTFIX.

raulk commented 5 years ago

Ah, understood the complexity now. I was sure I was missing something. I agree the IP address is an unreliable heuristic. I wonder if we can have the server open a stream and sign a message with its real identity, so that the client can do the matching. I really think we need to solve this one way or another.

aarshkshah1992 commented 4 years ago

@raulk @yusefnapora

Can we solve this by having the AutoNAT server send the dial response as a signed peer record, where the public key is the one from which the peer ID of the host we asked to dial back was derived?

aarshkshah1992 commented 4 years ago

The above solution isn't a solution to this problem. The problem we want to solve is:

"We want to be absolutely sure that the AutoNAT server did indeed dial us before sending us a dial response & isn't just faking it"

raulk commented 4 years ago

One simple way to solve what @vyzo pointed out is for the requesting peer to send a nonce in the request, and have the responding peer return a certificate of its dialback host’s identity. It would return the peer ID, public key, and a signature of pubkey || nonce. This is simple to implement and almost stateless. We’d need to hook in a connection notifee, and every time we request a dialback, we enable tracking of inbound connections, then check, when the peer responds, whether we did in fact receive the connection they claim to have made.

This makes the system more Byzantine Fault Tolerant. If we don’t implement this, a DHT client could be trivially misled into thinking it’s diallable, and would attempt to join the DHT as a server.

aarshkshah1992 commented 4 years ago

Note:

Even after we finish this, an AutoNAT server can still falsely tell a client that its NAT status is private.

Stebalien commented 4 years ago

Can't an attacker just tell us the wrong addresses? This may help, a little, in some cases, but I want to make sure it's worth the extra complexity.

Also note: forcing the dial to complete means we can't optimize the dial later. In an ideal world, the AutoNAT server would just (with TLS/QUIC):

1. Open a connection.
2. Perform the first half of the handshake where the receiver (AutoNAT client) authenticates.
3. Drop the connection.

This saves the AutoNAT server from having to do any fancy crypto beyond computing the initial DH params, making this service significantly more efficient.

> This is simple to implement and almost stateless. We’d need to hook in a connection notifee, and every time we request a dialback, we enable tracking of inbound connections, then check, when the peer responds, whether we did in fact receive the connection they claim to have made.

It's a little trickier than that.

- We may learn about the dial completing after the AutoNAT server has finished sending their response.
- Technically, we may never learn about the dial completing because the AutoNAT server may learn about it first, then kill the connection before we see the dial complete.

If we do go with this, I'd like to avoid unnecessary crypto. Instead of a per-request nonce, we should just let the AutoNAT server sign their main key with their dialer/testing key once up-front.

aarshkshah1992 commented 4 years ago

> Can't an attacker just tell us the wrong addresses?

It can, but it would also require the attacker to do some proof of work in the form of signing the nonce, so the attack isn't free. We should also validate that the returned address is among the ones we asked it to probe. I don't think we do that right now.
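The address validation mentioned above could look roughly like this. It is only a sketch: addresses are compared as plain strings, whereas a real client would compare multiaddrs, and `AddrRequested` is a hypothetical helper name.

```go
package main

import "fmt"

// AddrRequested reports whether the address a server claims to have dialled
// is one of the candidates we asked it to probe.
func AddrRequested(requested []string, returned string) bool {
	for _, a := range requested {
		if a == returned {
			return true
		}
	}
	return false
}

func main() {
	asked := []string{
		"/ip4/203.0.113.7/tcp/4001",
		"/ip4/203.0.113.7/udp/4001/quic",
	}
	fmt.Println(AddrRequested(asked, "/ip4/203.0.113.7/tcp/4001"))  // one of ours
	fmt.Println(AddrRequested(asked, "/ip4/198.51.100.9/tcp/4001")) // never requested
}
```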

> If we do go with this, I'd like to avoid unnecessary crypto. Instead of a per-request nonce, we should just let the AutoNAT server sign their main key with their dialer/testing key once up-front.

So, if we don't have the dialer ID for an AutoNAT server, we should ask the server for a certificate and then send the dial request? We would still have to match the dialer ID with incoming connections and face the races that you mention.

I agree with everything else.

aarshkshah1992 commented 4 years ago

@Stebalien

Also, note that there are ways to solve the races that you mention.

> It's a little trickier than that.
>
> We may learn about the dial completing after the AutoNAT server has finished sending their response. Technically, we may never learn about the dial completing because the AutoNAT server may learn about it first, then kill the connection before we see the dial complete.

We could modify the protocol to roughly do something like:

1. The Client connects to the Server and asks for the identity certificate.
2. The Server sends a signed identity certificate, so we can start tracking the dialer.
3. The Client asks the Server to go ahead with the dial.
4. When the Client receives the inbound dial, it sends back a nonce on the same connection.
5. The Server echoes the nonce in the dialback response.

It wouldn't be cheap though and I haven't thought of the things that can go wrong here. We would ALSO still face the dial optimisation problem you mention.

aarshkshah1992 commented 4 years ago

ping @raulk to address @Stebalien's concerns.

Stebalien commented 4 years ago

> It wouldn't be cheap though and I haven't thought of the things that can go wrong here.

That's my concern.

Note: If we can ensure that AutoNAT peer selection is actually random (e.g., by querying the DHT for a random set of peers as suggested by @petar), we can make this attack really hard to pull off.

aarshkshah1992 commented 4 years ago

@petar Can you please elaborate on the approach @Stebalien is talking about? Are we talking about using the DHT to "discover" peers that provide the AutoNAT service?

petar commented 4 years ago

I am guessing @Stebalien is referring to a discussion we had in person about discovering whether a node is behind a NAT. The problem that @Stebalien pointed out: if the peer you are talking to is behind the same NAT (e.g. both of you are on the same private network), then you would conclude that you are not behind a NAT. I proposed that if you look up a random peer ID on the DHT and use the peers you find to discover whether you are behind a NAT, the chosen peer will not be in your private network (with high probability), and so you will be able to make an accurate determination.

Stebalien commented 4 years ago

Note: My point here is that this solution would also help protect us (somewhat) against sybils because we'd be choosing the nodes to test instead of just using the first ones we come across.

raulk commented 4 years ago

@Stebalien I'm not following the line of thinking that leads to stalling here. The mechanism proposed here is a strict improvement over the status quo.

Just to be clear, the scope of this issue is not to suddenly make us 100% byzantine fault tolerant (if that is even possible), but rather to make us a little more intelligent. Let's take it step by step.

The first step is to not be entirely gullible. Right now, we just believe what our peer is telling us, every time. Correlating what we observe with what our peer tells us is, IMO, common sense. This would harden the private => public transition. If we consider ourselves private, and a peer tells us we're public, we should've seen an inbound dial. If not, that peer is misleading us.

The risk of not performing this correlation is that it would be relatively easy to conduct a sybil attack where AutoNAT peers unconditionally report public reachability (without even performing the promised dial), and therefore trigger downstream effects, such as having everybody join the DHT (barring local conditions in those protocols).

Let me address your comments individually, in follow-up comments.

raulk commented 4 years ago

@Stebalien

> Can't an attacker just tell us the wrong addresses?

AutoNAT does not allow us to learn our own addresses.
AFAIK, that's identify.
With AutoNAT we can confirm whether we're truly diallable on those addresses. So what could happen is this: I send a list of candidates, the attacker dials me and succeeds on address A1, and I correlate the inbound dial successfully, but the attacker reports A2 as the diallable one. That would mislead me.
To mitigate such attacks, the current implementation draws N observations before making a state transition.
An attacker would need to orchestrate a bunch of nodes, and actually perform dials (if we implement this issue). That raises the cost of attack.
For non-diallable nodes, such an attack is difficult to carry out, since they are not connectable they cannot be directly targeted.
- The attacker would have to rely on flooding the network with sybils, such that other protocols (e.g. DHT) would make use of those sybils, and in the process, perform AutoNAT requests.
For diallable nodes, such an attack would be easy to carry out through eclipsing. It requires very few sybils.
- Mitigation: We can curtail this attack by requiring observations to come from a mixture of inbound and outbound nodes, e.g. 50/50.
- That would increase the cost of performing this attack, because the attacker would need to somehow insert themselves as nodes we've dialled to.
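To illustrate the mitigation above, here is a minimal sketch of a transition rule that needs N positive observations and an even inbound/outbound split. The `Observation` type, the `ShouldGoPublic` name and the exact thresholds are assumptions for illustration, not what the current implementation does.

```go
package main

import "fmt"

// Observation is one AutoNAT answer (hypothetical type).
type Observation struct {
	Public  bool // the server reported us as publicly diallable
	Inbound bool // true if that server had connected to us; false if we dialled it
}

// ShouldGoPublic allows the private => public transition only when at least
// n servers reported us public and at least n/2 of those reports came from
// each connection direction (the "50/50" mixture).
func ShouldGoPublic(obs []Observation, n int) bool {
	var in, out int
	for _, o := range obs {
		if !o.Public {
			continue
		}
		if o.Inbound {
			in++
		} else {
			out++
		}
	}
	return in+out >= n && in >= n/2 && out >= n/2
}

func main() {
	mixed := []Observation{
		{Public: true, Inbound: true},
		{Public: true, Inbound: true},
		{Public: true, Inbound: false},
		{Public: true, Inbound: false},
	}
	eclipsed := []Observation{
		{Public: true, Inbound: true},
		{Public: true, Inbound: true},
		{Public: true, Inbound: true},
		{Public: true, Inbound: true},
	}
	fmt.Println(ShouldGoPublic(mixed, 4))    // enough reports from both directions
	fmt.Println(ShouldGoPublic(eclipsed, 4)) // all reports from peers that dialled us
}
```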

FWIW, this is an entirely different attack than the one this issue aims to thwart. Suggestion: track in another issue.

raulk commented 4 years ago

@Stebalien

> Also note: forcing the dial to complete means we can't optimize the dial later. In an ideal world, the AutoNAT server would just (with TLS/QUIC):

AFAIK, the dial already completes entirely (crypto handshake and stream muxer negotiation included), so this is a no-op.
The operative word in your answer is ideal. Let's not let perfect stand in the way of better.

Suggestion: let's open another issue to track this concern.

raulk commented 4 years ago

@Stebalien

> This saves the AutoNAT server from having to do any fancy crypto beyond computing the initial DH params, making this service significantly more efficient.

This honestly sounds like premature optimisation. I do not expect AutoNAT to incur so many dials that the overhead becomes observable. I think the global footprint of this overhead is negligible. It could be uneven across the network if we have too few AutoNAT servers and too many AutoNAT clients (i.e. the servers are overloaded), but if we're moving to a true p2p model (where all publicly diallable nodes operate as AutoNAT servers), I expect the global load to be a lot more distributed.

For perspective, comparatively, I expect DHT queries to perform a lot more dials (and in a spiky fashion) than AutoNAT. So alleviating the crypto handshake would benefit the DHT protocol a lot more than AutoNAT IMO.

Suggestion: track elsewhere, at the go-libp2p level probably.

raulk commented 4 years ago

@Stebalien

> If we do go with this, I'd like to avoid unnecessary crypto. Instead of a per-request nonce, we should just let the AutoNAT server sign their main key with their dialer/testing key once up-front.

That's fine. I suggested a per-request opaque and stateless nonce because I considered it more secure. It makes the server work a little to prove that the dialback is in response to a given request, but that might be superfluous and wouldn't add much extra security.

marten-seemann commented 3 years ago

I'm late to the party, but this is something we might want to pick up at some point, so here's my proposal.

> Note: If we can ensure that AutoNAT peer selection is actually random (e.g., by querying the DHT for a random set of peers as suggested by @petar), we can make this attack really hard to pull off.

Agreed, that sounds like a good idea. In my opinion, this should be part of a multi-layered defense, i.e. we should still fix the underlying vulnerability.

I think we can simplify the various suggestions quite a bit and get rid of all additional crypto (no signing, no encrypting) altogether, if we're willing to pay the price of the libp2p handshake. First of all, I think establishing a connection is acceptable because:

- As @raulk points out, we'll have a lot of AutoNAT servers in the network, and we only need a few dialbacks to confirm an address, so the load of this protocol will be pretty low.
- For QUIC addresses, the only (non-hacky) way to check for connectivity is by completing the handshake anyway.
- If we're willing to pay that price for QUIC addresses, it's probably not worth optimizing the algorithm for TCP addresses.

The protocol I'm suggesting is a simple 2-step protocol:

1. The initiator requests a dialback for one multiaddr (requests for other multiaddrs may be sent in separate requests). Included in this request is a random ID. The initiator keeps a (frequently gc'ed) map of ID => multiaddr.
2. The receiver dials the connection to that multiaddr, and sends back the ID. Receiving the identifier adds confidence that the address is actually reachable. Note that there's no need to transmit the multiaddr in this step.

If the nonce is chosen from a large enough space (a uint64 should provide plenty of space for this purpose), collisions are sufficiently unlikely.

Possible attack: There's no way to prove that the receiver actually dialed the address contained in the request before sending a given identifier. An attacker could wait for a request, and transmit the identifier on a connection dialed to a different address, falsely leading the initiator to believe that the requested address is actually reachable. I don't see any defense against this attack, other than randomly selecting the peers.

Stebalien commented 3 years ago

> For QUIC addresses, the only (non-hacky) way to check for connectivity is by completing the handshake anyway.

Really? Isn't it possible to connect to a QUIC endpoint, receive their side of the handshake, then kill the connection before authenticating?

Stebalien commented 3 years ago

Note: your proposal sounds reasonable, and I guess my previous comment might fall under "hacky".

> Possible attack: There's no way to prove that the receiver actually dialed the address contained in the request before sending a given identifier. An attacker could wait for a request, and transmit the identifier on a connection dialed to a different address, falsely leading the initiator to believe that the requested address is actually reachable. I don't see any defense against this attack, other than randomly selecting the peers.

Eh, there's no going around this really.

marten-seemann commented 3 years ago

> Really? Isn't it possible to connect to a QUIC endpoint, receive their side of the handshake, then kill the connection before authenticating?

There's the Retry mechanism, which is designed for the server to validate return routability to the client's address. It's extremely lightweight, as it doesn't even require decryption of the packet, but for the client there's no reliable way to trigger a Retry packet. A client could also abort the handshake right after receiving the server's TLS certificate, but at this point, the computationally expensive part of the handshake is already over. Anyway, both methods would require modifications to the QUIC stack, which is what I meant by "hacky".

> Note: your proposal sounds reasonable

We need to decide if we keep the protocol ID constant (and add fields to the protobufs), or bump the version number of this protocol. As this is quite a significant deviation from what we have so far (in terms of wire encoding, logic and security properties), I'm leaning towards bumping the version number, and doing a phased upgrade.

Stebalien commented 3 years ago

Yes, I think we'd need to bump the protocol version.

libp2p / go-libp2p

AutoNAT should correlate dialback results with actual incoming connections #1480