Open raulk opened 5 years ago
Risk of not doing this: an attacker could lead us to believe we are public when we aren't, therefore jeopardising inbound connectivity.
this is rather complicated...
@vyzo care to elaborate?
My understanding is that this is easy to achieve. You register a Notifee that ignores all events unless you're undergoing AutoNAT determination. At that point, you track all incoming connections, and when the peer has responded positively or negatively, you check to see if their answer is coherent with what you observed:
It's not that simple. The AutoNATService peer uses a background host to dial back, so the peer ID is unknown. The best we can do is correlate the IP address, but that's error prone and very ugly to program. I am inclined to mark this as WONTFIX.
Ah, understood the complexity now. I was sure I was missing something. I agree the IP address is an unreliable heuristic. I wonder if we can have the server open a stream and sign a message with its real identity, so that the client can do the matching. I really think we need to solve this one way or another.
@raulk @yusefnapora
Can we solve this by having the AutoNAT server send the dial response as a signed peer record, signed with the key from which the peer ID of the host we asked for a dialback was derived?
The above solution isn't a solution to this problem. The problem we want to solve is:
"We want to be absolutely sure that the AutoNAT server did indeed dial us before sending us a dial response & isn't just faking it"
One simple way to solve what @vyzo pointed out is for the requesting peer to send a nonce in the request, and have the responding peer return a certificate of its dialback host's identity. It would return the peer ID, public key, and a signature of pubkey || nonce. This is simple to implement and almost stateless. We'd need to hook in a connection notifee, and every time we request a dialback, we enable tracking of inbound connections. When the peer responds, we check whether we actually received the connection they claim to have made.
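A minimal sketch of the nonce scheme described above, using only the Go standard library. Everything here is an assumption made for illustration: the `DialbackCert` type, the function names, the choice of Ed25519, the 16-byte nonce, and the reading that the dialback host's own key produces the signature. None of it is an existing AutoNAT message or go-libp2p API.

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"crypto/rand"
	"errors"
)

// DialbackCert is a hypothetical response field, not an existing AutoNAT message.
type DialbackCert struct {
	DialerPub ed25519.PublicKey // public key of the server's dialback host
	Sig       []byte            // signature over DialerPub || nonce
}

// GenerateDialerKey creates a throwaway key for the dialback host.
func GenerateDialerKey() (ed25519.PrivateKey, error) {
	_, priv, err := ed25519.GenerateKey(rand.Reader)
	return priv, err
}

// NewNonce returns a fresh random nonce for one dial request.
func NewNonce() ([]byte, error) {
	nonce := make([]byte, 16)
	_, err := rand.Read(nonce)
	return nonce, err
}

// certPayload builds pubkey || nonce in a fresh buffer.
func certPayload(pub, nonce []byte) []byte {
	msg := make([]byte, 0, len(pub)+len(nonce))
	msg = append(msg, pub...)
	return append(msg, nonce...)
}

// SignCert is the server side: the dialback key signs pubkey || nonce.
func SignCert(priv ed25519.PrivateKey, nonce []byte) DialbackCert {
	pub := priv.Public().(ed25519.PublicKey)
	return DialbackCert{DialerPub: pub, Sig: ed25519.Sign(priv, certPayload(pub, nonce))}
}

// VerifyCert is the client side: the signature must cover our nonce, and the
// certified key must match a key seen on an inbound connection while tracking.
func VerifyCert(cert DialbackCert, nonce []byte, observed [][]byte) error {
	if len(cert.DialerPub) != ed25519.PublicKeySize {
		return errors.New("malformed dialer public key")
	}
	if !ed25519.Verify(cert.DialerPub, certPayload(cert.DialerPub, nonce), cert.Sig) {
		return errors.New("signature does not cover pubkey || nonce")
	}
	for _, key := range observed {
		if bytes.Equal(key, cert.DialerPub) {
			return nil
		}
	}
	return errors.New("no inbound connection from the certified dialer")
}
```

The client would collect `observed` from its connection notifee while a dialback is pending, and treat any error from `VerifyCert` as "this response cannot be trusted".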
This makes the system more Byzantine Fault Tolerant. If we don’t implement this, a DHT client could be trivially misled into thinking it’s diallable, and would attempt to join the DHT as a server.
Note:
Even after we finish this, an AutoNAT server can still falsely tell a client that its NAT status is private.
Can't an attacker just tell us the wrong addresses? This may help, a little, in some cases, but I want to make sure it's worth the extra complexity.
Also note: forcing the dial to complete means we can't optimize the dial later. In an ideal world, the AutoNAT server would just (with TLS/QUIC):
This saves the AutoNAT server from having to do any fancy crypto beyond computing the initial DH params, making this service significantly more efficient.
This is simple to implement and almost stateless. We’d need to hook in a connection notifee, and every time we request a dial back, we enable tracking of inbound connections, then correlate when the peer responds to us whether we indeed received the connection they claim to have made.
It's a little trickier than that.
If we do go with this, I'd like to avoid unnecessary crypto. Instead of a per-request nonce, we should just let the AutoNAT server sign their main key with their dialer/testing key once up-front.
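A rough sketch of the up-front variant, assuming Ed25519 keys and a client-side cache keyed by the server's main public key. `DialerCert`, `DialerCache`, and the function names are hypothetical and only illustrate the idea of the dialer key signing the main key once instead of signing a nonce per request.

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"crypto/rand"
	"errors"
)

// DialerCert is hypothetical: the dialer key vouches for the main key once.
type DialerCert struct {
	DialerPub ed25519.PublicKey
	Sig       []byte // signature by the dialer key over the server's main public key
}

// NewKey generates an Ed25519 key pair.
func NewKey() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}

// IssueDialerCert is the server side, run once rather than per request.
func IssueDialerCert(mainPub ed25519.PublicKey, dialerPriv ed25519.PrivateKey) DialerCert {
	return DialerCert{
		DialerPub: dialerPriv.Public().(ed25519.PublicKey),
		Sig:       ed25519.Sign(dialerPriv, mainPub),
	}
}

// DialerCache remembers the verified dialer key per server main key.
type DialerCache map[string]ed25519.PublicKey

// Learn verifies the certificate against the server we are talking to and caches it.
func (c DialerCache) Learn(mainPub ed25519.PublicKey, cert DialerCert) error {
	if len(cert.DialerPub) != ed25519.PublicKeySize ||
		!ed25519.Verify(cert.DialerPub, mainPub, cert.Sig) {
		return errors.New("dialer certificate does not verify")
	}
	c[string(mainPub)] = cert.DialerPub
	return nil
}

// IsDialerOf reports whether an observed inbound key is the server's known dialer.
func (c DialerCache) IsDialerOf(mainPub, observed ed25519.PublicKey) bool {
	known, ok := c[string(mainPub)]
	return ok && bytes.Equal(known, observed)
}
```

This trades the per-request signature for one cached verification, at the cost of no longer tying a dialback to a specific request.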
Can't an attacker just tell us the wrong addresses?
It can but it would also require the attacker to do some POW in the form of signing the nonce & thus isn't free. We should also validate that the returned address is among the ones we asked it to probe. I don't think we do it right now.
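The address check mentioned above could be as small as the following sketch. It assumes addresses are compared as plain strings for brevity, and the function name is made up for illustration. Real code would more likely compare parsed multiaddrs.

```go
package main

// AddrWasRequested reports whether the address in a dial response is one of
// the addresses the client asked the server to probe.
func AddrWasRequested(requested []string, returned string) bool {
	for _, addr := range requested {
		if addr == returned {
			return true
		}
	}
	return false
}
```

A response whose address fails this check would be discarded before it influences the reachability decision.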
If we do go with this, I'd like to avoid unnecessary crypto. Instead of a per-request nonce, we should just let the AutoNAT server sign their main key with their dialer/testing key once up-front.
So, if we don't have the dialerID for an AutoNAT server, we should ask the server for a certificate and then send the dial request? We would still have to match the dialerID with incoming connections and face the races that you mention.
I agree with everything else.
@Stebalien
Also, note that there are ways to solve the races that you mention.
It's a little trickier than that.
We may learn about the dial completing after the autonat server has finished sending their response. Technically, we may never learn about the dial completing because the autonat server may learn about it first, then kill the connection before we see the dial complete.
We could modify the protocol to roughly do something like:
Client connects to the Server and asks for the Identity certificate -> Server sends a signed Identity certificate so we can start tracking the dialer -> Client asks the Server to go ahead with the dial -> When the Client receives the inbound dial, it sends back a nonce on the same connection -> Server echoes back the nonce in the dial back response.
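The client side of that flow might look roughly like the sketch below. It is an illustration only: `DialbackClient` and its methods are hypothetical, peer IDs are plain strings, and the nonce is a random uint64. The point it shows is that the server can only echo the nonce if its dialer really reached the client, which sidesteps the race where the response arrives before the client notices the dial.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
)

// DialbackClient tracks one dialback attempt on the client side.
type DialbackClient struct {
	dialerID string // learned from the identity certificate
	nonce    uint64 // sent to the dialer on the inbound connection
	sawDial  bool
}

// OnCertificate records the dialer identity from the verified certificate.
func (c *DialbackClient) OnCertificate(dialerID string) {
	c.dialerID = dialerID
	c.sawDial = false
}

// OnInbound is called for each inbound connection. If it comes from the
// tracked dialer, it returns the nonce to send back on that connection.
func (c *DialbackClient) OnInbound(remoteID string) (uint64, bool) {
	if c.dialerID == "" || remoteID != c.dialerID {
		return 0, false
	}
	var buf [8]byte
	if _, err := rand.Read(buf[:]); err != nil {
		return 0, false
	}
	c.nonce = binary.BigEndian.Uint64(buf[:])
	c.sawDial = true
	return c.nonce, true
}

// OnResponse accepts the dial response only if it echoes the nonce that was
// handed to the dialer on the inbound connection.
func (c *DialbackClient) OnResponse(echoed uint64) bool {
	return c.sawDial && echoed == c.nonce
}
```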
It wouldn't be cheap though and I haven't thought of the things that can go wrong here. We would ALSO still face the dial optimisation problem you mention.
ping @raulk to address @Stebalien's concerns.
It wouldn't be cheap though and I haven't thought of the things that can go wrong here.
That's my concern.
Note: If we can ensure that AutoNAT peer selection is actually random (e.g., by querying the DHT for a random set of peers as suggested by @petar), we can make this attack really hard to pull off.
@petar Could you please elaborate on the approach @Stebalien is talking about? Are we talking about using the DHT to "discover" peers that provide the AutoNAT service?
I am guessing @Stebalien is referring to a discussion we had in person about discovering whether a node is behind a NAT. The problem that @Stebalien pointed out: If the peer you are talking to is behind the same NAT (e.g. both of you are on the same private network), then you would conclude that you are not behind a NAT. I proposed that if you look up a random peer ID on the DHT and use the resulting peer to discover whether you are behind a NAT, the chosen peer will not be in your private network (with high probability), and so you will be able to make an accurate determination.
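To illustrate the selection idea, here is a local sketch that picks the peers whose hashed IDs are XOR-closest to a random target. It assumes Kademlia-style keys derived with SHA-256 and works on an in-memory candidate list. All names are hypothetical, and a real implementation would issue an actual DHT query rather than sort a local slice.

```go
package main

import (
	"bytes"
	"crypto/rand"
	"crypto/sha256"
	"sort"
)

// KeyOf maps a peer ID to its position in the key space.
func KeyOf(peerID string) [32]byte {
	return sha256.Sum256([]byte(peerID))
}

// RandomTarget picks a uniformly random point in the key space.
func RandomTarget() ([32]byte, error) {
	var target [32]byte
	_, err := rand.Read(target[:])
	return target, err
}

// xorDistance returns the bytewise XOR of two keys.
func xorDistance(a, b [32]byte) [32]byte {
	var d [32]byte
	for i := range d {
		d[i] = a[i] ^ b[i]
	}
	return d
}

// ClosestTo returns up to k peer IDs closest to target, without mutating peerIDs.
func ClosestTo(target [32]byte, peerIDs []string, k int) []string {
	sorted := append([]string(nil), peerIDs...)
	sort.Slice(sorted, func(i, j int) bool {
		di := xorDistance(KeyOf(sorted[i]), target)
		dj := xorDistance(KeyOf(sorted[j]), target)
		return bytes.Compare(di[:], dj[:]) < 0
	})
	if k < 0 {
		k = 0
	}
	if k > len(sorted) {
		k = len(sorted)
	}
	return sorted[:k]
}
```

Because the target is random, an attacker cannot cheaply predict which peers will be asked, unlike "the first peers we come across".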
Note: My point here is that that solution would also help protect us (somewhat) against Sybils because we'd be choosing the nodes to test instead of just using the first ones we come across.
@Stebalien I'm not following the line of thinking that leads to stalling here. The mechanism proposed here is a strict improvement over the status quo.
Just to be clear, the scope of this issue is not to suddenly make us 100% byzantine fault tolerant (if that is even possible), but rather to make us a little more intelligent. Let's take it step by step.
The first step is to not be entirely gullible. Right now, we just believe what our peer is telling us, every time. Correlating what we observe with what our peer tells us is, IMO, common sense. This would harden the private => public transition. If we consider ourselves private, and a peer tells us we're public, we should've seen an inbound dial. If not, that peer is misleading us.
The risk of not performing this correlation is that it would be relatively easy to conduct a sybil attack where AutoNAT peers unconditionally report public reachability (without even performing the promised dial), and therefore trigger downstream effects, such as having everybody join the DHT (barring local conditions in those protocols).
Let me address your comments individually, in follow-up comments.
@Stebalien
Can't an attacker just tell us the wrong addresses?
FWIW, this is an entirely different attack than the one this issue aims to thwart. Suggestion: track in another issue.
@Stebalien
Also note: forcing the dial to complete means we can't optimize the dial later. In an ideal world, the AutoNAT server would just (with TLS/QUIC):
Suggestion: let's open another issue to track this concern.
@Stebalien
This saves the AutoNAT server from having to do any fancy crypto beyond computing the initial DH params, making this service significantly more efficient.
This honestly sounds like premature optimisation. I do not expect AutoNAT to perform so many dials that this overhead becomes observable. I think the global footprint of this overhead is negligible. It could be uneven across the network if we have too few AutoNAT servers and too many AutoNAT clients (i.e. the servers are overloaded), but if we're moving to a true p2p model (where all publicly diallable nodes operate as AutoNAT servers), I expect the global load to be a lot more distributed.
For perspective, comparatively, I expect DHT queries to perform a lot more dials (and in a spiky fashion) than AutoNAT. So alleviating the crypto handshake would benefit the DHT protocol a lot more than AutoNAT IMO.
Suggestion: track elsewhere, at the go-libp2p level probably.
@Stebalien
If we do go with this, I'd like to avoid unnecessary crypto. Instead of a per-request nonce, we should just let the AutoNAT server sign their main key with their dialer/testing key once up-front.
That's fine. I suggested a per-request opaque and stateless nonce because I considered it more secure. It makes the server work a little to prove that the dialback is in response to a given request, but that might be superfluous and wouldn't add much extra security.
I'm late to the party, but this is something we might want to pick up at some point, so here's my proposal.
Note: If we can ensure that AutoNAT peer selection is actually random (e.g., by querying the DHT for a random set of peers as suggested by @petar), we can make this attack really hard to pull off.
Agreed, that sounds like a good idea. In my opinion, this should be part of a multi-layered defense, i.e. we should still fix the underlying vulnerability.
I think we can simplify the various suggestions quite a bit and get rid of all additional crypto (no signing, no encrypting) altogether, if we're willing to pay the price of the libp2p handshake. First of all, I think establishing a connection is acceptable because:
The protocol I'm suggesting is a simple 2-step protocol:
ID => multiaddr
If the nonce is chosen from a large enough space (a uint64 should provide plenty of space for this purpose), collisions are sufficiently unlikely.
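Since the two steps are not spelled out here, the following is only a guess at the bookkeeping the "ID => multiaddr" mapping implies: the initiator stores a random uint64 per requested address and resolves it when the identifier comes back on a dialback connection. The type and method names are hypothetical, and addresses are plain strings for brevity.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"sync"
)

// PendingDialbacks maps a random identifier to the address it was issued for.
type PendingDialbacks struct {
	mu   sync.Mutex
	byID map[uint64]string
}

// NewPendingDialbacks returns an empty table.
func NewPendingDialbacks() *PendingDialbacks {
	return &PendingDialbacks{byID: make(map[uint64]string)}
}

// Add draws a random uint64 for addr, retrying in the unlikely case of a collision.
func (p *PendingDialbacks) Add(addr string) (uint64, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for {
		var buf [8]byte
		if _, err := rand.Read(buf[:]); err != nil {
			return 0, err
		}
		id := binary.BigEndian.Uint64(buf[:])
		if _, taken := p.byID[id]; !taken {
			p.byID[id] = addr
			return id, nil
		}
	}
}

// Resolve returns the address an identifier was issued for and forgets it,
// so each identifier can confirm at most one dialback.
func (p *PendingDialbacks) Resolve(id uint64) (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	addr, ok := p.byID[id]
	delete(p.byID, id)
	return addr, ok
}
```

With n identifiers outstanding, the chance of any two colliding is roughly n²/2⁶⁵, which is negligible for the handful of requests a node has in flight.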
Possible attack: There's no way to prove that the receiver actually dialed the address contained in the request to send a certain identifier. An attacker could wait for a request, and transmit the identifier on a connection dialed to a different address, falsely leading the initiator to believe that the requested address is actually reachable. I don't see any defense against this attack, other than randomly selecting the peers.
For QUIC addresses, the only (non-hacky) way to check for connectivity is by completing the handshake anyway.
Really? Isn't it possible to connect to a QUIC endpoint, receive their side of the handshake, then kill the connection before authenticating?
Note: your proposal sounds reasonable, and I guess my previous comment might fall under "hacky".
Possible attack: There's no way to prove that the receiver actually dialed the address contained in the request to send a certain identifier. An attacker could wait for a request, and transmit the identifier on a connection dialed to a different address, falsely leading the initiator to believe that the requested address is actually reachable. I don't see any defense against this attack, other than randomly selecting the peers.
Eh, there's no going around this really.
Really? Isn't it possible to connect to a QUIC endpoint, receive their side of the handshake, then kill the connection before authenticating?
There's the Retry mechanism, which is designed for the server to validate return routability to the client's address. It's extremely lightweight, as it doesn't even require decryption of the packet, but for the client there's no reliable way to trigger a Retry packet. A client could also abort the handshake right after receiving the server's TLS certificate, but at this point, the computationally expensive part of the handshake is already over. Anyway, both methods would require modifications to the QUIC stack, which is what I meant by "hacky".
Note: your proposal sounds reasonable
We need to decide if we keep the protocol ID constant (and add fields to the protobufs), or bump the version number of this protocol. As this is quite a significant deviation from what we have so far (in terms of wire encoding, logic and security properties), I'm leaning towards bumping the version number, and doing a phased upgrade.
Yes, I think we'd need to bump the protocol version.
Right now it's pretty trivial to lie to an AutoNAT client by reporting incorrect dialback results. We should register a Notifee and track incoming connections when a dialback is requested, so we can correlate an OK result with an actual observed incoming connection. This makes it more difficult for enemies to confuse us.