What's the plan for resolving https://github.com/libp2p/specs/issues/536? Would you open a new PR that targets this PR here?
Yes, I'll open a PR with the changes for #536.
thanks for your review @thomaseizinger. I'd like your opinion on these two issues
Proposal: use a list of addresses in priority order for autonat v2 dial requests #539
Proposal: allow AutoNAT to dial all IP addresses, without risking amplification attacks #536
I don't have anything to add to those at the moment :)
Thanks for your review @MarcoPolo
It seems like there's a healthy discussion already going on, so I'll step back here and let other folks stay involved. If there's anything I can help with, please don't hesitate to ping.
The suggested strategy is discussed here: https://github.com/libp2p/specs/issues/536 Please check if we've made any errors there or overlooked something.
Here's the PR for those changes: https://github.com/libp2p/specs/pull/542 You can review it there, or here after I merge those changes.
I have a general remark that might be relevant to this spec.
About a month or so ago I tested AutoNAT with QUIC (rust-libp2p). A node behind NAT could be dialed back successfully, falsely reporting its status as public. My working theory is that the dial-back happens on the same ip-port of a previously established connection. According to @mxinden on Matrix, the Go implementation supposedly uses a different port for the dial-back.
This v2 spec might solve that by allowing individual addresses to be probed, but I am not totally sure. Perhaps my concern isn't relevant to this spec and is an issue specific to the Rust lib; feel free to ignore it in that case.
Go uses a separate host for exactly that reason. If Rust dials back on the same 4-tuple, the connection attempt is guaranteed to succeed, rendering AutoNAT useless (actually, worse than useless, since it's reporting an incorrect result). This issue would still exist with AutoNAT v2.
Thanks for that insight, @marten-seemann!
Just to make sure I understand, the crux of the problem is then that the source port used for the DialRequest is the same as the address that is dialed back by the peer with DialAttempt? Is this a problem that is specific to UDP/QUIC (in the Rust impl), with TCP using a different source socket to establish an outbound connection (unless port reuse is enabled)?
By the way, Host is terminology specific to Go? In case of the Rust impl, what would this entail?
As far as I know, rust-libp2p uses reuseport for TCP. In that case, the kernel will not allow you to bind another connection to the same 4-tuple, so I'd expect that rust-libp2p always returns a DIAL_FAILED error, which would also not be correct. I haven't tried it out though.
What @marten-seemann described above for rust-libp2p is correct. Let's move the rust-libp2p specific discussion to https://github.com/libp2p/rust-libp2p/issues/3900. Thanks @marten-seemann and @b-zee.
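For anyone following along, here is a minimal Go sketch of the separate-host idea described above. It only uses go-libp2p's public API (libp2p.New, Host.Connect); the function name and the per-dial throwaway host are illustrative, not the actual autonatv2 server code linked below.

```go
package autonatsketch

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/peer"
	ma "github.com/multiformats/go-multiaddr"
)

// dialBack attempts to reach clientID on exactly the requested address using a
// throwaway host. Because this host shares no existing connections (and no
// listen sockets) with the main host, the dial cannot silently reuse the
// 4-tuple of the connection the DialRequest arrived on.
func dialBack(ctx context.Context, clientID peer.ID, addr ma.Multiaddr) error {
	h, err := libp2p.New(libp2p.NoListenAddrs)
	if err != nil {
		return err
	}
	defer h.Close()

	ctx, cancel := context.WithTimeout(ctx, 15*time.Second)
	defer cancel()
	return h.Connect(ctx, peer.AddrInfo{ID: clientID, Addrs: []ma.Multiaddr{addr}})
}
```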
An implementation is available here: https://github.com/libp2p/go-libp2p/pull/2469/files I have to clean up the PR a bit but the complete spec is implemented.
I read the spec and liked it a lot. I just have a tiny question: why do we use bandwidth to make the attack costly, since a server also has to pay for bandwidth? Why are we not using a different mechanism like proof-of-work? Although I get why this is difficult, since the cost of the calculation, say prime factorization, has to exceed the bandwidth cost of the server's dial. But since we took that from QUIC, I'm sure it's fine. I just wanted to know.
I had the same question and @mxinden explained it to me as: You want to make the attacker pay in the same "currency" as what gets used in the attack. Opening a new connection to a different node imposes bandwidth on that node's downlink, even if - in the best case - they drop all incoming packets and thus don't incur any other costs.
If you make an attacker pay with PoW, you'd have to figure out what the conversion rate between bandwidth and CPU power is to make the attack economically nonsensical. If, on the other hand, we make them pay roughly the same bandwidth that gets sent as part of establishing a new connection, the attacker doesn't gain anything and could just as well spend the bandwidth on the target directly, without us having to worry about the economic cost relationship between our mitigation attempt and the actual attack.
Thanks for clearing that up. I especially get that it's hard to figure out this conversion rate; however, we can't deny that this will put strain on the servers that provide AutoNATv2 services.
Yeah, I am not sure we really have a choice TBH, but @marten-seemann, @sukunrt or @mxinden can probably elaborate more on this design. Given that it has come up a second time now, should we add a more detailed explanation of the scheme to the spec?
How is this putting a strain on servers? It’s the server that decides when to require this verification, and for every incoming request, the server has the option to reject it.
Yes that's true.
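To make the mechanism discussed above concrete, here is a rough server-side sketch in Go. The helper name and the random sizing are illustrative, and the 30kB to 100kB bounds come up later in this thread; the spec text is authoritative.

```go
package autonatsketch

import (
	"crypto/rand"
	"math/big"

	ma "github.com/multiformats/go-multiaddr"
	manet "github.com/multiformats/go-multiaddr/net"
)

// Illustrative byte bounds; the exact values and framing are defined by the spec.
const (
	minDialDataBytes = 30_000
	maxDialDataBytes = 100_000
)

// dialDataRequired reports whether the server should demand dial data before
// dialing addrToDial on behalf of a client observed at observedAddr, and if so
// how many bytes to request. The idea: a dial to a third-party IP costs the
// client roughly the same bandwidth it would cost the target.
func dialDataRequired(observedAddr, addrToDial ma.Multiaddr) (bool, uint64) {
	obsIP, err1 := manet.ToIP(observedAddr)
	dialIP, err2 := manet.ToIP(addrToDial)
	if err1 == nil && err2 == nil && obsIP.Equal(dialIP) {
		// Dialing back the IP the request came from cannot be used to
		// amplify traffic towards a third party, so no dial data needed.
		return false, 0
	}
	// Pick a random amount in [min, max] so the client cannot precompute it.
	n, _ := rand.Int(rand.Reader, big.NewInt(maxDialDataBytes-minDialDataBytes+1))
	return true, uint64(minDialDataBytes + n.Int64())
}
```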
I need another clarification: in line 67 it is mentioned that every DialRequest will be sent on a stream with protocol ID /libp2p/autonat/2/dial. However, in line 219 it is also said that a DialRequest is a Message sent on /libp2p/autonat/2/dial-request. This is confusing to me. Which one is it?
Or is only the DialRequest a Message on /libp2p/autonat/2/dial, and all other Messages, like DialDataRequest and DialDataResponse, are happening on /libp2p/autonat/2/dial-request? If that's correct, why use Message on /libp2p/autonat/2/dial?
Thanks umgefahren, I've fixed this. It should be /libp2p/autonat/2/dial-request
@umgefahren
Or is only the DialRequest a Message on /libp2p/autonat/2/dial and all other Message, like DialDataRequest, DialDataResponse are happening on /libp2p/autonat/2/dial-request. If that's correct, why use Message on /libp2p/autonat/2/dial
On the stream /libp2p/autonat/2/dial-request we exchange the messages DialRequest, DialDataRequest, DialDataResponse, and DialResponse. Since we exchange multiple types of messages on the stream from both sides, we wrap them in a Message type protobuf so that we can determine which type of message we have received. So a client, on receiving a Message from the server, can determine whether it is a DialDataRequest or a DialResponse.
Great, that's clear now.
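As a concrete illustration of the wrapper described above, here is a self-contained Go sketch with hand-rolled stand-in types; real implementations use the protobuf types generated from the spec, so all names and fields here are illustrative.

```go
package autonatsketch

import "fmt"

// Hand-rolled stand-ins for the spec's protobuf messages, for illustration only.
type DialResponse struct{ Status int32 }
type DialDataRequest struct{ NumBytes uint64 }

// Message mirrors the idea of the wrapper type: exactly one field is set, so
// the receiver can tell which kind of message arrived on the
// /libp2p/autonat/2/dial-request stream.
type Message struct {
	DialResponse    *DialResponse
	DialDataRequest *DialDataRequest
	// The client-to-server messages (DialRequest, DialDataResponse) would be
	// further fields of the same wrapper.
}

// handleServerMessage shows the client-side dispatch on the wrapper.
func handleServerMessage(msg *Message) error {
	switch {
	case msg.DialResponse != nil:
		// Terminal message for this request: record the reported result.
		return nil
	case msg.DialDataRequest != nil:
		// The server wants dial data before dialing; the client now sends
		// DialDataResponse messages (see below in the thread).
		return nil
	default:
		return fmt.Errorf("unexpected or empty message")
	}
}
```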
Another question, just to make sure: there will always be several DialDataResponse messages, since the upper bound for the amount of data per message is capped at 4096 bytes and it is required to send between 30kB and 100kB. Did I understand that correctly? That means the amount of data received so far has to be tracked, per connection.
And an additional suggestion: what is the upper bound for the size of Message, i.e. how many Multiaddrs can be sent by the client?
Another question, just to make sure: There will always be several DialDataResponse, since the upper bound for the amount of data per message is capped at 4096 bytes and it is required to send between 30kB and 100kB. Did I understand that correctly? That means the amount of data received so far have to be tracked, per connection.
Yes, your understanding is correct: there will be multiple DialDataResponse messages till you get the requested amount of data. You will have to track the amount of data received so far per dial-request stream. A peer can open multiple streams in parallel. Of course, an implementation MAY choose not to allow this, but I don't think it is necessary to make this a MUST.
And an additional suggestion: What is the upper bound for the size of Message, i.e. how many Multiaddress can be send by the client?
I will think about this. Currently this is only limited by the size of the Message which again is implementation dependent. I think most implementations have a limit lower than 8kB. I think it is fine to add a suggestion to limit the number of addresses inspected.
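Here is a minimal Go sketch of the per-stream bookkeeping described two comments above (reading DialDataResponse chunks until the requested amount has arrived). The stand-in type and the readMsg callback are hypothetical; the real framing is the spec's length-prefixed protobuf.

```go
package autonatsketch

import (
	"errors"
	"fmt"
)

// DialDataResponse stand-in: Data carries an opaque chunk of at most 4096 bytes.
type DialDataResponse struct{ Data []byte }

const maxDialDataChunk = 4096

// receiveDialData reads DialDataResponse messages from one dial-request stream
// until `want` bytes have been received in total. readMsg is a stand-in for
// reading the next length-prefixed message from that stream.
func receiveDialData(readMsg func() (*DialDataResponse, error), want uint64) error {
	var got uint64
	for got < want {
		m, err := readMsg()
		if err != nil {
			return err
		}
		if len(m.Data) == 0 {
			return errors.New("empty dial data chunk")
		}
		if len(m.Data) > maxDialDataChunk {
			return fmt.Errorf("chunk of %d bytes exceeds %d byte limit", len(m.Data), maxDialDataChunk)
		}
		got += uint64(len(m.Data))
	}
	return nil
}
```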
Another quick question, I probably missed something: when the server successfully dials the client and provides the nonce, the client closes the stream either way. How does the server know if the provided nonce was correct?
In the spec it says that all private IP addresses should be excluded, but it also says it's just for checking reachability on the public internet. That said, we should exclude:
For IPv4:
For IPv6:
In the rust-libp2p implementation there was a PR discussing those globally reachable IP addresses: libp2p/rust-libp2p#3814
Also, this list is probably not complete and not formal; it's a small nitpick.
@umgefahren it'd be better if you can add the comments as a review, commenting on the specific section.
In the spec it says that all private IP address should be excluded, but it also says it's just for checking reachability on the public internet. That said, we should exclude:
It means non-public. Happy to change the wording to "non public".
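For implementers, a minimal Go sketch of such an address filter, assuming go-multiaddr's manet.IsPublicAddr matches the notion of "non public" used here; the exact set of excluded ranges is defined by the spec, not by this sketch.

```go
package autonatsketch

import (
	ma "github.com/multiformats/go-multiaddr"
	manet "github.com/multiformats/go-multiaddr/net"
)

// publicOnly drops addresses that are not publicly routable (private ranges,
// loopback, link-local, etc.) before they are handed to the autonat client.
func publicOnly(addrs []ma.Multiaddr) []ma.Multiaddr {
	out := make([]ma.Multiaddr, 0, len(addrs))
	for _, a := range addrs {
		if manet.IsPublicAddr(a) {
			out = append(out, a)
		}
	}
	return out
}
```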
Another quick question, I probably missed something: When the server successfully dials the client and provides the nonce. The client closes the stream either way. How does the server know if the provided nonce was correct?
A correct server will always provide the correct nonce, no? This issue should be easy enough to debug for implementors without signalling from the client. Is there any benefit to the server knowing that it provided an incorrect nonce?
@umgefahren it'd be better if you can add the comments as a review, commenting on the specific section.
I will do that the next time. I'm sorry.
In the spec it says that all private IP address should be excluded, but it also says it's just for checking reachability on the public internet. That said, we should exclude:
It means non-public. Happy to change the wording to "non public".
Thanks for clarification.
Another quick question, I probably missed something: When the server successfully dials the client and provides the nonce. The client closes the stream either way. How does the server know if the provided nonce was correct?
A correct server will always provide the correct nonce, no? This issue should be easy enough to debug for implementors without signalling from the client. Is there any benefit to the server knowing that it provided an incorrect nonce?
I think there is a benefit. There is a pathological example where the network configuration or a NAT forwards traffic to the wrong libp2p node: not the one that requested the dial-back, but a different one. In that case the server would report reachability on that address, but it's actually not reaching the peer in question. I'm not enough of an expert here to think of any case where that might occur apart from a bad config or a malicious actor.
In this case, the client sees that the server is reporting OK-Reachable but it has not received the nonce, so it should reject the response.
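A small Go sketch of the client-side rule described above (function and parameter names are hypothetical): a "reachable" verdict only counts if the client itself observed a dial-back carrying the expected nonce.

```go
package autonatsketch

import "errors"

// verifyReachableResult applies the rule discussed above: a reachable verdict
// from the server only counts if the client actually saw a dial-back carrying
// the nonce it put in its DialRequest.
func verifyReachableResult(serverSaysReachable, gotDialBack bool, receivedNonce, expectedNonce uint64) error {
	if !serverSaysReachable {
		return nil // nothing to verify for unreachable or refused results
	}
	if !gotDialBack || receivedNonce != expectedNonce {
		return errors.New("server reported the address reachable, but no dial-back with the expected nonce was observed; rejecting the response")
	}
	return nil
}
```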
I think there is a benefit. There is a pathological example where the network configuration or a NAT forward traffic to the wrong libp2p node.
If we dial the node with /p2p, the connection will never be fully established if we end up at a different node, so you can't send the nonce over.
So since that is not possible, the implementation doesn't need to handle this case, right?
But thanks for the clarification and sorry for the dumb questions.
So since that is not possible, the implementation doesn't need to handle this case, right?
Yep, I think you are right! We can assume that this will never happen. Feel free to use debug_assert if you want to be sure!
But thanks for the clarification and sorry for the dumb questions.
No worries at all! I think your questions are pretty spot on actually :)
While doing the rust-libp2p implementation, we discovered a race condition, which we are now circumventing with a 100ms delay. You can read the final comment by @thomaseizinger here: https://github.com/umgefahren/rust-libp2p/pull/1#issuecomment-1872977004
It happens when the server successfully performs a dial-back and thus sends the confirmation of the address back to the client, but the client hasn't progressed enough to be notified of that successful dial-back when receiving the confirmation. In that case the client wrongly assumes an address was confirmed where no dial-back occurred.
Minor correction here: The behaviour is usually that the client discards the "successful" confirmation because it has not yet processed the dial-back so it thinks the server is sending it a confirmation without having actually done the dial.
I think the correct way to solve this would be to add an ACK message from the client back to the server for the dial-back where the client can say: "Yes I've processed your dial-back". The server can then proceed to respond on the other stream and thus guarantee that we don't have a race condition between the two streams.
You can read the closing of the stream as the ACK. See: https://github.com/libp2p/go-libp2p/blob/sukun/autonat-v2-2/p2p/protocol/autonatv2/server.go#L251-L257
The spec also dictates closing the stream: https://github.com/libp2p/specs/blame/autonat-v2/autonat/autonat-v2.md#L87
Do you think an explicit ACK is better?
Yeah, I think so. I associate closing a stream with "I have no more data to write". The client never writes data, so why wouldn't it immediately close the stream? Also, reading a stream and waiting for that to fail because it has been closed is also somewhat odd :man_shrugging:
The client never writes data so why wouldn't it immediately close the stream?
That's a fair point. I'll add an ACK.
Updated the specs with a DialBackResponse
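For illustration, here is a Go sketch of the resulting server-side flow, using hypothetical stand-in types and stream helpers: the server sends the nonce on the dial-back connection and waits for the client's DialBackResponse before it writes the final DialResponse on the dial-request stream, which removes the race described above.

```go
package autonatsketch

import "errors"

// DialBack and DialBackResponse stand-ins, for illustration only.
type DialBack struct{ Nonce uint64 }
type DialBackResponse struct{ OK bool }

// doDialBack sends the nonce on the freshly established dial-back connection
// and waits for the client's explicit ACK (DialBackResponse) before the server
// goes on to send DialResponse on the original dial-request stream. Waiting for
// the ACK removes the race where the client receives the DialResponse before it
// has processed the dial-back.
func doDialBack(
	nonce uint64,
	send func(DialBack) error,
	recvAck func() (DialBackResponse, error),
) error {
	if err := send(DialBack{Nonce: nonce}); err != nil {
		return err
	}
	ack, err := recvAck()
	if err != nil {
		return err
	}
	if !ack.OK {
		return errors.New("client did not acknowledge the dial-back")
	}
	return nil
}
```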
@sukunrt given that AutoNATv2 is merged in two reference implementations, https://github.com/libp2p/rust-libp2p/pull/5526 (released in 0.13.0) and https://github.com/libp2p/go-libp2p/pull/2469 (released in 0.36.1):
Are there any outstanding comments that need to be addressed before this pull request can be merged? If there are any that are non-blocking, can they be addressed in follow-up PRs? Also, the maturity should be either a Recommendation (I believe there is demonstrated interop between Go and Rust impls?)
Also, the maturity should be either a Recommendation (I believe there is demonstrated interop between Go and Rust impls?)
@sukunrt and I did interop testing and successfully verified that they are working together.
The implementation is not used in go-libp2p yet. We should merge this after we start inferring reachability in go-libp2p.
First draft for autonat v2. https://github.com/libp2p/specs/issues/503
This protocol allows for testing reachability on exactly one address. This helps determine reachability at an address level. This also simplifies the protocol a lot.
I'll change the spec to reflect the discussion on dialing a different IP address from the node's observed IP address: https://github.com/libp2p/specs/issues/536
Discussion about the nonce in the message is here: https://github.com/libp2p/go-libp2p/issues/1480, and this comment in particular: https://github.com/libp2p/go-libp2p/issues/1480#issuecomment-1122844443