Open teiclap opened 4 years ago
Isn't the scenario you are referring to the one described in draft-ietf-tsvwg-natsupp? Maybe such SCTP support can be implemented, and then your setup is supported.
Calling
usrsctp_sysctl_set_sctp_blackhole(2);
after initialising the stack disables the sending of packets containing ABORT chunks in response to out of the blue packets. Does that work around your issue with the NAT instance?
The scenario is similar, but not exactly the same. In the K8s replica case the SCTP Endpoint is shared among SCTP instances, thus they all will share the same external and internal port numbers and the external IP address. I noticed that HB-request race condition is actually the only problem in the K8s replica scenario, that's why I'd rather see HB-request OOTB the only one to be silently discarded rather than disabling the defensive OOTB behavior completely. K8s doesn't allow more than one IP address per Pod, thus the local multihoming case doesn't apply.
usrsctp_sysctl_set_sctp_blackhole(2);
It can be a workaround in the short term, but it requires the SCTP user to know details of the implementation. A compile-time configuration flag would be better; still, I would prefer keeping the ABORT generation for other OOTB cases.
The scenario is similar, but not exactly the same. In the K8s replica case the SCTP Endpoint is shared among SCTP instances, thus they all will share the same external and internal port numbers and the external IP address. I noticed that HB-request race condition is actually the only problem in the K8s replica scenario, that's why I'd rather see HB-request OOTB the only one to be silently discarded rather than disabling the defensive OOTB behavior completely.
But if the NAT would follow what is described in the ID, the problem would not be there. The HEARTBEAT would be delivered to the correct endpoint, since the remote address is not important.
K8s doesn't allow more than one IP address per Pod, thus the local multihoming case doesn't apply.
Sure.
usrsctp_sysctl_set_sctp_blackhole(2);
It can be a workaround in the short term, but it requires the SCTP user to know details of the implementation. A compile-time configuration flag would be better; still, I would prefer keeping the ABORT generation for other OOTB cases.
Please note that the sysctl variable exists to make attackers' lives harder. That is why it is there, not for working around limited NAT implementations.
Please note that the server is free to use its secondary address for all packets it is sending. So there is no reason to limit your special handling to HEARTBEAT chunks just because that is what you have observed up to now...
The reason why I want special handling for HB-request is that SCTP probes a path before using it for any traffic, and the NAT (Linux sctp_conntrack) doesn't recognize the association based on the vTag but only looks at source/destination IP addresses and ports. The race condition happens between the local SCTP trying to probe the secondary path and the remote SCTP also trying to probe that secondary path. If the packet from the local SCTP reaches the NAT first, the NAT can create a table entry between the internal IP address of the right instance and the remote SCTP. When the HB arrives from the remote SCTP, it is recognized by the NAT and forwarded to the right local instance. Subsequent traffic passes through the NAT with no issues. If the HB originated by the remote SCTP arrives first, the NAT doesn't recognize it as belonging to an existing association, so it sends it to one of the local SCTP instances serving that port number, chosen at random. When the NAT chooses a local SCTP instance other than the one that owns the association, that instance replies to the HB-request with ABORT.
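The race described above can be illustrated with a toy model of a NAT that keys its table only on the remote address and port, as the comment says conntrack does. This is a simplified sketch under stated assumptions, not the Linux implementation: the names `nat_table`, `nat_outbound`, `nat_inbound`, the replica numbering and the `fallback` argument (standing in for the load balancer's random choice) are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_ENTRIES 8
#define NO_REPLICA (-1)

/* One tracked flow: remote address/port -> internal replica that owns it.
 * The shared external IP and local port are identical for every replica,
 * so this toy model leaves them out of the key. */
struct nat_entry {
    uint32_t remote_ip;
    uint16_t remote_port;
    int replica;
};

struct nat_table {
    struct nat_entry entries[MAX_ENTRIES];
    int count;
};

static int nat_lookup(const struct nat_table *t, uint32_t ip, uint16_t port)
{
    for (int i = 0; i < t->count; i++) {
        if (t->entries[i].remote_ip == ip && t->entries[i].remote_port == port)
            return t->entries[i].replica;
    }
    return NO_REPLICA;
}

/* A replica sends a packet out: the NAT learns the flow if it is new. */
static void nat_outbound(struct nat_table *t, int replica, uint32_t ip, uint16_t port)
{
    if (nat_lookup(t, ip, port) != NO_REPLICA)
        return;
    assert(t->count < MAX_ENTRIES);
    t->entries[t->count++] = (struct nat_entry){ ip, port, replica };
}

/* A packet arrives from outside: a known flow goes to its owner, an
 * unknown flow goes to whichever replica the load balancer picks. */
static int nat_inbound(const struct nat_table *t, uint32_t ip, uint16_t port, int fallback)
{
    int owner = nat_lookup(t, ip, port);
    return owner != NO_REPLICA ? owner : fallback;
}
```

In this model, if replica 0 set up the association via B' and a HEARTBEAT from B'' arrives before replica 0 has probed B'', `nat_inbound` returns the fallback replica, which has no matching association and would treat the packet as OOTB. Once replica 0 has sent its own probe to B'', the same inbound packet is delivered to replica 0.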
The reason why I wish HB-request for special handling is because SCTP does probe the path before using it for any traffic and NAT (linux sctp_conntrack) doesn't recognize the association based on vTag, but only cares about source/destination IP address and ports.
That is a limitation of the NAT you are using.
The race condition happens between the local SCTP that tried to probe the secondary path and the remote SCTP that also tried to probe that secondary path.
The local end-point does the path verification for the second address. This is only done if you don't provide both addresses in the sctp_connectx() call. If you provide both addresses, both are confirmed.
The server does not do path verification, since it sees only a single address of the client. It is free to use any local IP address for any outgoing packet at any time.
If the local SCTP arrives first to NAT, then NAT can create a table entry between the internal IP address of the right instance and the remote SCTP. When HB will arrive from the remote SCTP it will be recognized by NAT and be forwarded to the right local instance. Successive traffic will go on through NAT with no issues.
Then work around this limited NAT by issuing a HEARTBEAT to the other IP-address right after the association has been established.
If the HB originated by the remote SCTP comes first, then NAT doesn't recognize it as belonging to an existing Association, thus it will try to send it to one of the local SCTP serving that port number by using a random approach. When NAT chooses a different local SCTP than the sender, this will reply to HB-request with ABORT.
And why is your server multihomed? It doesn't really give you anything, or am I missing something?
In the considered network scenario NAT is the default networking behavior of K8s and is based on the Linux iptables implementation and the sctp_conntrack kernel module. The network node inside K8s only knows the primary address and port number of the remote SCTP EP. The remote SCTP EP is not mandated to be multihomed, but in reality all of them are. The local SCTP is always the client, the remote is always the server. I have seen that once the association is established, the server probes the secondary paths independently from the client. I can confirm that behavior for LKSCTP and other implementations. I think that using usrsctp_sysctl_set_sctp_blackhole(2); can fix the problem, but it deviates from RFC 4960, which states that ABORT "should" be sent for OOTB packets.
In the considered network scenario NAT is the default networking behavior of K8s and is based on linux iptables implementation and sctp_conntrack kernel module.
And that is one implementation.
The network node inside K8s only knows the primary address and port-number of the remote SCTP EP. The remote SCTP EP is not mandate to be multihomed, but in reality all of them are.
That is your decision how things operate. I have seen setups using multihoming where clients know multiple IP addresses of the peer, because that way they can already use both addresses for connection setup.
Local SCTP is always Client, remote is always Server. I have seen that once Association is established, the Server probes the secondary paths independently from the Client. I can confirm that behavior for LKSCTP and other implementations.
The server only has a single peer address. It does not do path verification, because it is verified by the handshake. The server is free to use any IP-address for sending its packets. It can do this, for example, for sending DATA chunks or SACK chunks. That is implementation dependent.
I think that using usrsctp_sysctl_set_sctp_blackhole(2); can fix the problem, still it's amending on rfc4960 as it states that ABORT "should" be sent for OOTB packets.
The RFC has the rules it has for good reasons. I do not see that it is necessary to change them. As I said, the problem is not limited to HEARTBEAT chunks.
Let's consider the scenario once more. The execution environment is K8s, which does not offer much flexibility; at least it's possible to set up the networking so that ports are not scrambled when masquerading, but the NAT knows SCTP only via IP addresses and port numbers. Nevertheless, K8s and its execution environment are becoming important, and SCTP has trouble being deployed in K8s. The setup is similar to the picture in draft-ietf-tsvwg-natsupp, section 7.2, but with Host-A implemented by a pair of replicas. What is new and beyond the RFC is the implementation of the same host via multiple instances of the protocol stack that still share exactly the same endpoints. We have 2 paths between Host-A and Host-B, path-AB' and path-AB", where path-AB' is used for association initiation. Since path-AB" is not used for association initiation, according to the RFC it needs to be probed before being used. I don't think implementors can avoid that strong construct, nor can other traffic flow on that path before HB/HB-ack has been completed in each direction. That's why I am pointing at the HB-request chunk only; the other parts of the RFC, as you stated, have good reasons for being there. That K8s-specific behavior may be implemented by defining a value for usrsctp_sysctl_set_sctp_blackhole other than 0 and 1 as currently implemented.
I think that Section 8.4 may be improved a little by explaining that the OOTB ABORT should be used only when the offending traffic is related to one of the Endpoints owned by the SCTP stack and not in general, so that parallel instances of SCTP Stack can exist, thus solving the well-known issue with LKSCTP.
Let's consider the scenario once more. The execution environment is K8s, that has not very much of flexibility, at least it's possible to set the networking in order not to scramble ports when masquerading, but NAT does know SCTP only via IP addresses and port numbers.
I understand. However using this kind of NAT enforces some constraints to the SCTP usage when it comes to multihoming.
Nevertheless, K8s and its execution environment are getting important and SCTP has troubles in being deployed in K8s. Similar to the picture described in draft-ietf-tsvwg-natsupp at section 7.2 but implementing Host-A by means of a pair of replicas.
I think this is exactly the case. It seems you have multiple end-points behind the NAT and they are all sharing the same port number. If talking to the same peer, it would be a port number collision case.
Actually what is new and beyond the rfc is the implementation of the same host via multiple instances of the protocol stack, still sharing exactly the same Endpoints.
Again, I think this is covered by the Internet Draft describing an SCTP aware NAT.
We have 2 paths between Host-A and Host-B, path-AB' and path-AB", where path-AB' is used for Association Initiation.
Just to be crystal clear: Host-A is behind the NAT and uses a single private address, Host-B has two public addresses B' and B'', Host-A is initiating the association towards Host-B. Host A is sending the INIT towards B'.
Since path-AB" is not used for Association Initiation, according to rfc it needs to be probed before being used, I don't think the implementors can avoid that strong construct, nor other traffic can flow on that path before HB/HB-ack has been completed in each direction.
Please note that Host-A knows after connection setup two IP-Addresses of the peer: B' and B''. B' is confirmed because it was provided by the upper layer. Therefore Host-A has to perform verification of the address B'' by sending a path verification HEARTBEAT to it.
Host-B only knows a single IP-address of its peer: A (the public address of the NAT). It is confirmed by the second list entry in Section 5.4. Therefore Host-B is not doing any path verification. Please note that Host-B is free to use B' and B'' as source addresses for all packets it sends towards A right after the handshake. Only A is limited by sending packets only towards B' until the B'' is verified.
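The confirmation rules applied in the two paragraphs above can be written down as a small predicate. This is a minimal sketch of the reasoning as stated in this thread, not code from usrsctp; the type and function names (`addr_source`, `peer_addr`, `addr_is_confirmed`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* How an endpoint learned about one of its peer's addresses. */
enum addr_source {
    ADDR_FROM_UPPER_LAYER,   /* passed in by the application when connecting */
    ADDR_FROM_HANDSHAKE,     /* the address the handshake itself was run with */
    ADDR_LISTED_BY_PEER      /* only listed in the peer's INIT or INIT-ACK */
};

struct peer_addr {
    enum addr_source source;
    bool heartbeat_acked;    /* a verification HEARTBEAT was answered */
};

/* An address that was merely listed by the peer stays unconfirmed until a
 * HEARTBEAT sent to it has been acknowledged; the other two kinds are
 * confirmed from the start. */
static bool addr_is_confirmed(const struct peer_addr *a)
{
    assert(a != NULL);
    if (a->source == ADDR_LISTED_BY_PEER)
        return a->heartbeat_acked;
    return true;
}
```

Applied to the scenario: Host-A holds B' as `ADDR_FROM_UPPER_LAYER` (confirmed) and B'' as `ADDR_LISTED_BY_PEER` (unconfirmed until probed), while Host-B holds only A as `ADDR_FROM_HANDSHAKE`, so it has nothing left to verify.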
That's why I am pointing in HB-request chunk only, the other parts of rfc, as you stated, have good reasons for being there.
There is also a good reason for HB-requests.
Assume an attacker owns address A, wants to attack a victim owning address V, and uses for this some host B owning address B. The attacker sends an INIT to host B and lists IP address V. After the association setup, host B will send path verification heartbeats to V. If V did not respond, it would get a lot of them. This allows the attacker to run a packet amplification attack against V.
In summary:
That K8s specific behavior may be implemented by defining a value for usrsctp_sysctl_set_sctp_blackhole other than 0 and 1 as it's currently implemented.
I am suggesting using a value of 2. This will disable OOTB handling completely, which is what you need. It is against RFC 4960, but some people prefer to make life harder for port scanners. That is what the sysctl variable is for.
I think that Section 8.4 may be improved a little by explaining that the OOTB ABORT should be used only when the offending traffic is related to one of the Endpoints owned by the SCTP stack and not in general, so that parallel instances of SCTP Stack can exist, thus solving the well-known issue with LKSCTP.
As described above, it is important to reply to OOTB packets with an ABORT. The behaviour of LKSCTP is NOT REQUIRED by RFC 4960, but it is allowed.
In my view you are using the NAT instance in a scenario which it doesn't support. So the best way out of the problem is to configure the peers to use only a single address. That way you can operate within what is supported by your configuration. Or, implement an SCTP aware NAT and use that. Then the peers can continue to use multihoming (although there is no use of it), and you still operate within what is supported by that config. If you don't want to do this, disable OOTB handling at all nodes behind the NATs (this also includes sending ICMP, destination unreachable / protocol unreachable). That is a work around to be able to operate in a scenario not supported by your setup.
I understand. However using this kind of NAT enforces some constraints to the SCTP usage when it comes to multihoming
Correct, we need to add some special handling when the remote peer is multihomed.
I think this is exactly the case. It seems you have multiple end-points behind the NAT and they are all sharing the same port number. If talking to the same peer, it would be a port number collision case.
It's not exactly like that. From the network perspective we see only one host, and that host has one or more endpoints. The SCTP host is implemented as many instances of usrsctp for redundancy and scalability reasons. The users of SCTP will also see a single instance of the termination (by means of K8s), thus there will never be a duplicated association, as this is ensured by the load-sharing mechanism when the association itself is created.
Host-B only knows a single IP-address of its peer: A (the public address of the NAT). It is confirmed by the second list entry in Section 5.4. Therefore Host-B is not doing any path verification. Please note that Host-B is free to use B' and B'' as source addresses for all packets it sends towards A right after the handshake. Only A is limited by sending packets only towards B' until the B'' is verified.
What I have seen is that Host-B does path verification, at least LKSCTP does it, as well as other protocol stacks. I think that this is not a bad behavior as the simple knowledge about the existence of the single IP address of a remote single-homed peer doesn't guarantee that a path exists at a given time between all the local IP addresses and that remote IP address as the network may be designed in a way that makes path redundancy.
In my view you are using the NAT instance in a scenario which it doesn't support. So the best way out of the problem is to configure the peers to use only a single address. That way you can operate within what is supported by your configuration. Or, implement an SCTP aware NAT and use that. Then the peers can continue to use multihoming (although there is no use of it), and you still operate within what is supported by that config. If you don't want to do this, disable OOTB handling at all nodes behind the NATs (this also includes sending ICMP, destination unreachable / protocol unreachable). That is a work around to be able to operate in a scenario not supported by your setup.
Yes, definitely K8s (Linux NAT and conntrack) doesn't support SCTP other than in a very basic way. On the other hand, changing the configuration of the peer is not possible, and offering the SCTP-based service without redundancy is not an option. I see the problems in completely disabling the OOTB handling. I also see that it's permitted in the RFC, but I would like to avoid it as much as possible; that's why the proposal is to disable the response to HB-request only. Would you permit a usrsctp_sysctl_set_sctp_blackhole option tuned for blackholing only HB-request in usrsctp? It's not supposed to solve all possible problems between SCTP and NAT, only the case covered in this discussion. As soon as the draft is approved and NAT implementations, including the conntrack modules, are updated with the results from the draft, we would remove that HB blackhole option.
Host-B only knows a single IP-address of its peer: A (the public address of the NAT). It is confirmed by the second list entry in Section 5.4. Therefore Host-B is not doing any path verification. Please note that Host-B is free to use B' and B'' as source addresses for all packets it sends towards A right after the handshake. Only A is limited by sending packets only towards B' until the B'' is verified.
What I have seen is that Host-B does path verification, at least LKSCTP does it, as well as other protocol stacks. I think that this is not a bad behavior as the simple knowledge about the
Again: Why do you think LKSCTP is doing path verification? It only sees a single peer address and that address is already verified by the handshake. So there are no peer addresses to be verified.
Just to be clear: Path verification is NOT used to test address pairs. It is only about verifying that the addresses reported by the peer actually belong to that peer. So LKSCTP does NOT do path verification. It just seems to send a HEARTBEAT using B'' as the source address. It can send any packet with that address. Only the node behind the NAT sees two addresses of the peer and needs to verify the second address.
So if you want to suppress the OOTB handling, you have to suppress it for all packets, not only for HEARTBEAT chunks.
existence of the single IP address of a remote single-homed peer doesn't guarantee that a path exists at a given time between all the local IP addresses and that remote IP address as the network may be designed in a way that makes path redundancy.
In my view you are using the NAT instance in a scenario which it doesn't support. So the best way out of the problem is to configure the peers to use only a single address. That way you can operate within what is supported by your configuration. Or, implement an SCTP aware NAT and use that. Then the peers can continue to use multihoming (although there is no use of it), and you still operate within what is supported by that config. If you don't want to do this, disable OOTB handling at all nodes behind the NATs (this also includes sending ICMP, destination unreachable / protocol unreachable). That is a work around to be able to operate in a scenario not supported by your setup.
Yes, definitely K8s (Linux NAT and conntrack) doesn't support SCTP other than in a very basic way. On the other hand changing the configuration of the peer is not possible, and offering the SCTP based service without redundancy is not an option. I see the problems in disabling completely the OOTB handling, I also see that it's permitted in the rfc but I would like avoiding as much as possible, that's why the proposal is for disabling the HB-request only. Would you permit an usrsctp_sysctl_set_sctp_blackhole option tuned for blackholing only HB-request in usrSctp? It's not supposed to solve all possible problems between SCTP and NAT but only the case covered in this discussion.
As I explained above, this wouldn't help you. A node running LKSCTP (or another stack) does not need to verify any peer address, and it can use any local address immediately after the association setup. So if you want to use the ABORT suppression method as a workaround, you need to disable OOTB handling for all packets, not only for packets containing HEARTBEAT chunks.
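The difference between suppressing ABORTs for HEARTBEAT only and suppressing them for all OOTB packets can be sketched as a small decision function. This is an illustrative model only: the enum and function names are hypothetical, `OOTB_DROP_HEARTBEAT` stands for the HB-only mode proposed in this thread (which does not exist in usrsctp as far as this discussion shows), and `OOTB_DROP_ALL` stands for what the thread describes for a blackhole value of 2.

```c
#include <assert.h>
#include <stdbool.h>

/* A few chunk types a peer may legitimately send from its second address. */
enum chunk_type { CHUNK_DATA, CHUNK_SACK, CHUNK_HEARTBEAT };

enum ootb_policy {
    OOTB_ABORT_ALL,       /* default: answer these OOTB packets with ABORT */
    OOTB_DROP_HEARTBEAT,  /* hypothetical HB-only suppression */
    OOTB_DROP_ALL         /* suppress ABORT for every OOTB packet */
};

/* Returns true if a replica that received a misrouted packet carrying the
 * given chunk would answer with an ABORT under the given policy. */
static bool ootb_sends_abort(enum ootb_policy policy, enum chunk_type chunk)
{
    switch (policy) {
    case OOTB_ABORT_ALL:
        return true;
    case OOTB_DROP_HEARTBEAT:
        return chunk != CHUNK_HEARTBEAT;
    case OOTB_DROP_ALL:
        return false;
    }
    assert(!"unknown policy");
    return true;
}
```

Under the HB-only policy a misrouted SACK or DATA chunk still triggers an ABORT, which is the point made above: the peer may use its second address for any packet, so only the drop-all policy covers the scenario.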
As soon as the draft is approved and NAT getting updated with the results from the draft, including the conntrack modules, we would remove that HB blackhole option.
Again: Why do you think LKSCTP is doing path verification? It only sees a single peer address and that address is already verified by the handshake. So there are no peer addresses to be verified.
That's the way it behaves. I guess LKSCTP does probe paths and not only addresses. The scenario where the race condition occurs has been observed by instantiating a user-space implementation of SCTP as a service with 2 replicas in K8s, using first the same user-space implementation and then LKSCTP as the dual-homed remote server. The test wasn't run with usrsctp though.
Assuming that there are other SCTP implementations that do not probe the path, but only the address as in 5.4, then the only way forward is to disable OOTB supervision totally, or to force the client installation to probe the whole set of remote IP addresses before the remote peer does any activity, for instance by sending multiple COOKIE ECHO chunks, or by sending a COOKIE ECHO towards the secondary IP address (in our case we have at most 2 IP addresses). Might this be a viable alternative, in line with the RFC?
Again: Why do you think LKSCTP is doing path verification? It only sees a single peer address and that address is already verified by the handshake. So there are no peer addresses to be verified.
That's the way it behaves. I guess LKSCTP does probe paths and not only addresses.
And that is completely valid. The peer can send packets containing arbitrary chunks. That is why ignoring only HEARTBEATs doesn't work in general.
The scenario where race condition occurs has been observed by instantiating a user-space implementation of SCTP as service with 2 replicas in k8s and using the same user space implementation first and LKSCTP then as dual-homed remote server.
I think it should work with a generic compliant SCTP implementation as the peer, not with some version of some specific implementation.
The test wasn't run with usrSctp though.
Assuming that there are other SCTP implementation that do not probe the path, but only the address as in 5.4, then the only way forwards is to disable OOTB supervision totally, or by
Sure, the FreeBSD kernel implementation does it and therefore also the usrsctp stack.
forcing the client installation to probe for the whole set of remote IP addresses before the remote peer does any activity, for instance by sending multiple COOKIE ECHO, or sending COOKIE ECHO towards the secondary IP address (in our case we have at most 2 IP addresses). Might this be a viable alternative, in line with the rfc?
You can't send a COOKIE-ECHO to unconfirmed addresses. If the upper layer on the client side provides both addresses of the peer, one could send a packet containing the INIT chunk to both addresses. Right now we do this only for transmissions. But again: This requires getting the addresses from the upper layer.
Not sure why you want to change the implementation (which you are free to do, it is open source), and not tweak the sysctl parameter?
You can't send a COOKIE-ECHO to unconfirmed addresses. If the upper layer on the client side provides both addresses of the peer, one could send a packet containing the INIT chunk to both addresses. Right now we do this only for transmissions. But again: This requires getting the addresses from the upper layer.
At the end of section 5.4, sending of COOKIE-ECHO to an unconfirmed address is permitted if bundled with an HB-request. In the same section is also stated that probing shall be started when an association moves to the ESTABLISHED state. In order to avoid race condition at NAT, once received INIT-ACK and thus having the knowledge of all the peer's addresses, it could be possible to send COOKIE-ECHO to the secondary address bundled with HB, still the association is not established yet, thus I wonder if it would be seen as a protocol violation.
The reason for (slightly) changing the implementation rather than disabling features is an attempt to find a way that keeps rfc compatibility and at the same time builds the traffic in a way that allows a generic conntrack-based NAT work as wished.
At the end of section 5.4, sending of COOKIE-ECHO to an unconfirmed address is permitted if bundled with an HB-request. In the same section is also stated that probing shall be started when an association moves to the ESTABLISHED state.
That is correct. However, the drawback of sending the COOKIE-ECHO to a different address than the one to which the INIT was sent is that you assume that both paths work when setting up the association.
In order to avoid race condition at NAT, once received INIT-ACK and thus having the knowledge of all the peer's addresses, it could be possible to send COOKIE-ECHO to the secondary address bundled with HB, still the association is not established yet, thus I wonder if it would be seen as a protocol violation.
No, you can do that. But it has the drawback mentioned above...
The reason for (slightly) changing the implementation rather than disabling features is an attempt to find a way that keeps rfc compatibility and at the same time builds the traffic in a way that allows a generic conntrack-based NAT work as wished.
I see.
@tuexen If I would like to contribute with a PR to implement this, what is the procedure? What is required to get it accepted?
What tests would be needed, and how should they be implemented? I can test on Linux, and perhaps FreeBSD if I install it in a KVM, but not Windows. Can your CI be used?
Best Regards, Lars Ekman
Hi @uablrek,
when you open a PR, compile checks are executed automatically.
We have an additional, independent Buildbot CI system (Buildbot Console), which executes multiple runtime checks on different platforms. But it doesn't run on PRs, only after a change has been merged.
If you open a PR, I can trigger our Buildbot system manually and provide you with the results.
Best, Felix
Hi @uablrek,
it might help to agree on what will be implemented first. This is not clear to me... From a code perspective, changes to usrsctp which affect the generic SCTP code will also be committed to FreeBSD, since the sources are kept in sync.
Best regards Michael
it might help to agree on what will be implemented first. This is not clear to me...
I totally agree. To be honest it is not clear to me either :smile:
But I believe I understand the problem: in a K8s environment usrsctp comes into conflict with the Linux kernel SCTP (using conntrack NAT), or even with other usrsctp instances belonging to other tenants in the same cluster?
I will try to sort this out and come up with a concrete suggestion.
Great. Please do not assume that the peer is LKSCTP. It can be any SCTP stack. So I think just not sending ABORTs for OOTB is the appropriate way to handle this. Using
usrsctp_sysctl_set_sctp_blackhole(2);
gives you that...
As I understand the requested change is the one described in https://github.com/sctplab/usrsctp/issues/499#issuecomment-654664246; send a bundled COOKIE-ECHO+HB-request to the "other" address if multiple addresses are in the INIT msg.
The intention is to setup a multihomed NAT way when the SCTP server executes in a K8s POD.
I am not sure that this will work but I am not a SCTP guy but I do know K8s networking.
Anyway here is what I will do; I will fork the usrsctp project (already done) and try to do the requested update but not create a PR immediately. Then we do a PoC; @teiclap builds usrsctp from the fork and verify that it really works in K8s and solves the multihoming/NAT problem.
@weinrank @tuexen To test this I must set up SCTP multihoming. Is there some test program (in "usrsctp/programs/" or elsewhere) that can serve as an example? I read the code in "programs/" but there does not seem to be any program using multihoming.
@uablrek I'm not sure about the architecture... I thought that you run multiple instances of usrsctp behind a NAT acting as a single homed client, all talking to a server which is dual-homed. You are stating that you want to run multiple servers. Can you clarify?
Here is what I think are the drawbacks of the proposed solution:
Regarding the question of examples: tsctp, for example, supports multihoming, although it only supports IPv4.
@tuexen I assumed the server would be in K8s. First because the K8s load-balancing (that uses NAT) is for incoming requests only, and second because otherwise the requirement to respond on the other interface would be on a remote server that may not be usrsctp at all. But as I said I am not an SCTP guy so I might have missed something.
@teiclap Can you please explain the architecture?
I agree with the drawbacks. The feature must be configurable, but if it really enables a resilient SCTP solution in K8s without modifications (like MULTUS and friends) it is worth something. I would however like to see that verified, hence the PoC. The most important property of multihoming in this case, I think, is resilience; scalability comes second. But again I must refer to @teiclap
Thanks for the pointer to tsctp. Sorry I missed it. I only checked for flags like "-second-interface" or something like that. I intend to use the same network setup as I have used for experiments with MPTCP; https://stackoverflow.com/questions/61796994/how-to-use-mptcp-included-in-linux-kernel-5-6-x.
@tuexen I assumed the server would be in K8s. First because the K8s load-balancing (that uses NAT) is for incoming requests only, and second because otherwise the requirement to respond on the other interface would be on a remote server that may not be usrsctp at all. But as I said I am not an SCTP guy so I might have missed something.
Please note that the client sends a packet with the INIT chunk, the server responds with a packet containing an INIT-ACK chunk, and then the client sends a packet with the COOKIE-ECHO chunk. Since you want to change the sending of the COOKIE-ECHO chunk, you want to change the client side.
My understanding of the problem is that one single-homed client behind a NAT sends an INIT to an external server having two addresses. The server responds with an INIT-ACK and lists its addresses. The client sends a COOKIE-ECHO to the server, which responds with a COOKIE-ACK. All this uses only a single address of each side. Now the client will send a HEARTBEAT to the other address of the server to perform address verification.
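The client side of the handshake just described can be sketched as a tiny state machine. The state names follow the usual SCTP association states; everything else (`client_event`, `client_step`, `client_run`) is a hypothetical illustration, not usrsctp code.

```c
#include <assert.h>
#include <stddef.h>

enum client_state { CLOSED, COOKIE_WAIT, COOKIE_ECHOED, ESTABLISHED };

enum client_event {
    EV_ASSOCIATE,   /* upper layer asks to connect: client sends INIT */
    EV_INIT_ACK,    /* server answered and listed its addresses */
    EV_COOKIE_ACK   /* server accepted the COOKIE-ECHO */
};

/* Advance the client by one event; unexpected events are ignored in this
 * toy model. */
static enum client_state client_step(enum client_state s, enum client_event e)
{
    if (s == CLOSED && e == EV_ASSOCIATE)
        return COOKIE_WAIT;      /* INIT sent to the server's first address */
    if (s == COOKIE_WAIT && e == EV_INIT_ACK)
        return COOKIE_ECHOED;    /* COOKIE-ECHO sent, second address now known */
    if (s == COOKIE_ECHOED && e == EV_COOKIE_ACK)
        return ESTABLISHED;      /* verification HEARTBEATs may start now */
    return s;
}

/* Feed a sequence of events to a fresh client and return the final state. */
static enum client_state client_run(const enum client_event *events, size_t n)
{
    assert(events != NULL || n == 0);
    enum client_state s = CLOSED;
    for (size_t i = 0; i < n; i++)
        s = client_step(s, events[i]);
    return s;
}
```

The point relevant to the race is the last transition: the client only starts probing the server's other address once it is in ESTABLISHED, while the server may already be sending from that address.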
Now it seems that LKSCTP, for whatever reason, sends a HB from its other address to the client. This is allowed, but implementation specific. Actually it can send any packet anytime using these addresses.
The race is at the NAT: If the verification HEARTBEAT is seen first, everything is fine. However, if the packet from the server is seen first, it gets delivered to one of the clients (this is the loadsharing feature). If the client selected is not the right one, it will send an ABORT and the association is dead.
The simplest solution would be to avoid sending the ABORT for OOTB messages. This is already implemented and could be activated.
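The race described above can be sketched with a toy NAT model. This is only an illustration under simplifying assumptions: the table is keyed on the remote address alone, unmatched packets are handed out round-robin, and all names (nat_outbound, nat_inbound, association_survives) are hypothetical, not part of usrsctp or Linux conntrack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 8

/* One NAT mapping: packets from remote_addr belong to replica `owner`. */
struct nat_entry { uint32_t remote_addr; int owner; };

struct nat {
    struct nat_entry entries[MAX_ENTRIES];
    size_t count;
    int replicas;      /* number of SCTP stack instances behind the NAT */
    int next_replica;  /* assumed round-robin cursor for unmatched packets */
};

/* An outgoing packet from `owner` to `remote_addr` creates a mapping. */
static void nat_outbound(struct nat *n, int owner, uint32_t remote_addr)
{
    assert(n != NULL);
    for (size_t i = 0; i < n->count; i++)
        if (n->entries[i].remote_addr == remote_addr)
            return; /* mapping already present */
    if (n->count < MAX_ENTRIES)
        n->entries[n->count++] = (struct nat_entry){ remote_addr, owner };
}

/* An incoming packet goes to the mapped owner, else to an arbitrary replica. */
static int nat_inbound(struct nat *n, uint32_t remote_addr)
{
    assert(n != NULL && n->replicas > 0);
    for (size_t i = 0; i < n->count; i++)
        if (n->entries[i].remote_addr == remote_addr)
            return n->entries[i].owner;
    int r = n->next_replica;
    n->next_replica = (r + 1) % n->replicas;
    return r;
}

/*
 * Returns true if the server's packet from its secondary address reaches
 * the replica that owns the association. If it reaches another replica,
 * that replica treats it as OOTB and answers with an ABORT.
 */
static bool association_survives(bool client_hb_first, int owner,
                                 int replicas, int first_pick)
{
    const uint32_t primary = 0x0a000001, secondary = 0x0a000002;
    struct nat n = { .count = 0, .replicas = replicas,
                     .next_replica = first_pick };

    nat_outbound(&n, owner, primary);        /* INIT / COOKIE-ECHO */
    if (client_hb_first)
        nat_outbound(&n, owner, secondary);  /* verification HEARTBEAT */
    return nat_inbound(&n, secondary) == owner;
}
```

With three replicas, association_survives(true, 0, 3, 1) holds, while association_survives(false, 0, 3, 1) does not: the outcome depends only on which packet the NAT sees first, unless the arbitrary pick happens to hit the right replica.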
@teiclap Can you please explain the architecture?
I agree with the drawbacks. The feature must be configurable, but if it really enables a resilient SCTP solution in K8s without modifications (like MULTUS and friends) it is worth something. I would however like to see that verified, hence the PoC. The most important property of multi-homing in this case I think is resilience, scalability comes second. But again I must refer to @teiclap
My suggestion was: Just disable the OOTB handling.
Thanks for the pointer to tsctp. Sorry I missed it. I only checked for flags like "-second-interface" or something like that. I intend to use the same network setup as I have used for experiments with MPTCP; https://stackoverflow.com/questions/61796994/how-to-use-mptcp-included-in-linux-kernel-5-6-x.
If you don't locally bind to a specific address, SCTP uses all applicable addresses.
@tuexen The architecture is as you described: there are a number of instances of the SCTP stack sharing the same SCTP EP behind a NAT. It is seen from the external world as a single SCTP stack. The use case is the stack being a pure client that is connected to multiple servers, and those servers are dual-homed.
Here is what I think are the drawbacks of the proposed solution:
A connection can only be established if both IP addresses are reachable. Normally you want to avoid this.
Yes, this is correct: if the COOKIE-ECHO request times out, it shall be retried towards the source IP address used by the peer for the INIT-ACK. At this point the NAT table has already been adjusted, so the Association setup can continue as in the legacy case.
It does not scale if the server uses more than two addresses.
This is also correct, but in the use case the server has at most two addresses.
The solution seems to try to work around an implementation specific feature you are observing with one implementation.
That is true, we need a solution for the shared SCTP stack in Kubernetes networking based on NAT, running on Linux and using iptables and the sctp_conntrack version of the Linux kernel.
The only advantage of the proposed behavior is that it covers the use case while remaining fully RFC 4960 compliant. I think that the OOTB handling in the RFC is valid and disabling it fully should be avoided.
Thinking about it...
The problem is that the NAT instance forwards incoming packets for which it has no entry in its tables to an arbitrary internal endpoint. I understand that this is used for load sharing. But this should only be applied to packets which contain an INIT chunk. So can't you just limit the forwarding to packets which contain an INIT chunk and drop all other packets at the NAT instance? Doing loadsharing for packets which are not used for connection setup doesn't make sense to me.
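A minimal sketch of such a forwarding policy, assuming the NAT can look at the first chunk of the SCTP packet (an INIT chunk has type 1 and must be the only chunk in its packet). The verdict names and nat_classify are hypothetical and do not correspond to any existing conntrack or iptables interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { SCTP_COMMON_HDR_LEN = 12, SCTP_CHUNK_INIT = 1 };

enum nat_verdict {
    NAT_FORWARD_TO_OWNER, /* table hit: deliver to the mapped endpoint */
    NAT_LOAD_SHARE,       /* table miss + INIT: pick an internal endpoint */
    NAT_DROP              /* table miss + anything else: silently drop */
};

/*
 * `pkt` points at the SCTP common header (after the IP header).
 * Only the type byte of the first chunk is inspected.
 */
static enum nat_verdict nat_classify(bool has_table_entry,
                                     const uint8_t *pkt, size_t len)
{
    assert(pkt != NULL || len == 0);
    if (has_table_entry)
        return NAT_FORWARD_TO_OWNER;
    if (len > SCTP_COMMON_HDR_LEN &&
        pkt[SCTP_COMMON_HDR_LEN] == SCTP_CHUNK_INIT)
        return NAT_LOAD_SHARE;
    return NAT_DROP;
}
```

Under this policy a HEARTBEAT (chunk type 4) arriving from an unknown source address would be dropped at the NAT instead of reaching a replica that answers with an ABORT.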
The only advantage of the proposed behavior is that it covers the use case being still full rfc4960 compliant. I think that the OOTB handling in the rfc is valid and disabling it fully should be avoided.
But you are compliant by not sending an ABORT. RFC 4960 says:
8) The receiver should respond to the sender of the OOTB packet with
an ABORT.
In RFC 4960bis this will be clarified as
8) The receiver SHOULD respond to the sender of the OOTB packet with
an ABORT.
There is no MUST. The SHOULD is intentional. It was meant for security purposes, but it is valid to use it for other arguments. As long as you can argue why you are not sending an ABORT everything is fine.
The problem is that the NAT instance forwards incoming packets for which it has no entry in its tables to an arbitrary internal endpoint. I understand that this is used for load sharing. But this should only be applied to packets which contain an INIT chunk. So can't you just limit the forwarding to packets which contain an INIT chunk and drop all other packets at the NAT instance? Doing loadsharing for packets which are not used for connection setup doesn't make sense to me.
No, K8s only load-balances incoming traffic to virtual load-balancer IPs (or to NodePorts, which are not relevant now). Since the POD in K8s is acting as a client and making an outgoing request, K8s (and its load-balancing) is not involved at all.
But there is a NAT from the internal POD address (usually some 192.168.x.x) to the K8s node (host) address where the POD is executing. Return packets from the server would have the K8s node address as destination and will be "re-NATed" and forwarded to the POD address if a conntrack entry exists. The problem would be if the server sends a packet with its "other" address as source. There is no conntrack entry and Linux will assume that the packet is really for the node itself (which is the dest), but it will not be load-balanced.
@tuexen Thanks a lot for the explanation. I am sorry you had to explain SCTP fundamentals to me but I am grateful that you had the patience to do so.
The problem is that the NAT instance forwards incoming packets for which it has no entry in its tables to an arbitrary internal endpoint. I understand that this is used for load sharing. But this should only be applied to packets which contain an INIT chunk. So can't you just limit the forwarding to packets which contain an INIT chunk and drop all other packets at the NAT instance? Doing loadsharing for packets which are not used for connection setup doesn't make sense to me.
No, K8s only load-balances incoming traffic to virtual load-balancer IPs (or to NodePorts, which are not relevant now). Since the POD in K8s is acting as a client and making an outgoing request, K8s (and its load-balancing) is not involved at all.
But there is a NAT from the internal POD address (usually some 192.168.x.x) to the K8s node (host) address where the POD is executing. Return packets from the server would have the K8s node address as destination and will be "re-NATed" and forwarded to the POD address if a conntrack entry exists. The problem would be if the server sends a packet with its "other" address as source. There is no conntrack entry and Linux will assume that the packet is really for the node itself (which is the dest), but it will not be load-balanced.
Thanks for the clarification. So the K8s node is responding... Does the node need to handle SCTP packets? If not, just don't load the SCTP module and make sure it does not send ICMP/ICMPv6 packets indicating that it does not support SCTP. Is that an option?
@tuexen Thanks a lot for the explanation. I am sorry you had to explain SCTP fundamentals to me but I am grateful that you had the patience to do so.
You are welcome...
The only advantage of the proposed behavior is that it covers the use case being still full rfc4960 compliant. I think that the OOTB handling in the rfc is valid and disabling it fully should be avoided.
But you are compliant by not sending an ABORT. RFC 4960 says: 8) The receiver should respond to the sender of the OOTB packet with an ABORT.
In RFC 4960bis this will be clarified as 8) The receiver SHOULD respond to the sender of the OOTB packet with an ABORT.
There is no MUST. The SHOULD is intentional. It was meant for security purposes, but it is valid to use it for other arguments. As long as you can argue why you are not sending an ABORT everything is fine.
@tuexen I see your point. Fully disabling OOTB does the job and requires no modification. I am not sure what security issues can be created by fully disabling OOTB. May you help with that?
Thanks.
Thanks for the clarification. So the K8s node is responding... Does the node need to handle SCTP packets? If not, just don't load the SCTP module and make sure it does not send ICMP/ICMPv6 packets indicating that it does not support SCTP. Is that an option?
Yes, actually I think that is the current solution. When sctp support was introduced in K8s we (Ericsson) requested that the module should not be "auto-loaded" by K8s. I was not involved directly so I don't have the details but there is a long-running issue in K8s somewhere. (the module is not auto-loaded btw)
I think the problem now is that other tenants are using LKSCTP in the same cluster.
Comments about LKSCTP in K8s in; https://github.com/kubernetes/kubernetes/pull/64973
The only advantage of the proposed behavior is that it covers the use case being still full rfc4960 compliant. I think that the OOTB handling in the rfc is valid and disabling it fully should be avoided. But you are compliant by not sending an ABORT. RFC 4960 says: 8) The receiver should respond to the sender of the OOTB packet with an ABORT. In RFC 4960bis this will be clarified as 8) The receiver SHOULD respond to the sender of the OOTB packet with an ABORT. There is no MUST. The SHOULD is intentionally. It was meant for security purposes, but it is valid to use it for other arguments. As long as you can argue why you are not sending an ABORT everything is fine.
@tuexen I see your point. Fully disabling OOTB does the job and requires no modification. I am not sure what security issues can be created by fully disabling OOTB. May you help with that?
Here is the reason why you might not want to respond with an ABORT: If you reply with an ABORT in response to an INIT, you allow an attacker to get an instant indication that the port is not listening. So you can do a fast portscan. If you don't, the attacker doesn't know how long to wait and has to deal with the possibility of packet loss. Some people even prefer that an end-point is in a "stealth-mode", only responding when it helps in communications it actually wants to do. This is served by not responding to OOTB packets. That explains why you have the different settings for usrsctp_sysctl_set_sctp_blackhole. FreeBSD has similar settings for TCP and SCTP (see man blackhole).
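The different blackhole settings can be pictured as a small decision function. This is a sketch of my understanding of the three levels (0: always answer with an ABORT, 1: stay silent only for OOTB packets containing an INIT chunk, 2: never answer); the meaning of level 1 is an assumption taken from the FreeBSD blackhole description rather than from this thread. It only models the final "send an ABORT" rule for OOTB packets, not the earlier special cases of RFC 4960 section 8.4, and ootb_sends_abort is a hypothetical name, not a usrsctp function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { CHUNK_INIT = 1, CHUNK_HEARTBEAT = 4 };

/*
 * Decide whether an out-of-the-blue packet whose first chunk has type
 * `chunk_type` is answered with an ABORT, for a given blackhole level.
 */
static bool ootb_sends_abort(unsigned blackhole, uint8_t chunk_type)
{
    assert(blackhole <= 2);
    if (blackhole == 0)
        return true;                     /* default: always ABORT */
    if (blackhole == 1)
        return chunk_type != CHUNK_INIT; /* hide closed ports from INIT scans */
    return false;                        /* level 2: full stealth mode */
}
```

For the K8s replica case only level 2 helps, since the stray packet is a HEARTBEAT and ootb_sends_abort(1, CHUNK_HEARTBEAT) is still true. In usrsctp that level is what usrsctp_sysctl_set_sctp_blackhole(2) selects.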
If a host is not responding to OOTB packet with an ABORT:
Does this help?
Thanks.
Thanks for the clarification. So the K8s node is responding... Does the node need to handle SCTP packets? If not, just don't load the SCTP module and make sure it does not send ICMP/ICMPv6 packets indicating that it does not support SCTP. Is that an option?
Yes, actually I think that is the current solution. When sctp support was introduced in K8s we (Ericsson) requested that the module should not be "auto-loaded" by K8s. I was not involved directly so I don't have the details but there is a long-running issue in K8s somewhere. (the module is not auto-loaded btw)
I think the problem now is that other tenants are using LKSCTP in the same cluster.
Let me learn something here: You can load the SCTP on the K8s host and then you can use it in a container. Right? Can you still use a userland stack in another container?
You are saying that some containers use the kernel stack and therefore the module is loaded. Couldn't you add a rule to the host that it drops outgoing packets which contain an ABORT chunk and have the T-bit set? That would drop ABORTs, which are sent in response to OOTB packets which do not contain INIT-chunks.
That would mean that the containers using the kernel SCTP stack have the same problem.
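The match such a host rule would need can be sketched as follows: the ABORT chunk has type 6, and the T-bit is the lowest bit of the chunk flags. The function name is hypothetical, and the sketch only inspects the first chunk of the packet, assuming the ABORT sent in response to an OOTB packet is not bundled behind other control chunks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
    SCTP_COMMON_HDR_LEN = 12,   /* ports, verification tag, checksum */
    SCTP_CHUNK_HDR_LEN  = 4,    /* type, flags, length */
    SCTP_CHUNK_ABORT    = 6,
    SCTP_ABORT_T_BIT    = 0x01
};

/*
 * `sctp` points at the SCTP common header of an outgoing packet.
 * Returns true if the first chunk is an ABORT with the T-bit set,
 * i.e. the kind of ABORT a host sends in response to an OOTB packet.
 */
static bool is_ootb_abort(const uint8_t *sctp, size_t len)
{
    assert(sctp != NULL || len == 0);
    if (len < SCTP_COMMON_HDR_LEN + SCTP_CHUNK_HDR_LEN)
        return false; /* too short to carry a chunk header */
    const uint8_t type  = sctp[SCTP_COMMON_HDR_LEN];
    const uint8_t flags = sctp[SCTP_COMMON_HDR_LEN + 1];
    return type == SCTP_CHUNK_ABORT && (flags & SCTP_ABORT_T_BIT) != 0;
}
```

An ABORT without the T-bit (sent for an association the host actually knows) is not matched, so regular association teardown would be unaffected by a rule built on this test.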
Sure. If you reply with an ABORT in response to an INIT, you allow an attacker to get an instant indication that the port is not listening. So you can do a fast portscan. If you don't, the attacker doesn't know how long to wait and has to deal with the possibility of packet loss. Some people even prefer that an end-point is in a "stealth-mode", only responding when it helps in communications it actually wants to do. This is served by not responding to OOTB packets. That explains why you have the different settings for usrsctp_sysctl_set_sctp_blackhole. FreeBSD has similar settings for TCP and SCTP (see man blackhole).
Thanks.
Then we come into something difficult. The RFC suggests, as the preferred implementation, that OOTB packets shall generate an ABORT, yet we see that sending an ABORT would in some way help a malicious attacker. Having the stealth mode as the default would help a lot in cases where the SCTP Endpoint is shared among instances of the protocol stack, especially in the cloud paradigm. If the new rfc4960bis changed the preferred OOTB handling to stealth mode, it would also be much easier to have a number of SCTP protocol stacks from different implementations within the same cloud-based environment. Is there any chance to get it?
Sure. If you reply with an ABORT in response to an INIT, you allow an attacker to get an instant indication that the port is not listening. So you can do a fast portscan. If you don't, the attacker doesn't know how long to wait and has to deal with the possibility of packet loss. Some people even prefer that an end-point is in a "stealth-mode", only responding when it helps in communications it actually wants to do. This is served by not responding to OOTB packets. That explains why you have the different settings for usrsctp_sysctl_set_sctp_blackhole. FreeBSD has similar settings for TCP and SCTP (see man blackhole). Thanks.
Then we come into something difficult. The rfc suggests as preferred implementation that OOTB shall generate an ABORT, we see that the case of sending an ABORT would in some way help the malicious attacker.
As it does if you are not sending ABORTs.
Having the stealth-mode as default would help a lot cases where the SCTP Endpoint is shared among instances of the protocol stack, especially in the Cloud paradigma.
I would argue that if you want to deploy something which involves a NAT, use an appropriate one. The problem you are facing is, in my view, a consequence of the implementation you are using. So why are packets delivered to the K8s host, which are not sent to it?
Should the new rfc4960bis change the OOTB preferred handling into stealth-mode, would make it much easier also having a number of SCTP protocols stacks from different implementation within the same Cloud based environment.
I would say no. The SHOULD instead of a MUST allows you to do what you want. The FreeBSD implementation even supports the stealth mode. Since you are using Linux, you would need to implement this, if it doesn't support this yet. But it is fairly easy, I did it 8 years ago for FreeBSD in r229805. You just need to change a sysctl variable on boot (which is a single line in /etc/sysctl.conf, also on Linux) and you get what you want and you are compliant with the spec.
Is there any chance to get it?
You need to discuss this on tsvwg@ietf.org.
I normally do not comment on this list but I will state right now that I would be strongly against changing the SHOULD to a MUST. The current wording allows for discretion on the part of the implementation as well as the application. Binding it to a MUST forces the developer and where SCTP would be applied to do that.
TCP has a similar “stealth” mode and it too is in the spec that way on purpose!
R
On Aug 25, 2020, at 10:12 AM, Michael Tüxen notifications@github.com wrote:
Sure. If you reply with an ABORT in response to an INIT, you allow an attacker to get an instant indication that the port is not listening. So you can do a fast portscan. If you don't, the attacker doesn't know how long to wait and has to deal with the possibility of packet loss. Some people even prefer that an end-point is in a "stealth-mode", only responding when it helps in communications it actually wants to do. This is served by not responding to OOTB packets. That explains why you have the different settings for usrsctp_sysctl_set_sctp_blackhole. FreeBSD has similar settings for TCP and SCTP (see man blackhole). Thanks.
Then we come into something difficult. The rfc suggests as preferred implementation that OOTB shall generate an ABORT, we see that the case of sending an ABORT would in some way help the malicious attacker.
As it does if you are not sending ABORTs.
Having the stealth-mode as default would help a lot cases where the SCTP Endpoint is shared among instances of the protocol stack, especially in the Cloud paradigma.
I would argue that if you want to deploy something which involves a NAT, use an appropriate one. The problem you are facing is, in my view, a consequence of the implementation you are using. So why are packets delivered to the K8s host, which are not sent to it?
Should the new rfc4960bis change the OOTB preferred handling into stealth-mode, would make it much easier also having a number of SCTP protocols stacks from different implementation within the same Cloud based environment.
I would say no. The SHOULD instead of a MUST allows you to do what you want. The FreeBSD implementation even supports the stealth mode. Since you are using Linux, you would need to implement this, if it doesn't support this yet. But it is fairly easy, I did it 8 years ago for FreeBSD in r229805. You just need to change a sysctl variable on boot (which is a single line in /etc/sysctl.conf, also on Linux) and you get what you want and you are compliant with the spec.
Is there any chance to get it?
You need to discuss this on tsvwg@ietf.org.
Randall Stewart rrs@netflix.com
Let me learn something here: You can load the SCTP on the K8s host and then you can use it in a container. Right? Can you still use a userland stack in another container?
Yes, I assume so, but I have not tested it personally; I guess @teiclap has. The container has its own network namespace and a program should be able to open a raw socket inside it.
You are saying that some containers use the kernel stack and therefore the module is loaded. Couldn't you add a rule to the host that it drops outgoing packets which contain an ABORT chunk and have the T-bit set? That would drop ABORTs, which are sent in response to OOTB packets which do not contain INIT-chunks.
I guess so. There is some competition of "iptables" in K8s, both K8s and various CNI-plugins add and "sync" iptables rules rather uncontrollably so it may be so that our added rules are removed by others. But this is certainly a preferable option. I will check it.
That would mean that the containers using the kernel SCTP stack have the same problem.
Yes, I think they have. But as far as we know, no users of LKSCTP in K8s are using multi-homing or have any plans to do so.
I normally do not comment on this list but I will state right now that I would be strongly against changing the SHOULD to a MUST. The current wording allows for discretion on the part of the implementation as well as the application.
I think @teiclap wants to change "SHOULD send an ABORT" to "SHOULD NOT send an ABORT". The default should be the "stealth mode" and the current default behaviour should be allowed.
But as you can read from my answers above, I would also be against that change. At least for the reasoning given in this discussion (working around a limitation of some middle box implementation).
Binding it to a MUST forces the developer and where SCTP would be applied to do that. TCP has a similar “stealth” mode and it too is in the spec that way on purpose! R …
Let me learn something here: You can load the SCTP on the K8s host and then you can use it in a container. Right? Can you still use a userland stack in another container?
Yes, I assume so, but I have not tested it personally; I guess @teiclap has. The container has its own network namespace and a program should be able to open a raw socket inside it.
Normally (without docker or such containers) the problem is not opening a raw socket, but receiving packets on it. If there is a kernel stack for that protocol, packets are delivered to the kernel stack, not to the raw socket. This applies normally to UDP, TCP, and SCTP (and other transport protocols).
You are saying that some containers use the kernel stack and therefore the module is loaded. Couldn't you add a rule to the host that it drops outgoing packets which contain an ABORT chunk and have the T-bit set? That would drop ABORTs, which are sent in response to OOTB packets which do not contain INIT-chunks.
I guess so. There is some competition of "iptables" in K8s, both K8s and various CNI-plugins add and "sync" iptables rules rather uncontrollably so it may be so that our added rules are removed by others. But this is certainly a preferable option. I will check it.
That would mean that the containers using the kernel SCTP stack have the same problem.
Yes, I think they have. But as far as we know, no users of LKSCTP in K8s are using multi-homing or have any plans to do so.
Please note that it is the peer that is multihomed, not the endpoint running on K8s or in its containers.
I normally do not comment on this list but I will state right now that I would be strongly against changing the SHOULD to a MUST. The current wording allows for discretion on the part of the implementation as well as the application.
I think @teiclap wants to change "SHOULD send an ABORT" to "SHOULD NOT send an ABORT". The default should be the "stealth mode" and the current default behaviour should be allowed. But as you can read from my answers above, I would also be against that change. At least for the reasoning given in this discussion (working around a limitation of some middle box implementation).
Binding it to a MUST forces the developer and where SCTP would be applied to do that. TCP has a similar “stealth” mode and it too is in the spec that way on purpose! R …
@tuexen I didn't mean to force any change in the protocol, but rather to try to influence the implementors in some way. What you have done with your contribution at r229805 is to move a choice from the implementor dimension to the user dimension. For sure, with that change FreeBSD implements the concept of "SHOULD" in the most flexible way: the implementation allows the adopter to choose whether to abort on OOTB packets or not. In an implementor's guide, I'd suggest that whenever an option is stated as SHOULD, the implementor SHOULD provide the user with a way to decide (via API, parameters, sysctl...). This may be a note somewhere in the RFC.
I normally do not comment on this list but I will state right now that I would be strongly against changing the SHOULD to a MUST. The current wording allows for discretion on the part of the implementation as well as the application. I think @teiclap wants to change "SHOULD send an ABORT" to "SHOULD NOT send an ABORT". The default should be the "stealth mode" and the current default behaviour should be allowed. But as you can read from my answers above, I would also be against that change. At least for the reasoning given in this discussion (working around a limitation of some middle box implementation). Binding it to a MUST forces the developer and where SCTP would be applied to do that. TCP has a similar “stealth” mode and it too is in the spec that way on purpose! R …
@tuexen I didn't mean to force any change in the protocol, rather try to influence the implementors in some ways. What you have done with your contribution at r229805 is moving a choice from the implementor dimension to the user dimension. For sure with that change FreeBSD implements the concept of "SHOULD" in the most flexible way: the implementation allows the adopter to choose whether aborting OOTB or not. In an implementors guide, I'd suggest that whenever an option is stated as SHOULD, the implementor SHOULD provide the user a way for deciding about (via API, parameters, sysctl...). This may be a note somewhere in the rfc.
That is up to the implementer and allows for some competition in implementations... If you want this to be changed in the Linux stack, you need to talk to the implementers of that implementation and convince them that it is a good feature: linux-sctp@vger.kernel.org.
When deploying usrSctp in a K8s Pod with replicas, thus distributing an SCTP Endpoint among independent instances of the SCTP stack behind a NAT, strict compliance with RFC 4960 section 8.4 part 8 can cause an Association to be aborted wrongly. The case is: the SCTP Client in the K8s Pod is single-homed; the SCTP Server is remote and multihomed. The SCTP Client sends an INIT to the primary IP address of the remote server via the NAT, and the NAT creates an entry in its translation table mapping the Client to the primary address of the server. Once the association is up, the remote Server sends an HB-request to the Client from a secondary IP address. Since the NAT doesn't know the secondary IP address, it randomly chooses an instance of the SCTP Client among the available replicas. When it selects a Client other than the one that originated the Association, the HB-request reaches an instance of the SCTP stack that doesn't know about the Association and therefore replies with an ABORT to the remote Server. The remote Server then closes the Association.
Solution is to move HB-request handling from rfc4960 section 8.4 part 8 (reply with ABORT) to part 7 (silently discard). The completion of the multihomed Association happens as soon as the Client will send HB-request towards the secondary address(es), thus enabling NAT with the proper information.
May you please consider that change in usrSctp? (possibly under a selectable option).
Thanks, Claudio Porfiri