sctplab / usrsctp

A portable SCTP userland stack

Provide alternatives to raw sockets #631

Open uablrek opened 3 years ago

uablrek commented 3 years ago

Raw sockets clone packets, but the packets are still processed by the kernel. This is illustrated by, for instance, https://github.com/sctplab/usrsctp/issues/603. Alternatives exist that may be better in many cases, for example:

  • tun/tap devices. According to Wikipedia they exist on Mac/Linux/some BSD/Windows and Solaris.
  • XDP, which is Linux only but probably the best to use on Linux.

I would like a way to use any mechanism to deliver packets to user-space, perhaps with some plugin.
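For the XDP option, a minimal sketch of the idea (assuming Linux with libbpf; the program and map names are illustrative, and an AF_XDP socket registered by the userland stack is assumed) could look like this:

/* Steer IPv4/SCTP frames to an AF_XDP socket before the kernel
 * SCTP stack can process them.  Untested illustration. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(max_entries, 64);
        __type(key, __u32);
        __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int sctp_to_userland(struct xdp_md *ctx)
{
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        struct iphdr *ip;

        if ((void *)(eth + 1) > data_end)
                return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
                return XDP_PASS;
        ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
                return XDP_PASS;
        if (ip->protocol != 132)        /* IPPROTO_SCTP */
                return XDP_PASS;
        /* Redirect to the AF_XDP socket bound to this RX queue;
         * fall back to the normal kernel path if none is registered. */
        return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}

char _license[] SEC("license") = "GPL";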

tuexen commented 3 years ago

I would suggest not running more than one SCTP stack on a host. I do not see why you would want to use a userland stack and a kernel stack on the same host... If you have more than one, you need some demultiplexer. The simplest way to do this is to use UDP encapsulation as defined in RFC 6951.
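With usrsctp, for example, this amounts to passing a local UDP port to usrsctp_init() and pointing the socket at the peer's encapsulation port. A minimal sketch (9899 is the IANA-registered sctp-over-udp port):

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <usrsctp.h>

int
main(void)
{
        struct socket *sock;
        struct sctp_udpencaps encaps;

        /* Open the local UDP encapsulation socket (RFC 6951). */
        usrsctp_init(9899, NULL, NULL);
        sock = usrsctp_socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP,
                              NULL, NULL, 0, NULL);
        if (sock == NULL)
                return 1;
        /* Send to the peer's UDP encapsulation port as well. */
        memset(&encaps, 0, sizeof(encaps));
        encaps.sue_address.ss_family = AF_INET;
        encaps.sue_port = htons(9899);
        usrsctp_setsockopt(sock, IPPROTO_SCTP, SCTP_REMOTE_UDP_ENCAPS_PORT,
                           &encaps, (socklen_t)sizeof(encaps));
        /* ... bind/connect and use the socket as usual ... */
        usrsctp_close(sock);
        usrsctp_finish();
        return 0;
}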

uablrek commented 3 years ago

There is kind of a "trend" towards multi-tenant clusters (Kubernetes), and then it is hard to impose a requirement that the sctp module must not be loaded. Then again, the multi-tenant telco clusters are not really here yet.

tuexen commented 3 years ago

I forgot to mention that @weinrank did some experiments with multistack, which allows running userland stacks in parallel to kernel stacks...

Can't you run your own network stack in a container? (I'm not familiar with Kubernetes, but when using jails in FreeBSD, that is possible).

uablrek commented 3 years ago

Can't you run your own network stack in a container?

Once the kernel module is loaded, lksctp becomes active/available in containers, same as for UDP/TCP.

But... it may be possible to create a DROP rule for sctp in iptables so the kernel does not process packets, and with luck the raw socket may work anyway. I have not tried that.

tuexen commented 3 years ago

Hmm. I guess processes are isolated from each other when living in different containers. How is this done for communication endpoints? Are you delegating IP addresses to containers, so that incoming packets to the host OS are delivered to the corresponding container?

Just to understand your use case: you control what is running inside the container, you have only partial control over the host OS, and you want to communicate with a peer running SCTP/IPv4 and/or SCTP/IPv6, but you don't have control over the peer.

I'm trying to understand the constraints the endpoint has to live in...

uablrek commented 3 years ago

Here is my test setup. I am not using K8s yet but it should be similar (IPv6 tested but not shown).

[diagram: test setup ("container-2")]

I don't know if we can control the peer, but I don't think I can assume it in general.

tuexen commented 3 years ago

OK. So I guess the client talks to 192.168.1.1 and 192.168.4.1. I also assume that the client has a route for 192.168.1.1 via its upper interface and a route for 192.168.4.1 via its lower interface.

Do you need to perform some sort of NAT? Can't you use an externally visible IP address exclusively in a container? Having a NAT in the game makes things much more complex... If you are using a NAT, are you just delegating some external addresses to a container or is the granularity finer? If you use IPv6, can you then handle the address within the container?

uablrek commented 3 years ago

The client has routes to the container addresses (10.0.0.x). In the end these are supposed to be virtual IPs (VIPs). No NAT is involved. I think the IPv4 addresses are not necessarily global addresses, since they are used in some telco control network (I do not develop the SCTP apps myself, only the networking).

My assignment is currently to investigate SCTP load-balancing with multihoming. I have so far used lksctp and have only tested the usrsctp test programs, but I know the SCTP developers are using user-space SCTP, so I try to be prepared. Please see

https://github.com/Nordix/nfqueue-loadbalancer/blob/master/sctp.md

uablrek commented 3 years ago

I checked multistack, but it has not been updated since 2015, probably because it has been obsoleted by XDP. I will also check if UDP encapsulation is acceptable for our apps. That may take some time though.

tuexen commented 3 years ago

The client has routes to the container addresses (10.0.0.x). In the end these are supposed to be virtual IPs (VIPs). No NAT is involved. I think the IPv4 addresses are not necessarily global addresses, since they are used in some telco control network (I do not develop the SCTP apps myself, only the networking).

My assignment is currently to investigate SCTP load-balancing with multihoming. I have so far used lksctp and have only tested the usrsctp test programs, but I know the SCTP developers are using user-space SCTP, so I try to be prepared. Please see

https://github.com/Nordix/nfqueue-loadbalancer/blob/master/sctp.md

Is your load balancer stateless or stateful? I'm not sure if the source port number of incoming packets will have enough entropy...

If you can manage some state in the load balancer, you could select a node when the packet containing the INIT chunk arrives and then map the following packets of the same association to that node.

Please note that the verification tag and the initiate-tag are at constant offsets, so it should be possible to access them in a performant way.
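For reference, a sketch of those fixed offsets (per RFC 4960; how the tags are mapped to nodes is a policy decision and not shown here):

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <arpa/inet.h>

/* SCTP common header: ports (2+2), verification tag (4), checksum (4);
 * the first chunk starts at offset 12.  In a packet carrying an INIT
 * chunk, the initiate-tag immediately follows the 4-byte chunk header. */
#define SCTP_VTAG_OFFSET       4
#define SCTP_INIT_ITAG_OFFSET 16

static uint32_t
sctp_get_u32(const uint8_t *sctp_hdr, size_t offset)
{
        uint32_t v;

        memcpy(&v, sctp_hdr + offset, sizeof(v));
        return ntohl(v);
}

/* A packet whose verification tag is 0 carries an INIT chunk:
 *   vtag == 0 -> read the initiate-tag at SCTP_INIT_ITAG_OFFSET
 *   vtag != 0 -> use the verification tag at SCTP_VTAG_OFFSET */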

tuexen commented 3 years ago

Question: What is the relation between doing load balancing and looking for alternatives to raw sockets?

uablrek commented 3 years ago

Question: What is the relation between doing load balancing and looking for alternatives to raw sockets?

My stakeholders are likely using user-space SCTP, so a top question would be, "how?". So I try to be prepared.

I am now extending the load-balancer to support UDP encapsulation. It shouldn't be too hard: just match a UDP port and hash on the SCTP ports within the UDP packet. I hope this will be enough to allow user-space SCTP, and if so I may close this issue.
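A sketch of that hash (assuming the SCTP common header directly follows the UDP header; the modulo is a placeholder for a real hash function):

#include <stdint.h>
#include <stddef.h>

/* The SCTP source and destination ports are the first four bytes of
 * the UDP payload, so a stateless balancer can hash on them directly. */
static unsigned int
sctp_port_hash(const uint8_t *udp_payload, size_t len, unsigned int ntargets)
{
        uint32_t ports;

        if (len < 4 || ntargets == 0)
                return 0;
        ports = ((uint32_t)udp_payload[0] << 24) |
                ((uint32_t)udp_payload[1] << 16) |
                ((uint32_t)udp_payload[2] << 8) |
                 (uint32_t)udp_payload[3];
        return ports % ntargets;        /* placeholder hash */
}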

uablrek commented 3 years ago

Is your load balancer stateless or stateful? I'm not sure if the source port number of incoming packets will have enough entropy...

Stateless. I think the gain in simplicity and LB scalability is more important than entropy. But I will have to confirm that with my stakeholders.

The intention is to use Direct Server Return (DSR) so return traffic will not pass through the LB. But I might have to rethink that, and if so, I will propose to use an SCTP proxy rather than an L3-based load-balancer.

tuexen commented 3 years ago

Question: What is the relation between doing load balancing and looking for alternatives to raw sockets?

My stakeholders are likely using user-space SCTP, so a top question would be, "how?". So I try to be prepared.

But isn't the question of how to run a user-space SCTP stack, possibly in addition to a kernel stack, orthogonal to the question of how to do load balancing?

I am now extending the load-balancer to support UDP encapsulation. It shouldn't be too hard: just match a UDP port and hash on the SCTP ports within the UDP packet. I hope this will be enough to allow user-space SCTP, and if so I may close this issue.

Is the load balancer performing the encapsulation / decapsulation or is it just handling SCTP/UDP? The latter requires the peer also to use SCTP/UDP. The former does not increase the entropy. Are you sure the source port of the packet containing the INIT chunk has enough entropy?

uablrek commented 3 years ago

But isn't the question of how to run a user-space SCTP stack, possibly in addition to a kernel stack, orthogonal to the question of how to do load balancing?

Yes, it is. But it is still in my problem domain. This issue is also not about load-balancing; that is just to give you the context. The problem is really to allow lksctp and user-space SCTP to co-exist on the same machine (which may be a node in a K8s cluster).

uablrek commented 3 years ago

But learning more about UDP encapsulation, I think that is the right way to go. I just hope my stakeholders will agree.

Is the load balancer performing the encapsulation / decapsulation or is it just handling SCTP/UDP?

No, the LB is (and should be) as simple as possible. And yes, this puts a requirement that both the server and the client use UDP encapsulation, but that is also the case without an LB.

tuexen commented 3 years ago

Is your load balancer stateless or stateful? I'm not sure if the source port number of incoming packets will have enough entropy...

Stateless. I think the gain in simplicity and LB scalability is more important than entropy. But I will have to confirm that with my stakeholders.

I agree that simpler is better. I have just seen usages of SCTP where the port number used by clients was also preconfigured, and therefore wouldn't give you any entropy. So please double-check.

The intention is to use Direct Server Return (DSR) so return traffic will not pass through the LB. But I might have to rethink that, and if so, I will propose to use an SCTP proxy rather than an L3-based load-balancer.

tuexen commented 3 years ago

But learning more about UDP encapsulation, I think that is the right way to go. I just hope my stakeholders will agree.

Is the load balancer performing the encapsulation / decapsulation or is it just handling SCTP/UDP?

No, the LB is (and should be) as simple as possible. And yes, this puts a requirement that both the server and the client use UDP encapsulation, but that is also the case without an LB.

Thanks for the clarification. Please make sure that it is acceptable to require that your peers use UDP encapsulation. I guess the relevant specifications would have to mention this. UDP encapsulation is supported by FreeBSD, but was only recently added to the Linux kernel, I think. Not sure about proprietary implementations.
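On the Linux side, assuming a kernel and lksctp-tools headers recent enough to include SCTP-over-UDP support, the remote encapsulation port can be set per socket; the port the kernel itself listens on for encapsulated packets is configured separately (the net.sctp.udp_port sysctl):

#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

/* Sketch: point an lksctp socket at the peer's UDP encapsulation port. */
static int
enable_udp_encaps(int fd, uint16_t remote_port)
{
        struct sctp_udpencaps encaps;

        memset(&encaps, 0, sizeof(encaps));
        encaps.sue_assoc_id = SCTP_FUTURE_ASSOC;
        encaps.sue_port = htons(remote_port);
        return setsockopt(fd, IPPROTO_SCTP, SCTP_REMOTE_UDP_ENCAPS_PORT,
                          &encaps, sizeof(encaps));
}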

tuexen commented 3 years ago

BTW: If you want a stateless load balancer which is not involved in the actual communication, and the IP addresses of the load balancer and the IP addresses of the actual nodes can be different, one could define a simple extension of SCTP to realise this. The client would send a packet containing the INIT chunk to one of the addresses of the load balancer. The load balancer would forward this to the actual node it selected, and that node would respond with a packet containing an INIT-ACK chunk. One would need to define a new parameter, which would be included in the INIT-ACK and contain the IP address the INIT was originally sent to. It is easy to implement, but would require support from the client. So it would need to be specified in an RFC...
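A purely hypothetical sketch of such a parameter's wire format (nothing here is specified anywhere; the type value and layout are invented for illustration):

#include <stdint.h>

/* Hypothetical "original destination address" parameter for the
 * INIT-ACK, following the RFC 4960 TLV parameter format.  A real
 * type value would have to be assigned by IANA. */
struct sctp_orig_dest_param {
        uint16_t type;          /* invented type value */
        uint16_t length;        /* 4 + address length */
        uint8_t  address[16];   /* address the INIT was sent to
                                   (IPv6, or IPv4-mapped) */
} __attribute__((packed));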

uablrek commented 3 years ago

Thanks for your insights and for thinking about the problem. I am not ready for an RFC though. But it led me to think that the IPv6 flow label could be used in the same way. If the client can ensure that the flow label is the same in packets on all paths, it can be used for load-balancing (as it is intended).

Of course, that comes with an IPv6 constraint, but perhaps the time has come for a shift :-)

tuexen commented 3 years ago

That would assume that:

  1. Each endpoint chooses a random flow label.
  2. The flow label is chosen for the association, so it is the same for all paths.
  3. The flow label is not changed during the lifetime of an association.

The first point is implementation specific. Regarding the second point: the flow label should be the same for all packets requiring the same processing in the network. This does not apply to the association, but to a path. Therefore, RFC 6458 allows specifying the flow label on a per-destination-address basis. Regarding the third point: the Linux TCP implementation changes the flow label on timer-based retransmissions. Not that I think this is the right way to do it, nor do I know whether the SCTP stack does the same, but it gives an indication that the flow label might change over time.
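For reference, the RFC 6458 knob mentioned above looks roughly like this (a sketch assuming a stack that implements the flow label fields of struct sctp_paddrparams, e.g. a recent lksctp):

#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

/* Set the IPv6 flow label for one peer address of an association,
 * per RFC 6458 (SCTP_PEER_ADDR_PARAMS with SPP_IPV6_FLOWLABEL). */
static int
set_path_flowlabel(int fd, const struct sockaddr_in6 *peer, uint32_t label)
{
        struct sctp_paddrparams params;

        memset(&params, 0, sizeof(params));
        memcpy(&params.spp_address, peer, sizeof(*peer));
        params.spp_flags = SPP_IPV6_FLOWLABEL;
        params.spp_ipv6_flowlabel = label & 0xfffff;    /* 20-bit field */
        return setsockopt(fd, IPPROTO_SCTP, SCTP_PEER_ADDR_PARAMS,
                          &params, sizeof(params));
}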

In summary: I don't think this will work out.

I have been thinking a bit more about load balancers which are simple and not a single point of failure. The idea described at the end of https://github.com/sctplab/usrsctp/issues/633 seems attractive to me. It requires

  • a small protocol extension supported by the client and the server, and
  • externally visible addresses on the actual nodes,

but the load balancer would only need to deal with incoming INIT chunks. This should be pretty scalable... Would that be an option?

danwinship commented 2 years ago

tun/tap devices. According to Wikipedia they exist on Mac/Linux/some BSD/Windows and Solaris.

So I can see how this would work in the diagram you show, where you are basically creating your own IP just to do SCTP on, but this wouldn't work when you've been assigned an IP by someone else and want to do SCTP on that IP address, right? (e.g., two pods in a Kubernetes cluster, one of which wants to accept lksctp connections on its pod IP, and one which wants to accept usrsctp connections on its pod IP).

  • XDP, which is Linux only but probably the best to use on Linux

I think this would work though.

Can't you run your own network stack in a container?

Once the kernel module is loaded, lksctp becomes active/available in containers, same as for UDP/TCP.

There was some discussion about whether it might be possible to relax that; e.g., add a sysctl to sctp.ko to disable it, which could be set per network namespace.

uablrek commented 2 years ago

but this wouldn't work when you've been assigned an IP by someone else and want to do SCTP on that IP address, right? (e.g., two pods in a Kubernetes cluster, one of which wants to accept lksctp connections on its pod IP, and one which wants to accept usrsctp connections on its pod IP).

No, you are right: tun/tap is hard to combine with load-balancing using NAT as in K8s. But it would work for load-balancing using DSR. I have abandoned the idea of multi-homed SCTP supported by kube-proxy, at least for the foreseeable future, mostly because of the NAT problem.

tuexen commented 2 years ago

but this wouldn't work when you've been assigned an IP by someone else and want to do SCTP on that IP address, right? (e.g., two pods in a Kubernetes cluster, one of which wants to accept lksctp connections on its pod IP, and one which wants to accept usrsctp connections on its pod IP).

No, you are right: tun/tap is hard to combine with load-balancing using NAT as in K8s. But it would work for load-balancing using DSR. I have abandoned the idea of multi-homed SCTP supported by kube-proxy, at least for the foreseeable future, mostly because of the NAT problem.

Asking again: couldn't the idea described at the end of #633 solve your problem? The only drawback is that the nodes in the cluster need to have public addresses, but with IPv6 this should not be a problem.

uablrek commented 2 years ago

The load-balancer in https://github.com/sctplab/usrsctp/issues/633 terminates the association (https://github.com/sctplab/usrsctp/issues/633#issuecomment-903117159):

The load balancer right now terminates the incoming SCTP associations. The peer sending traffic doesn't know the IP addresses of the nodes behind loadbalancer. Let me describe our use case in detail.

That I call a "proxy", and I have proposed it in https://github.com/sctplab/usrsctp/issues/631#issuecomment-896620646:

The intention is to use Direct Server Return (DSR) so return traffic will not pass through the LB. But I might have to rethink that, and if so, I will propose to use an SCTP proxy rather than an L3-based load-balancer.

And there would still be a conflict between lksctp and usrsctp unless UDP encapsulation is used.

There was some discussion about whether it might be possible to relax that; e.g., add a sysctl to sctp.ko to disable it, which could be set per network namespace.

If this is implemented, raw sockets can be used in a pod (container) without conflicts.

IMO UDP encapsulation is the best solution, but the applications claim that the clients can't be forced to use it.

tuexen commented 2 years ago

The load-balancer in #633 terminates the association (#633 (comment)):

The load balancer right now terminates the incoming SCTP associations. The peer sending traffic doesn't know the IP addresses of the nodes behind loadbalancer. Let me describe our use case in detail.

That is not the solution I was proposing at the end. The load balancer would only deal with incoming packets containing INIT chunks. It would not deal with any other packets.

However, client and server would need to support a small extension for INIT-FORWARDING. It could be standardised at the IETF and implemented in the Linux and FreeBSD implementations.

That I call a "proxy", and I have proposed it in #631 (comment):

The intention is to use Direct Server Return (DSR) so return traffic will not pass through the LB. But I might have to rethink that, and if so, I will propose to use an SCTP proxy rather than an L3-based load-balancer.

Using my proposal would not even require the forward traffic (except the initial packet) to pass through the load balancer.

And there would still be a conflict between lksctp and usrsctp unless UDP encapsulation is used.

Not if lksctp supports it. Can't you then use lksctp?

There was some discussion about whether it might be possible to relax that; e.g., add a sysctl to sctp.ko to disable it, which could be set per network namespace.

If this is implemented, raw sockets can be used in a pod (container) without conflicts.

IMO UDP encapsulation is the best solution, but the applications claim that the clients can't be forced to use it.

Not sure why you can't use it. It is implemented in FreeBSD and Linux. It was also not hard to implement it in FreeBSD. So it should be easy to implement in proprietary stacks...

lxin commented 2 years ago

nftables provides a way to discard the packets in the kernel SCTP stack if you're using userland SCTP on Linux.

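# Mark inbound SCTP packets as pkttype "other": the kernel SCTP input
# discards packets not addressed to the host, while raw sockets still
# receive a copy.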
nft add table ip filter
nft add chain ip filter prerouting '{ type filter hook prerouting priority 0 ; policy accept ; }'
nft add rule  ip filter prerouting ip protocol sctp meta pkttype set other

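# The same rule for IPv6.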
nft add table ip6 filter
nft add chain ip6 filter prerouting '{ type filter hook prerouting priority 0 ; policy accept ; }'
nft add rule  ip6 filter prerouting ip6 nexthdr sctp meta pkttype set other

Thanks.