loxilb-io / loxilb

eBPF based cloud-native load-balancer for Kubernetes|Edge|Telco|IoT|XaaS.
https://www.loxilb.io
Apache License 2.0

SCTP traffic not going back to originator in FullNAT mode #239

Closed samassalt closed 1 year ago

samassalt commented 1 year ago

Hi, I have encountered an issue with SCTP traffic.

We are using LoxiLB on an EC2 instance as the SCTP load balancer for our EKS cluster. SCTP traffic arrives at the LB, is routed to the cluster, and the reply comes back to the LB. From there, the reply is not forwarded to the originator. The exact same setup works with TCP traffic (as described here: https://www.loxilb.io/post/loxilb-load-balancer-setup-on-eks).

This is our setup in detail: [image]

The LoxiLB EC2 instance has a single NIC for both the public and private IP (we also tested with a separate NIC). The OS is Debian, and LoxiLB is deployed with Docker. The instance runs in a separate subnet, and all security groups are set up to let SCTP traffic through (validated). In the EKS cluster, kube-loxilb controls the LB settings. The Kubernetes LoadBalancer service is set to FullNAT mode; we tried other modes as well. The EKS cluster uses the standard AWS VPC CNI.

I followed all instructions to make sure the kernel supports SCTP and that the AWS security groups and ACLs are set up to allow SCTP traffic. Direct SCTP traffic to the cluster without LoxiLB works as well (hitting the same endpoints).

This is the traffic flow:

23:04:46.614223 ens5 In IP 3.138.ext.ip.34957 > 10.2.loxi.ip.36412: sctp (1) [INIT] [init tag: 2497271463] [rwnd: 106496] [OS: 10] [MIS: 65535] [init TSN: 2728045056]
23:04:46.614256 ens5 Out IP 10.2.loxi.ip.34957 > 10.2.eks.ip.30901: sctp (1) [INIT] [init tag: 2497271463] [rwnd: 106496] [OS: 10] [MIS: 65535] [init TSN: 2728045056]
23:04:46.614551 ens5 In IP 10.2.eks.ip.30901 > 10.2.loxi.ip.34957: sctp (1) [INIT ACK] [init tag: 398583562] [rwnd: 106496] [OS: 30] [MIS: 10] [init TSN: 2518125576]
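One way to read a capture like the one above is to pair each inbound packet with a later outbound packet carrying the same first chunk type. The following is a minimal Python sketch of that heuristic. It is not part of LoxiLB; the names `parse_line` and `unforwarded` are made up for illustration, and it assumes the one-line tcpdump format shown in this capture.

```python
import re

# The three capture lines from this report, copied as-is (IPs are masked).
CAPTURE = [
    "23:04:46.614223 ens5 In IP 3.138.ext.ip.34957 > 10.2.loxi.ip.36412: sctp (1) [INIT] [init tag: 2497271463]",
    "23:04:46.614256 ens5 Out IP 10.2.loxi.ip.34957 > 10.2.eks.ip.30901: sctp (1) [INIT] [init tag: 2497271463]",
    "23:04:46.614551 ens5 In IP 10.2.eks.ip.30901 > 10.2.loxi.ip.34957: sctp (1) [INIT ACK] [init tag: 398583562]",
]

# Assumed layout: <time> <iface> <In|Out> IP <src>.<port> > <dst>.<port>: sctp (n) [CHUNK] ...
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<iface>\S+)\s+(?P<dir>In|Out)\s+IP\s+"
    r"(?P<src>\S+) > (?P<dst>\S+): sctp \(\d+\) \[(?P<chunk>[A-Z ]+)\]"
)


def parse_line(line):
    """Return a dict describing one capture line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    src_ip, src_port = m.group("src").rsplit(".", 1)
    dst_ip, dst_port = m.group("dst").rsplit(".", 1)
    return {
        "dir": m.group("dir"),
        "src": src_ip,
        "sport": src_port,
        "dst": dst_ip,
        "dport": dst_port,
        "chunk": m.group("chunk"),
    }


def unforwarded(lines):
    """Return inbound packets with no later outbound packet of the same chunk type."""
    pkts = [p for p in map(parse_line, lines) if p is not None]
    missing = []
    for i, pkt in enumerate(pkts):
        if pkt["dir"] != "In":
            continue
        forwarded = any(
            later["dir"] == "Out" and later["chunk"] == pkt["chunk"]
            for later in pkts[i + 1:]
        )
        if not forwarded:
            missing.append(pkt)
    return missing


if __name__ == "__main__":
    for pkt in unforwarded(CAPTURE):
        print(f"{pkt['chunk']} from {pkt['src']} was never sent back out")
```

On this capture the only packet flagged is the INIT ACK from the EKS node, which matches the symptom: the reply reaches the LB but is not sent on to the originator. Matching on chunk type alone is a rough heuristic and would need to be tightened (for example by also comparing ports) on a busy interface.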

Please let me know what we have missed or how to provide you with more information. Thank you in advance for any help!

Other references used:
https://futuredon.medium.com/5g-sctp-loadbalancer-using-loxilb-b525198a9103
https://loxilb-io.github.io/loxilbdocs/kube-loxilb/
https://loxilb-io.github.io/loxilbdocs/run/ (Docker version)

nik-netlox commented 1 year ago

Hi Stephan, thanks for trying LoxiLB! We are looking into your issue. Since we will need more information, could you please connect with us through our members' channel? We will be able to assist you better there.

infinitydon commented 1 year ago

Can you try to create both the loxilb EC2 instance and the worker-nodes in the same subnet?

This is what currently works for me.

My eksctl config:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: loxilb-sctp
  region: us-west-2
  version: "1.24"
iam:
  withOIDC: true
vpc:
  nat:
    gateway: Single
availabilityZones: ["us-west-2a", "us-west-2b"]
managedNodeGroups:
  - name: workers
    instanceType: t3.medium
    ssh:
      allow: true
      publicKeyName: local-key
    desiredCapacity: 3
    availabilityZones: ["us-west-2a"]

I created the loxilb instance in us-west-2a, in the same subnet as the worker nodes.

[ec2-user@ip-172-31-18-255 manifest]$ kubectl -n open5gs get svc | grep sctp
core5g-amf-sctp      LoadBalancer   10.100.57.111    192.168.0.46   31145:32238/SCTP   15m
ubuntu@ip-172-31-7-225:~/UERANSIM/build$ ./nr-gnb -c gnb.yaml
UERANSIM v3.2.6
[2023-03-03 22:38:19.539] [sctp] [info] Trying to establish SCTP connection... (xx.2xx.19.2:31145)
[2023-03-03 22:38:19.543] [sctp] [info] SCTP connection established (xx.2xx.19.2:31145)
[2023-03-03 22:38:19.543] [sctp] [debug] SCTP association setup ascId[7]
[2023-03-03 22:38:19.543] [ngap] [debug] Sending NG Setup Request
[2023-03-03 22:38:19.545] [ngap] [debug] NG Setup Response received
[2023-03-03 22:38:19.545] [ngap] [info] NG Setup procedure is successful
root@core5g-amf-1-deployment-6d457b6888-nvnh8:/#
root@core5g-amf-1-deployment-6d457b6888-nvnh8:/# tcpdump -i any sctp -s0 -nv
tcpdump: data link type LINUX_SLL2
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
22:38:19.542200 eth0  In  IP (tos 0x2,ECT(0), ttl 62, id 0, offset 0, flags [DF], proto SCTP (132), length 68)
    192.168.13.2.5083 > 192.168.20.99.38412: sctp (1) [INIT] [init tag: 1440608999] [rwnd: 106496] [OS: 10] [MIS: 10] [init TSN: 2670472195]
22:38:19.542249 eth0  Out IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 292)
    192.168.20.99.38412 > 192.168.13.2.5083: sctp (1) [INIT ACK] [init tag: 3400320274] [rwnd: 106496] [OS: 10] [MIS: 10] [init TSN: 3283050420]
22:38:19.543047 eth0  In  IP (tos 0x2,ECT(0), ttl 62, id 0, offset 0, flags [DF], proto SCTP (132), length 264)
    192.168.13.2.5083 > 192.168.20.99.38412: sctp (1) [COOKIE ECHO]
22:38:19.543077 eth0  Out IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 36)
    192.168.20.99.38412 > 192.168.13.2.5083: sctp (1) [COOKIE ACK]
22:38:19.544145 eth0  In  IP (tos 0x2,ECT(0), ttl 62, id 0, offset 0, flags [DF], proto SCTP (132), length 120)
    192.168.13.2.5083 > 192.168.20.99.38412: sctp (1) [DATA] (B)(E) [TSN: 2670472195] [SID: 0] [SSEQ 0] [PPID 0x3c]
22:38:19.544164 eth0  Out IP (tos 0x2,ECT(0), ttl 64, id 12175, offset 0, flags [DF], proto SCTP (132), length 48)
    192.168.20.99.38412 > 192.168.13.2.5083: sctp (1) [SACK] [cum ack 2670472195] [a_rwnd 106424] [#gap acks 0] [#dup tsns 0]
22:38:19.544417 eth0  Out IP (tos 0x2,ECT(0), ttl 64, id 12176, offset 0, flags [DF], proto SCTP (132), length 108)
    192.168.20.99.38412 > 192.168.13.2.5083: sctp (1) [DATA] (B)(E) [TSN: 3283050420] [SID: 0] [SSEQ 0] [PPID 0x3c]
22:38:19.545042 eth0  In  IP (tos 0x2,ECT(0), ttl 62, id 0, offset 0, flags [DF], proto SCTP (132), length 48)
    192.168.13.2.5083 > 192.168.20.99.38412: sctp (1) [SACK] [cum ack 3283050420] [a_rwnd 106439] [#gap acks 0] [#dup tsns 0]
22:38:26.122155 eth0  Out IP (tos 0x2,ECT(0), ttl 64, id 12177, offset 0, flags [DF], proto SCTP (132), length 84)
    192.168.20.99.38412 > 192.168.13.2.5083: sctp (1) [HB REQ]
22:38:26.123221 eth0  In  IP (tos 0x2,ECT(0), ttl 62, id 0, offset 0, flags [DF], proto SCTP (132), length 84)
    192.168.13.2.5083 > 192.168.20.99.38412: sctp (1) [HB ACK]
22:38:32.266149 eth0  Out IP (tos 0x2,ECT(0), ttl 64, id 12178, offset 0, flags [DF], proto SCTP (132), length 84)
    192.168.20.99.38412 > 192.168.13.2.5083: sctp (1) [HB REQ]
22:38:32.267026 eth0  In  IP (tos 0x2,ECT(0), ttl 62, id 0, offset 0, flags [DF], proto SCTP (132), length 84)
    192.168.13.2.5083 > 192.168.20.99.38412: sctp (1) [HB ACK]
22:38:38.666151 eth0  Out IP (tos 0x2,ECT(0), ttl 64, id 12179, offset 0, flags [DF], proto SCTP (132), length 84)
    192.168.20.99.38412 > 192.168.13.2.5083: sctp (1) [HB REQ]
22:38:38.667163 eth0  In  IP (tos 0x2,ECT(0), ttl 62, id 0, offset 0, flags [DF], proto SCTP (132), length 84)
    192.168.13.2.5083 > 192.168.20.99.38412: sctp (1) [HB ACK]
22:38:45.322152 eth0  Out IP (tos 0x2,ECT(0), ttl 64, id 12180, offset 0, flags [DF], proto SCTP (132), length 84)
    192.168.20.99.38412 > 192.168.13.2.5083: sctp (1) [HB REQ]
root@ip-192-168-0-46:/# loxicmd get ct
| DESTINATIONIP |   SOURCEIP   | DESTINATIONPORT | SOURCEPORT | PROTOCOL | STATE  |                   ACT                    | PACKETS | BYTES |
|---------------|--------------|-----------------|------------|----------|--------|------------------------------------------|---------|-------|
| 192.168.0.46  | 192.168.0.1  |              68 |         67 | udp      | closed |                                          |       1 |   351 |
| 192.168.0.46  | 18.xxx.x6.97 |           31145 |      39271 | sctp     | est    | fdnat-192.168.0.46,192.168.13.2:32238:w0 |     105 | 10418 |
| 192.168.0.46  | 192.168.13.2 |           39271 |      32238 | sctp     | est    | fsnat-192.168.0.46,18.xxx.x6.97:31145:w0 |     105 | 10406 |
| 192.168.0.1   | 192.168.0.46 |              67 |         68 | udp      | closed |                                          |       0 |     0 |

Also, I did not disable the SCTP checksum. I have not tested extensively with the checksum enabled, so you could also try that and see whether it works.
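For comparing conntrack output between a working and a non-working setup, a table like the `loxicmd get ct` output above can be parsed into rows and checked for an established SCTP entry in each direction. This is a minimal Python sketch, not a LoxiLB tool: the helper names `parse_ct` and `sctp_nat_entries` are hypothetical, the column names are taken only from the output pasted above, and the `fdnat`/`fsnat` prefixes are treated as opaque labels for the two directions.

```python
# Conntrack table copied from the `loxicmd get ct` output in this comment.
CT_OUTPUT = """
| DESTINATIONIP |   SOURCEIP   | DESTINATIONPORT | SOURCEPORT | PROTOCOL | STATE  |                   ACT                    | PACKETS | BYTES |
|---------------|--------------|-----------------|------------|----------|--------|------------------------------------------|---------|-------|
| 192.168.0.46  | 192.168.0.1  |              68 |         67 | udp      | closed |                                          |       1 |   351 |
| 192.168.0.46  | 18.xxx.x6.97 |           31145 |      39271 | sctp     | est    | fdnat-192.168.0.46,192.168.13.2:32238:w0 |     105 | 10418 |
| 192.168.0.46  | 192.168.13.2 |           39271 |      32238 | sctp     | est    | fsnat-192.168.0.46,18.xxx.x6.97:31145:w0 |     105 | 10406 |
| 192.168.0.1   | 192.168.0.46 |              67 |         68 | udp      | closed |                                          |       0 |     0 |
"""


def parse_ct(text):
    """Parse a pipe-delimited table into a list of dicts keyed by the header row."""
    header = None
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # Skip the |-----|-----| separator row.
        if all(cell and set(cell) == {"-"} for cell in cells):
            continue
        if header is None:
            header = cells
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def sctp_nat_entries(rows):
    """Split SCTP rows into (fdnat_rows, fsnat_rows) based on the ACT prefix."""
    sctp = [r for r in rows if r["PROTOCOL"] == "sctp"]
    fdnat = [r for r in sctp if r["ACT"].startswith("fdnat")]
    fsnat = [r for r in sctp if r["ACT"].startswith("fsnat")]
    return fdnat, fsnat


if __name__ == "__main__":
    fdnat, fsnat = sctp_nat_entries(parse_ct(CT_OUTPUT))
    healthy = bool(fdnat) and bool(fsnat) and all(
        r["STATE"] == "est" for r in fdnat + fsnat
    )
    print("both directions established:", healthy)
```

In the working setup above both SCTP entries are in the `est` state with nearly equal packet counts. A setup where replies are dropped at the LB would presumably look different in this table, which makes it a useful thing to attach to a report.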

backguynn commented 1 year ago

@samassalt I configured EKS and tested SCTP as described in the blog post (EC2 instance and EKS nodes in the same subnet). That scenario worked for me.

While testing with several external clients, I hit a case where a firewall blocked the SCTP response packets from reaching me, much like in your case.

Can you check your firewall settings?

If there are no other problems, please let us know your loxilb settings so we can help you further. Run these commands in the loxilb container:

loxicmd get lb -o wide
loxicmd get ct

Thanks.

UltraInstinct14 commented 1 year ago

The issue was finally traced to an incompatible Linux kernel version. The OP's loxilb node was running kernel 5.10, which had problems handling SCTP (with eBPF). The issue was resolved after upgrading to kernel 6.10. Ideally, any Linux kernel version >= 5.15 should work fine.
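A quick pre-flight check on the loxilb node can catch this before any SCTP debugging starts. The sketch below is a minimal Python example and not something shipped with LoxiLB: the 5.15 threshold is taken from the comment above, the function names are hypothetical, and the release strings in the usage note are illustrative.

```python
import platform
import re

# Minimum kernel (major, minor) suggested in this thread for SCTP with eBPF.
MIN_KERNEL = (5, 15)


def kernel_tuple(release):
    """Extract (major, minor) from a release string such as '5.10.0-21-cloud-amd64'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))


def kernel_ok(release, minimum=MIN_KERNEL):
    """Return True if the release is at or above the minimum (major, minor)."""
    return kernel_tuple(release) >= minimum


if __name__ == "__main__":
    release = platform.release()
    verdict = "ok" if kernel_ok(release) else "too old for SCTP with eBPF"
    print(f"kernel {release}: {verdict}")
```

With this check, a 5.10 release string such as `5.10.0-21-cloud-amd64` is rejected, while `5.15.0-1031-aws` passes. Comparing integer tuples avoids the string-comparison trap where "5.9" would sort above "5.15".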