loxilb-io / loxilb

eBPF based cloud-native load-balancer for Kubernetes|Edge|Telco|IoT|XaaS.
https://www.loxilb.io
Apache License 2.0
1.36k stars, 108 forks

SCTP FullNAT maximum connections issue #328

Closed: xplaa closed 1 year ago

xplaa commented 1 year ago

I am trying to test the following SCTP connection scheme: [image: scheme]

The loxilb configuration is the following:

{
  "lbAttr": [
    {
      "endpoints": [{"endpointIP": "10.49.0.95","state": "active","targetPort": 17028,"weight": 1}],
      "secondaryIPs": null,
      "serviceArguments":
      {
        "externalIP": "10.49.0.166",
        "inactiveTimeOut": 240,
        "mode": 2,
        "port": 2020,
        "protocol": "sctp"
      }
    }
  ]
}

The goal of my setup is to test as many connections as possible through loxilb. Each client creates 20k SCTP connections. Problems start once the server side has about 25k connections:

root@test:/home/test# netstat -an | grep ESTABLISHED | grep sctp | wc -l
25434

At this point, the clients get "Connection reset by peer" errors (errno 104) and cannot create more connections. In the server logs, there are messages of the following format:

##10.49.0.95:17028 -> 10.49.0.166:43277 (132):0 (Aged:0:0:0)
rdir ct4 not found 10.49.0.166:17028 -> 10.49.0.220:51694 (132)
rdir ct4 not found 10.49.0.166:17028 -> 10.49.0.220:34058 (132)
rdir ct4 not found 10.49.0.95:17028 -> 10.49.0.166:43277 (132)
rdir ct4 not found 10.49.0.166:17028 -> 10.49.0.220:45329 (132)
rdir ct4 not found 10.49.0.166:17028 -> 10.49.0.220:35175 (132)
rdir ct4 not found 10.49.0.166:17028 -> 10.49.0.220:34678 (132)
##10.49.0.220:37717 -> 10.49.0.166:2020 (132):1 (Aged:0:1:0)
##10.49.0.110:51970 -> 10.49.0.166:2020 (132):1 (Aged:0:1:0)
##10.49.0.220:57307 -> 10.49.0.166:2020 (132):1 (Aged:0:1:0)
##10.49.0.220:35820 -> 10.49.0.166:2020 (132):1 (Aged:0:1:0)
rdir ct4 not found 10.49.0.166:17028 -> 10.49.0.220:45156 (132)
##10.49.0.110:34344 -> 10.49.0.166:2020 (132):1 (Aged:0:1:0)
##10.49.0.110:46862 -> 10.49.0.166:2020 (132):1 (Aged:0:1:0)
##10.49.0.95:17028 -> 10.49.0.166:45637 (132):0 (Aged:0:0:0)
##10.49.0.220:51884 -> 10.49.0.166:2020 (132):1 (Aged:0:1:0)
##10.49.0.110:44070 -> 10.49.0.166:2020 (132):1 (Aged:0:1:0)
##10.49.0.220:46913 -> 10.49.0.166:2020 (132):1 (Aged:0:1:0)
##10.49.0.110:47142 -> 10.49.0.166:2020 (132):1 (Aged:0:1:0)
rdir ct4 not found 10.49.0.166:47142 -> 10.49.0.95:2020 (132)
rdir ct4 not found 10.49.0.95:17028 -> 10.49.0.166:49793 (132)
rdir ct4 not found 10.49.0.166:17028 -> 10.49.0.220:49134 (132)
rdir ct4 not found 10.49.0.166:17028 -> 10.49.0.110:33510 (132)
rdir ct4 not found 10.49.0.166:17028 -> 10.49.0.110:51954 (132)
##10.49.0.110:58821 -> 10.49.0.166:2020 (132):1 (Aged:0:1:0)

Perhaps I need to adjust the size of the eBPF map or do some other system tuning? I did not find a recommendation about this in the documentation.

nik-netlox commented 1 year ago

Hi @xplaa, can you let me know which testing tools you are using?

xplaa commented 1 year ago

Hi @nik-netlox, as client and server I use the examples from the seastar source code (tcp_sctp_server_demo, tcp_sctp_client_demo) with minor modifications. I should note that without loxilb, the server holds 40k connections from 2 clients.

TrekkieCoder commented 1 year ago

In fullNAT mode, because the source IP is rewritten, traffic from two clients can clash unless the L4 source ports are spread out between the clients (there are at most 64k choices). So, you can try:

  1. Instrumenting your clients to select from a diverse pool of source ports, or
  2. Creating multiple fullNAT rules for the same server and connecting to different VIPs from different clients.
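
As a rough illustration of option 1, here is a minimal Python sketch that gives each client its own disjoint slice of source ports and binds to an explicit port before connecting. The helper names `split_port_range` and `connect_from_port` and the 1024-65535 range are assumptions made for this example, not part of loxilb or seastar, and the socket helper needs a Linux host with kernel SCTP support.

```python
import socket


def split_port_range(low, high, num_clients):
    """Split the inclusive port range [low, high] into num_clients
    disjoint, contiguous slices (hypothetical helper for illustration)."""
    total = high - low + 1
    if num_clients < 1 or num_clients > total:
        raise ValueError("num_clients must be between 1 and the range size")
    base, extra = divmod(total, num_clients)
    slices = []
    start = low
    for i in range(num_clients):
        # The first `extra` clients get one additional port each.
        size = base + (1 if i < extra else 0)
        slices.append((start, start + size - 1))
        start += size
    return slices


def connect_from_port(vip, vip_port, src_port):
    """Open a one-to-one SCTP socket to the VIP from an explicit source port.
    Assumes Linux with SCTP available; not exercised in this sketch."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_SCTP)
    sock.bind(("", src_port))  # pin the source port before connecting
    sock.connect((vip, vip_port))
    return sock


# Two clients sharing ports 1024-65535 without overlap.
slices = split_port_range(1024, 65535, 2)
```

Each client would then draw its source ports only from its own slice, for example `connect_from_port("10.49.0.166", 2020, port)` for every `port` in that slice.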

xplaa commented 1 year ago

What is the reason for this limitation? Is it possible to fix? I need a setup where all clients connect to a single port on the loxilb server.

TrekkieCoder commented 1 year ago

This is a fundamental problem with fullNAT/SNAT. Linux and other OSes do a good job of selecting random ephemeral source ports. Another workaround is to use the default NAT mode of loxilb (not fullNAT), where the source IP is preserved. But in default mode with SCTP, the server and client apps need to make sure they bind strictly to a single system IP address to work properly.
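
To make the clash concrete, here is a small Python sketch. It assumes that fullNAT keeps the client's source port and rewrites only the source IP, which is what the comments above imply; loxilb's actual port handling is not shown in this thread. The function name `count_clashes` and the port numbers are made up for illustration; the IPs are the ones from the logs and config above.

```python
from collections import Counter


def count_clashes(flows, lb_ip):
    """Count flows that collide with an earlier flow once every source IP
    is rewritten to lb_ip (assumption: the source port is left unchanged).

    flows is an iterable of (client_ip, src_port) pairs.
    """
    rewritten = Counter((lb_ip, port) for _client_ip, port in flows)
    # Every extra flow on an already used (lb_ip, port) pair is a clash.
    return sum(n - 1 for n in rewritten.values())


flows = [
    ("10.49.0.220", 40000),
    ("10.49.0.110", 40000),  # same source port as the first client
    ("10.49.0.110", 40001),
]
clashes = count_clashes(flows, "10.49.0.166")

# Toward one server IP and port, the 16-bit source port is the only
# field left to tell flows apart, hence the 64k ceiling.
max_flows = 2 ** 16
```

Here the first two flows differ only in client IP, so after the rewrite they look identical to the server and one of them clashes.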

xplaa commented 1 year ago

@TrekkieCoder, thank you for your responses.

I tried using Normal NAT (mode = 0 in the cfg file). All SCTP connections stay in the init state. Is there anything else I need to add to the config file? [image: mode0]

One-ARM mode works fine, though.

Finally, I have 2 questions:

  1. Why is Normal NAT mode not working?
  2. Are there any 64k limits for One-ARM mode?

TrekkieCoder commented 1 year ago

Hi @xplaa

  1. Not sure why normal NAT is not working. Can you please confirm whether all the nodes (client, loxilb and server) are in the same subnet (10.49.0.0/16)? We will try to reproduce this!
  2. Yes, one-arm will also have the 64k limitation.

xplaa commented 1 year ago

  1. Yes, all the nodes are in the same subnet.

PacketCrunch commented 1 year ago

I would like to chip in. If all nodes are in the same subnet, only fullNAT/one-arm mode will work. If default NAT mode 0 (also known as two-arm mode) is required, then a routed network is needed, with each arm (in/out) in a different subnet. IMO, if you lay out the topology like the following and configure an appropriate LB rule, it should work fine:

graph LR
    Client1_10.50.0.10--- |10.50.0.x/16|10.50.0.12_loxilb_10.49.0.166
    Client2_10.50.0.11--- |10.50.0.x/16|10.50.0.12_loxilb_10.49.0.166
    10.50.0.12_loxilb_10.49.0.166 ---- |10.49.0.x/16| echoserver

You also need to make sure echoserver has a route to reach the client subnet (10.50.0.0/16) in the above.
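
As a quick sanity check of this rule of thumb, here is a minimal Python sketch using the standard `ipaddress` module. The function names, the /16 prefix default and the returned labels are assumptions for illustration; it only restates the same-subnet observation above and does not query loxilb.

```python
import ipaddress


def nat_mode_hint(client_ips, server_ip, prefix_len=16):
    """Return which loxilb NAT mode the rule of thumb above suggests,
    based only on whether clients share the server's subnet."""
    server_net = ipaddress.ip_interface(f"{server_ip}/{prefix_len}").network
    same_subnet = all(ipaddress.ip_address(ip) in server_net for ip in client_ips)
    return "fullnat/one-arm" if same_subnet else "default NAT (routed)"


def client_route_command(client_subnet, loxilb_ip):
    """Build the iproute2 command the server would need so that replies
    to the client subnet go back through loxilb."""
    return f"ip route add {client_subnet} via {loxilb_ip}"


# Original setup from this issue: everything in 10.49.0.0/16.
flat = nat_mode_hint(["10.49.0.220", "10.49.0.110"], "10.49.0.95")

# Topology from the diagram: clients in 10.50.0.0/16, server in 10.49.0.0/16.
routed = nat_mode_hint(["10.50.0.10", "10.50.0.11"], "10.49.0.95")
route_cmd = client_route_command("10.50.0.0/16", "10.49.0.166")
```

In the routed case, `route_cmd` is the kind of command you would run on the echoserver so that it can reach the client subnet via loxilb.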

xplaa commented 1 year ago

Hi @PacketCrunch

Thanks for the description and diagram. Looks like a networking error on my side. I'll try to set it up and post the results.

xplaa commented 1 year ago

> You also need to make sure echoserver has a route to reach the client subnet (10.50.0.0/16) in the above.

Yes, it really was my fault. Normal NAT actually works as I need. Thanks everyone for the replies!