skupperproject / skupper

Skupper is an implementation of a Virtual Application Network, enabling rich hybrid cloud communication.
http://skupper.io
Apache License 2.0

skupper router fails to handle 100 rps and restarts even with 5 router instances #1158

Open vnktram opened 1 year ago

vnktram commented 1 year ago

Skupper is running in dedicated namespaces on two EKS clusters. One runs in interior mode and the other runs behind a load balancer (LB).

Installation info:

skupper version
client version                 1.4.1
transport version              quay.io/skupper/skupper-router:2.4.1 (sha256:0d207c60a150)
controller version             quay.io/skupper/service-controller:1.4.1 (sha256:38de27dbae9f)
config-sync version            quay.io/skupper/config-sync:1.4.1 (sha256:2c0820947577)
flow-collector version         not-found

One instance of a static nginx pod is run on both clusters.

Plain old ingress: a pod running in the base cluster (where the ingress controller is installed) handles 100 rps at ~60ms:

 http_req_duration: avg=56.78ms  min=37.75ms med=53.25ms max=379.67ms p(90)=70.84ms  p(95)=75.57ms

A pod running on the restricted cluster, where traffic is routed via Skupper from the base cluster, handles 100 rps at ~4.63s:

http_req_duration: avg=4.63s   min=34.92ms med=66.95ms max=51.15s   p(90)=1.15s   p(95)=50.62s

When running the load test at 100 RPS, the destination-site router pods restart and explicitly drop connections. This results in roughly 40% of requests being dropped.

Attaching load test results

Screenshot 2023-07-04 at 2 45 14 PM

(base = a pod accessed directly via the ingress controller; router = a pod accessed via Skupper routing). Also attaching logs from the source and destination routers at the time of the restarts: dest-router2.log destination-router.log source-router.log

ganeshmurthy commented 1 year ago

the destination site router pods restart and drop connections explicitly

The destination site router pods are restarting because the router appears to be crashing. Version 2.4.1 of skupper-router prints panic information if the router crashes. It shows up in stderr, and in the log as well, looking something like this:

*** SKUPPER-ROUTER FATAL ERROR ***
Version: 2.4.1
Signal: 11 SIGSEGV
Process ID: 1 (skrouterd)
Thread ID: 16 (wrkr_1)
 
Backtrace:
[0] IP: 0x00007f21385ffdf0 (/lib64/libc.so.6 + 0x0000000000054df0)
    Registers:
      RAX: 0x00000000ffffffff RDI: 0x00007f2134936b70 R11: 0x00007f21349374d0
      RBX: 0x0000000000001388 RBP: 0x00007f2134938070 R12: 0x000055820337ba80
      RCX: 0x00007f21386f02db R8:  0x0000000000000000 R13: 0x0000000000000000
      RDX: 0x0000000000000000 R9:  0x0000000000000000 R14: 0x0000000000000000
      RSI: 0x0000000000000002 R10: 0x00000000000000fd R15: 0x00005582049ac4d0
      SP:  0x00007f2134937380
 
[1] IP: 0x00005582032bb98f (skrouterd + 0x00000000000c598f)
    Registers:
      RAX: 0x0000000000000000 RDI: 0x000055820337ba80 R11: 0x0000000000000246
      RBX: 0x0000000000001388 RBP: 0x00007f2134938070 R12: 0x000055820337ba80
      RCX: 0x0000000000000000 R8:  0x00007f21146e8050 R13: 0x0000000000000000
      RDX: 0x0000000000000001 R9:  0x0000000000000000 R14: 0x0000000000000000
      RSI: 0x0000000000000000 R10: 0x0000000000004000 R15: 0x00005582049ac4d0
      SP:  0x00007f2134938050
 
[2] IP: 0x00005582032b7a3b (skrouterd + 0x00000000000c1a3b)
    Registers:
      RAX: 0x0000000000000000 RDI: 0x000055820337ba80 R11: 0x0000000000000246
      RBX: 0x00005582049ac4d0 RBP: 0x00007f21349380a0 R12: 0x00007f211c4da088
      RCX: 0x0000000000000000 R8:  0x00007f21146e8050 R13: 0x00007f21200c2448
      RDX: 0x0000000000000001 R9:  0x0000000000000000 R14: 0x00007f21200c2588
      RSI: 0x0000000000000000 R10: 0x0000000000004000 R15: 0x00005582049ac4d0
      SP:  0x00007f2134938080
 
[3] IP: 0x00005582032bfbbe (skrouterd + 0x00000000000c9bbe)
    Registers:
      RAX: 0x0000000000000000 RDI: 0x000055820337ba80 R11: 0x0000000000000246
      RBX: 0x00007f211408b4b8 RBP: 0x00007f2134938210 R12: 0x0000000000000000
      RCX: 0x0000000000000000 R8:  0x00007f21146e8050 R13: 0x00007f2114117790
      RDX: 0x0000000000000001 R9:  0x0000000000000000 R14: 0x00007f211410ca00
      RSI: 0x0000000000000000 R10: 0x0000000000004000 R15: 0x00005582049ac4d0
      SP:  0x00007f21349380b0
 
[4] IP: 0x00007f213864a802 (/lib64/libc.so.6 + 0x000000000009f802)
    Registers:
      RAX: 0x0000000000000000 RDI: 0x000055820337ba80 R11: 0x0000000000000246
      RBX: 0x00007f2134939640 RBP: 0x0000000000000000 R12: 0x00007f2134939640
      RCX: 0x0000000000000000 R8:  0x00007f21146e8050 R13: 0x0000000000000002
      RDX: 0x0000000000000001 R9:  0x0000000000000000 R14: 0x00007f213864a530
      RSI: 0x0000000000000000 R10: 0x0000000000004000 R15: 0x0000000000000000
      SP:  0x00007f2134938220
 
[5] IP: 0x00007f21385ea314 (/lib64/libc.so.6 + 0x000000000003f314)
    Registers:
      RAX: 0x0000000000000000 RDI: 0x000055820337ba80 R11: 0x0000000000000246
      RBX: 0x00007ffdd438d460 RBP: 0x0000000000000000 R12: 0x00007f2134939640
      RCX: 0x0000000000000000 R8:  0x00007f21146e8050 R13: 0x0000000000000002
      RDX: 0x0000000000000001 R9:  0x0000000000000000 R14: 0x00007f213864a530
      RSI: 0x0000000000000000 R10: 0x0000000000004000 R15: 0x0000000000000000
      SP:  0x00007f21349382c0
 
*** END ***
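As an aside on reading the backtrace above: the "(module + offset)" shown in each frame is the instruction pointer minus the module's load address, and those offsets are what you would feed to a tool like addr2line against an unstripped skrouterd binary. A minimal sketch (plain arithmetic, using the numbers from the backtrace above) that verifies the skrouterd frames all share one load base:

```python
# Frame IPs and their reported module-relative offsets, copied from the
# skrouterd frames [1]-[3] in the backtrace above.
frames = [
    (0x00005582032BB98F, 0x0C598F),  # [1] skrouterd
    (0x00005582032B7A3B, 0x0C1A3B),  # [2] skrouterd
    (0x00005582032BFBBE, 0x0C9BBE),  # [3] skrouterd
]

# IP - offset recovers the module load base; all frames should agree.
bases = {ip - off for ip, off in frames}
print([hex(b) for b in bases])  # ['0x5582031f6000']
```

The same arithmetic holds for the libc frames, which share their own base.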

I don't think you are running the 2.4.1 version of the router. I see this log line -

2023-07-04 06:44:35.121206 +0000 ROUTER (info) Version: db7622a6e828811794a5884016b5677ccff9d6e6

I suspect you are running the 2.3.2 version of the router. Please make sure you are running 2.4.1 of the router, try to reproduce this router crash, and paste the panic output from the log. There is no need to run the routers at trace logging level for the panic to be printed in the logs.
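One way to confirm which build is actually running is to pull the version string out of the router log rather than trusting the image tag. A minimal sketch, assuming the "ROUTER (info) Version: ..." log line format shown above (the log text would in practice come from something like kubectl logs on the router pod):

```python
import re

def router_version(log_text):
    """Extract the version string from a 'ROUTER (info) Version: ...' log line.

    Returns None if no such line is present.
    """
    m = re.search(r"ROUTER \(info\) Version: (\S+)", log_text)
    return m.group(1) if m else None

# The log line quoted in the comment above: note the version is a git commit
# hash, not a release tag like "2.4.1" -- a hint this is not a release build.
log = ("2023-07-04 06:44:35.121206 +0000 ROUTER (info) "
      "Version: db7622a6e828811794a5884016b5677ccff9d6e6")
print(router_version(log))
```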

vnktram commented 1 year ago

@ganeshmurthy Thanks for the quick reply. I have verified that I'm running 2.4.1 of skupper-router while facing this issue.

Screenshot 2023-07-06 at 1 11 09 PM Screenshot 2023-07-06 at 1 11 27 PM
ganeshmurthy commented 1 year ago

Are you saving the logs of the destination site pods after they restart? Do you see any panic output in those logs? If there is no panic output in those logs, it might mean that the router is running out of memory. Can you please monitor the router memory by running the skstat -m and skstat -g commands against the router? Run them at regular intervals before the skupper-router pod restarts.
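The repeated sampling suggested above can be sketched as a small polling loop. The sampler below is deliberately pluggable; in practice it would shell out to skstat via the pod (the pod name and namespace in the comment are placeholders, not from this issue):

```python
import time

def poll(sample_fn, interval_s, count):
    """Collect `count` samples at fixed intervals.

    Returns a list of (timestamp, value) tuples, where value is whatever
    sample_fn() produced -- e.g. the raw text output of `skstat -m`.
    """
    samples = []
    for _ in range(count):
        samples.append((time.time(), sample_fn()))
        time.sleep(interval_s)
    return samples

# In practice sample_fn might run (pod/namespace names are hypothetical):
#   subprocess.run(["kubectl", "exec", "-n", "skupper", "deploy/skupper-router",
#                   "-c", "router", "--", "skstat", "-m"],
#                  capture_output=True, text=True).stdout
```

Saving each sample with its timestamp makes it easy to line the memory growth up against the moment the pod restarts.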

vnktram commented 1 year ago

@ganeshmurthy Running skstat during the load test does not help since it errors out trying to connect over AMQP. Running kubectl top pods shows that the pods reach roughly 1 CPU core (which is the provisioned request) before restarting.

Screenshot 2023-07-14 at 1 36 32 PM Screenshot 2023-07-14 at 1 36 52 PM Screenshot 2023-07-14 at 1 37 03 PM Screenshot 2023-07-11 at 11 57 00 AM
vnktram commented 1 year ago

Additionally, no panic logs are printed on router restarts.

vnktram commented 1 year ago

Is it expected that the skupper router requires > 1 GB of memory and > 1 CPU core for roughly 20 RPS per router?

jirkadanek commented 1 year ago

Is it expected that the skupper router requires > 1 GB of memory and > 1 CPU core for roughly 20 RPS per router?

The router is organized around connections, not requests. So, how many connections are there, counting both connections made to the router and connections made by the router?

~Can you post skstat -l statistics for the routers involved?~

In addition, the router reports VmSize in its memory stats, not RSS, which is probably the more relevant measure of memory.
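The VmSize/RSS distinction is easy to see in the Linux /proc status format: VmSize is the whole virtual address space, while VmRSS is what is actually resident, and the two can differ by a large factor. A small sketch parsing both fields (the sample values below are illustrative, not from this issue):

```python
def vm_stats(status_text):
    """Parse VmSize and VmRSS (in kB) from the contents of /proc/<pid>/status."""
    stats = {}
    for line in status_text.splitlines():
        if line.startswith(("VmSize:", "VmRSS:")):
            key, value = line.split(":")
            stats[key] = int(value.strip().split()[0])  # value looks like "  1048576 kB"
    return stats

# Illustrative sample: a 1 GiB address space with only 200 MiB resident,
# showing how VmSize alone can overstate real memory pressure.
sample = "VmPeak:\t 1048576 kB\nVmSize:\t 1048576 kB\nVmRSS:\t  204800 kB\n"
print(vm_stats(sample))  # {'VmSize': 1048576, 'VmRSS': 204800}
```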