vnktram opened 1 year ago
The destination site router pods restart and drop connections explicitly
The destination site router pods are restarting because the router seems to be crashing. Version 2.4.1 of skupper-router prints panic information if the router crashes; it shows up in stderr and also in the log, something like this -
*** SKUPPER-ROUTER FATAL ERROR ***
Version: 2.4.1
Signal: 11 SIGSEGV
Process ID: 1 (skrouterd)
Thread ID: 16 (wrkr_1)
Backtrace:
[0] IP: 0x00007f21385ffdf0 (/lib64/libc.so.6 + 0x0000000000054df0)
Registers:
RAX: 0x00000000ffffffff RDI: 0x00007f2134936b70 R11: 0x00007f21349374d0
RBX: 0x0000000000001388 RBP: 0x00007f2134938070 R12: 0x000055820337ba80
RCX: 0x00007f21386f02db R8: 0x0000000000000000 R13: 0x0000000000000000
RDX: 0x0000000000000000 R9: 0x0000000000000000 R14: 0x0000000000000000
RSI: 0x0000000000000002 R10: 0x00000000000000fd R15: 0x00005582049ac4d0
SP: 0x00007f2134937380
[1] IP: 0x00005582032bb98f (skrouterd + 0x00000000000c598f)
Registers:
RAX: 0x0000000000000000 RDI: 0x000055820337ba80 R11: 0x0000000000000246
RBX: 0x0000000000001388 RBP: 0x00007f2134938070 R12: 0x000055820337ba80
RCX: 0x0000000000000000 R8: 0x00007f21146e8050 R13: 0x0000000000000000
RDX: 0x0000000000000001 R9: 0x0000000000000000 R14: 0x0000000000000000
RSI: 0x0000000000000000 R10: 0x0000000000004000 R15: 0x00005582049ac4d0
SP: 0x00007f2134938050
[2] IP: 0x00005582032b7a3b (skrouterd + 0x00000000000c1a3b)
Registers:
RAX: 0x0000000000000000 RDI: 0x000055820337ba80 R11: 0x0000000000000246
RBX: 0x00005582049ac4d0 RBP: 0x00007f21349380a0 R12: 0x00007f211c4da088
RCX: 0x0000000000000000 R8: 0x00007f21146e8050 R13: 0x00007f21200c2448
RDX: 0x0000000000000001 R9: 0x0000000000000000 R14: 0x00007f21200c2588
RSI: 0x0000000000000000 R10: 0x0000000000004000 R15: 0x00005582049ac4d0
SP: 0x00007f2134938080
[3] IP: 0x00005582032bfbbe (skrouterd + 0x00000000000c9bbe)
Registers:
RAX: 0x0000000000000000 RDI: 0x000055820337ba80 R11: 0x0000000000000246
RBX: 0x00007f211408b4b8 RBP: 0x00007f2134938210 R12: 0x0000000000000000
RCX: 0x0000000000000000 R8: 0x00007f21146e8050 R13: 0x00007f2114117790
RDX: 0x0000000000000001 R9: 0x0000000000000000 R14: 0x00007f211410ca00
RSI: 0x0000000000000000 R10: 0x0000000000004000 R15: 0x00005582049ac4d0
SP: 0x00007f21349380b0
[4] IP: 0x00007f213864a802 (/lib64/libc.so.6 + 0x000000000009f802)
Registers:
RAX: 0x0000000000000000 RDI: 0x000055820337ba80 R11: 0x0000000000000246
RBX: 0x00007f2134939640 RBP: 0x0000000000000000 R12: 0x00007f2134939640
RCX: 0x0000000000000000 R8: 0x00007f21146e8050 R13: 0x0000000000000002
RDX: 0x0000000000000001 R9: 0x0000000000000000 R14: 0x00007f213864a530
RSI: 0x0000000000000000 R10: 0x0000000000004000 R15: 0x0000000000000000
SP: 0x00007f2134938220
[5] IP: 0x00007f21385ea314 (/lib64/libc.so.6 + 0x000000000003f314)
Registers:
RAX: 0x0000000000000000 RDI: 0x000055820337ba80 R11: 0x0000000000000246
RBX: 0x00007ffdd438d460 RBP: 0x0000000000000000 R12: 0x00007f2134939640
RCX: 0x0000000000000000 R8: 0x00007f21146e8050 R13: 0x0000000000000002
RDX: 0x0000000000000001 R9: 0x0000000000000000 R14: 0x00007f213864a530
RSI: 0x0000000000000000 R10: 0x0000000000004000 R15: 0x0000000000000000
SP: 0x00007f21349382c0
*** END ***
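For what it's worth, the skrouterd + 0x... offsets in frames like [1]-[3] above can usually be resolved to function names with addr2line, assuming a skupper-router build with debug symbols is available; the binary path below is an assumption.

```bash
# Hedged sketch: resolve the in-binary frames from the panic above.
# Assumes GNU binutils and a skrouterd binary built with debug symbols;
# the path /usr/sbin/skrouterd is a guess, adjust to your install.
addr2line -f -C -e /usr/sbin/skrouterd 0xc598f 0xc1a3b 0xc9bbe
```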
I don't think you are running the 2.4.1 version of the router. I see this log line -
2023-07-04 06:44:35.121206 +0000 ROUTER (info) Version: db7622a6e828811794a5884016b5677ccff9d6e6
I suspect you are running the 2.3.2 version of the router. Please make sure you are running 2.4.1, try to reproduce this router crash, and paste the panic output from the log. There is no need to run the routers at trace logging level for the panic to be printed in the logs.
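One way to confirm which version is actually running is to query the router itself; the deployment and container names below (skupper-router, router) are typical defaults and may differ in your setup.

```bash
# skstat -g includes a Version row for the running router
kubectl exec deploy/skupper-router -c router -- skstat -g | grep -i version
# Or inspect the image the deployment is pinned to
kubectl get deploy skupper-router -o jsonpath='{.spec.template.spec.containers[*].image}'
```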
@ganeshmurthy Thanks for the quick reply. I have verified that I'm running 2.4.1 of skupper-router while facing this issue.
Are you saving the logs of the destination site pods after they restart? Do you see any panic output in those logs? If there is no panic output, it might mean that the router is running out of memory. Can you please monitor the router memory by running the skstat -m and skstat -g commands against the router? Run them at regular intervals until the skupper-router pod restarts.
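A minimal polling sketch, assuming the default deployment/container names (skupper-router, router) and a 15-second sample interval; adjust both for your environment:

```bash
# Sample router memory stats at fixed intervals until the pod restarts
while true; do
  date
  kubectl exec deploy/skupper-router -c router -- skstat -m
  kubectl exec deploy/skupper-router -c router -- skstat -g
  sleep 15
done | tee skstat-samples.log
```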
@ganeshmurthy Running skstat during the load test does not help, since it errors out trying to connect over AMQP.
Running kubectl top pods shows that the pods reach roughly 1 core of CPU (which is the provisioned request) before restarting.
Additionally, no panic logs are printed on router restarts.
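Worth noting: if the router is being OOM-killed, no panic is printed because the kernel kills the process outright, and the kubelet records the reason instead. A hedged way to check (the pod label below is an assumed default, verify against your pods):

```bash
# List router pods and their restart counts
kubectl get pods -l skupper.io/component=router
# Inspect the last termination state; an OOM kill shows Reason: OOMKilled
kubectl describe pod <router-pod-name> | grep -A 3 "Last State"
```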
Is it expected that the skupper router requires > 1 GB of memory and > 1 core of CPU for roughly 20 RPS per router?
> Is it expected that the skupper router requires > 1 GB of memory and > 1 core of CPU for roughly 20 RPS per router?
The router is organized around connections, not requests. So, how many connections are there, both connections made to the router and connections made by the router?
~Can you post skstat -l statistics for the routers involved?~
In addition, the router reports VmSize in its memory stats, not RSS, which is probably the more relevant measure of memory.
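A sketch for checking both from inside the router pod; it assumes the default deployment/container names, that the image ships grep, and that skrouterd runs as PID 1 (which the panic header above confirms):

```bash
# List the router's open connections (count the rows below the header)
kubectl exec deploy/skupper-router -c router -- skstat -c
# Compare VmSize (what the router reports) against VmRSS (resident memory)
kubectl exec deploy/skupper-router -c router -- grep -E "VmSize|VmRSS" /proc/1/status
```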
Skupper is running in dedicated namespaces on two EKS clusters. One runs in interior mode and the other runs on LB.
Installation info:
One instance of a static nginx pod is run on both clusters.
Plain old ingress: a pod running in the base cluster (where the ingress controller is installed) handles 100 RPS at ~60 ms.
Via skupper: a pod running on the restricted cluster, with traffic routed via skupper from the base cluster, handles 100 RPS at ~4.63 s.
When running the load test at 100 RPS, the destination site router pods restart and drop connections explicitly. This results in roughly 40% of requests being dropped.
Attaching load test results (base is a pod directly accessed via the ingress controller, router is a pod accessed via skupper routing) and logs from the source router and destination router at the time of the restarts: dest-router2.log, destination-router.log, source-router.log