networkop / meshnet-cni

a (K8s) CNI plugin to create arbitrary virtual network topologies

Failure to AddWireRemote in large topology in multi-node cluster using grpc-wire #55

Closed: alexmasi closed this issue 1 year ago

alexmasi commented 1 year ago

I am having some issues when using meshnet with grpc-wire.

I have a >300-node topology (running in KNE) spread across a 5-node k8s cluster (4 workers).

Many of the router pods get stuck in Init:0/1 state and some of the meshnet pods get stuck in a crash loop:

$ kubectl get pods -n meshnet -o wide
NAME            READY   STATUS             RESTARTS           AGE    IP            NODE         NOMINATED NODE   READINESS GATES
meshnet-6gzft   1/1     Running            0                  4d9h   10.240.0.11   controller   <none>           <none>
meshnet-6vqtf   1/1     Running            0                  4d9h   10.240.0.21   worker-1     <none>           <none>
meshnet-849jl   1/1     Running            279 (3d2h ago)     4d9h   10.240.0.23   worker-3     <none>           <none>
meshnet-g7swq   0/1     CrashLoopBackOff   1141 (108s ago)    4d9h   10.240.0.24   worker-4     <none>           <none>
meshnet-pwr6k   0/1     CrashLoopBackOff   1176 (4m56s ago)   4d9h   10.240.0.22   worker-2     <none>           <none>
$ kubectl logs meshnet-g7swq -n meshnet
Distributing files
Starting meshnetd daemon
time="2022-12-13T02:24:29Z" level=info msg="Trying in-cluster configuration"
time="2022-12-13T02:24:29Z" level=info msg="[core] [Server #1] Server created" system=system
time="2022-12-13T02:24:29Z" level=info msg="Starting meshnet daemon...with grpc support"
time="2022-12-13T02:24:29Z" level=info msg="GRPC server has started on port: 51111"
time="2022-12-13T02:24:29Z" level=info msg="[core] [Server #1 ListenSocket #2] ListenSocket created" system=system
time="2022-12-13T02:25:03Z" level=info msg="Error creating vEth pair (in:eth36 <--> out:bx04reth36-0001).  Error-> failed to rename link koko3901300170 -> eth36: file exists"
time="2022-12-13T02:25:03Z" level=error msg="AddWireRemote err : failed to rename link koko3901300170 -> eth36: file exists"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x140d56a]

goroutine 98 [running]:
github.com/networkop/meshnet-cni/daemon/meshnet.(*Meshnet).AddGRPCWireRemote(0x0?, {0x0?, 0x0?}, 0x0?)
    /go/src/github.com/networkop/meshnet-cni/daemon/meshnet/handler.go:348 +0xaa
...
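
The "file exists" failure suggests a leftover interface already holds the target name (eth36) from an earlier, interrupted attempt, and the stack trace indicates the error path then hits a nil pointer. As an illustration only, and not meshnet-cni's actual code, here is a minimal sketch of the defensive pattern using the vishvananda/netlink library; `renameReplacing` is a hypothetical helper:

```go
// Illustration only, not meshnet-cni's actual code: before renaming the
// temporary link to its final name (e.g. eth36), delete any stale link that
// already holds that name, so the rename cannot fail with "file exists".
package main

import (
	"fmt"
	"log"

	"github.com/vishvananda/netlink"
)

// renameReplacing is a hypothetical helper: it renames tmpName to finalName in
// the current network namespace, removing a leftover link that already has
// finalName (assumed to be an orphan from an earlier, interrupted wire setup).
func renameReplacing(tmpName, finalName string) error {
	if stale, err := netlink.LinkByName(finalName); err == nil {
		// The target name is already taken; assume the link is stale and
		// delete it so the rename below can succeed.
		if err := netlink.LinkDel(stale); err != nil {
			return fmt.Errorf("failed to delete stale link %s: %w", finalName, err)
		}
	}
	link, err := netlink.LinkByName(tmpName)
	if err != nil {
		return fmt.Errorf("failed to find temporary link %s: %w", tmpName, err)
	}
	if err := netlink.LinkSetName(link, finalName); err != nil {
		return fmt.Errorf("failed to rename %s -> %s: %w", tmpName, finalName, err)
	}
	return nil
}

func main() {
	// Example invocation with the names from the log above; requires the links
	// to exist in the current namespace (and typically root privileges).
	if err := renameReplacing("koko3901300170", "eth36"); err != nil {
		log.Println(err)
	}
}
```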

I see that the failing pod:interface pair (bx04r:eth36) is reverse-skipped by the peer pod on that link (bx03s:eth1): "Reverse-skipping of pod bx04rno10 by pod bx03sfo03"

excerpt from KNE topology:

links: {
  a_node: "bx03sfo03"
  a_int: "eth1"
  z_node: "bx04rno10"
  z_int: "eth36"
}

However, I do not see an entry in the ip link table for bx03s*, even though bx03 is stuck in the Init state on a worker with a healthy meshnet pod.

On another note, even the running pods report strange errors (which seem like simple no-op bugs upon inspection):

$ kubectl logs meshnet-6vqtf -n meshnet | grep bx07iad60 | head -100
time="2022-12-08T16:46:47Z" level=info msg="Retrieving bx07iad60's metadata from K8s..."
time="2022-12-08T16:46:47Z" level=info msg="Reading pod bx07iad60 from K8s"
time="2022-12-08T16:46:47Z" level=info msg="Skipping of pod bx07iad60 by pod bx09iad60"
time="2022-12-08T17:04:59Z" level=info msg="Retrieving bx07iad60's metadata from K8s..."
time="2022-12-08T17:04:59Z" level=info msg="Reading pod bx07iad60 from K8s"
time="2022-12-08T17:04:59Z" level=info msg="Setting bx07iad60's SrcIp=10.240.0.21 and NetNs=/proc/11867/ns/net"
time="2022-12-08T17:04:59Z" level=info msg="Reading pod bx07iad60 from K8s"
time="2022-12-08T17:05:00Z" level=info msg="Update pod status bx07iad60 from K8s"
time="2022-12-08T17:05:01Z" level=info msg="Checking if bx06iad23 is skipped by bx07iad60"
time="2022-12-08T17:05:03Z" level=info msg="Checking if bx09iad60 is skipped by bx07iad60"
time="2022-12-08T17:05:04Z" level=info msg="Checking if bx06iad30 is skipped by bx07iad60"
time="2022-12-08T17:05:05Z" level=info msg="Removing grpc-wire for pod arista-kne-topo:bx07iad60"
time="2022-12-08T17:05:06Z" level=info msg="Removed all grpc-wire for pod: eth6@bx07iad60"
time="2022-12-08T17:05:06Z" level=error msg="finished unary call with code Unknown" error="failed to remove gRPC wire for pod bx07iad60: %!w(<nil>)" grpc.code=Unknown grpc.method=RemGRPCWire grpc.service=meshnet.v1beta1.Local grpc.start_time="2022-12-08T17:05:05Z" grpc.time_ms=94.334 peer.address="127.0.0.1:40130" span.kind=server system=grpc

In the worker nodes with the failed meshnet pods I see many koko links waiting to be renamed:

worker-2:~$ ip -d link show | grep koko | head
1024: koko2916450998@if1025: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
3073: koko2154029822@if3074: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
2817: koko3259706345@if2818: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
2561: koko832348741@if2562: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
2305: koko2799390267@if2306: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
2049: koko1542542355@if2050: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
1793: koko4212678092@if1794: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
1281: koko1342903858@if1282: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
3075: koko497316275@if3076: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
2819: koko1392270664@if2820: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
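
Purely as a debugging aid, and assuming the koko prefix is only ever used for temporary veth ends that failed to be renamed, here is a sketch that lists (and optionally deletes) those leftovers with the vishvananda/netlink library; the -delete flag is illustrative:

```go
// Illustrative cleanup sketch, not part of meshnet-cni: enumerate leftover
// "koko*" links on a node and optionally delete them. Assumes the koko prefix
// is only ever used for temporary veth ends that failed to be renamed.
package main

import (
	"flag"
	"fmt"
	"log"
	"strings"

	"github.com/vishvananda/netlink"
)

func main() {
	deleteLinks := flag.Bool("delete", false, "delete the stale links instead of just listing them")
	flag.Parse()

	links, err := netlink.LinkList()
	if err != nil {
		log.Fatalf("listing links: %v", err)
	}
	for _, l := range links {
		name := l.Attrs().Name
		if !strings.HasPrefix(name, "koko") {
			continue
		}
		if *deleteLinks {
			if err := netlink.LinkDel(l); err != nil {
				log.Printf("failed to delete %s: %v", name, err)
				continue
			}
			fmt.Println("deleted", name)
		} else {
			fmt.Println("stale:", name)
		}
	}
}
```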

Any help in debugging this further would be appreciated

alexmasi commented 1 year ago

After using the container with the debug fixes I am seeing most of the pods come up (306/337). Now only 31 are stuck in an Init state.

Seeing the following potential issues:

Many of these messages across all workers:

e.g.

meshnet-nq2n6 (worker-2):

time="2022-12-13T22:24:56Z" level=info msg="SendToOnce (wire id - 77): Could not find local handle. err:interface 77 is not active"
time="2022-12-13T22:24:56Z" level=error msg="finished unary call with code Unknown" error="interface 77 is not active" grpc.code=Unknown grpc.method=SendToOnce grpc.service=meshnet.v1beta1.WireProtocol grpc.start_time="2022-12-13T22:24:56Z" grpc.time_ms=0.093 peer.address="10.240.0.24:37018" span.kind=server system=grpc

coming from here: https://github.com/networkop/meshnet-cni/blob/d3ae64833f4710c0ba1810737df0bd5de414844f/daemon/grpcwire/grpcwire.go#L254
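
For context on what that error means, here is a rough illustration (not the repository's actual implementation) of the pattern the log line implies: a daemon-local map from wire ID to a local handle, where a lookup for a wire that was never added, or was already torn down, fails with "interface <id> is not active" while the peer keeps calling SendToOnce:

```go
// Illustrative sketch only (not meshnet-cni's implementation) of the pattern
// the log suggests: a daemon-local registry mapping wire IDs to local handles,
// where a lookup for a wire that was never added (or already removed) yields
// an "interface <id> is not active" error while the remote peer keeps sending.
package main

import (
	"fmt"
	"sync"
)

type wireRegistry struct {
	mu      sync.RWMutex
	handles map[int64]string // wire ID -> placeholder for a local handle
}

func newWireRegistry() *wireRegistry {
	return &wireRegistry{handles: make(map[int64]string)}
}

func (r *wireRegistry) add(id int64, h string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.handles[id] = h
}

func (r *wireRegistry) get(id int64) (string, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	h, ok := r.handles[id]
	if !ok {
		return "", fmt.Errorf("interface %d is not active", id)
	}
	return h, nil
}

func main() {
	reg := newWireRegistry()
	reg.add(42, "eth1@podA")

	if _, err := reg.get(77); err != nil {
		// Mirrors the SendToOnce failure: the sender's side of the wire is up,
		// but this node has no local handle registered for wire 77.
		fmt.Println("SendToOnce:", err)
	}
}
```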

Seeing errors like the following; reading through the source code, maybe this should just be info instead of error?

time="2022-12-13T22:23:34Z" level=error msg="Appending new element bx02cbf40"

Seeing a single occurrence of this failure:

meshnet-njd5r (worker-1):

time="2022-12-13T22:22:01Z" level=info msg="Error creating vEth pair (in:eth14 <--> out:bx02deth14-0003).  Error-> failed to Statfs \"/proc/20948/ns/net\": no such file or directory"
time="2022-12-13T22:22:01Z" level=error msg="AddWireRemote err: failed to Statfs \"/proc/20948/ns/net\": no such file or directory"
time="2022-12-13T22:22:01Z" level=error msg="finished unary call with code Unknown" error="failed to Statfs \"/proc/20948/ns/net\": no such file or directory" grpc.code=Unknown grpc.method=AddGRPCWireRemote grpc.service=meshnet.v1beta1.Remote grpc.start_time="2022-12-13T22:22:01Z" grpc.time_ms=106.531 peer.address="10.240.0.23:35042" span.kind=server system=grpc

coming from this chain:

alexmasi commented 1 year ago

Another thing I observed is two restarts of meshnet pods, one due to OOM and the other with an Error status. My concern is that the "missing" interfaces were stored in local memory and did not persist across the restart: https://github.com/networkop/meshnet-cni/blob/d3ae64833f4710c0ba1810737df0bd5de414844f/daemon/grpcwire/grpcwire.go#L143

networkop commented 1 year ago

After using the container with the debug fixes I am seeing most of the pods come up (306/337). Now only 31 are stuck in an Init state.

That's good news. I can merge the PR now or wait until you add more to it; let me know.

Seeing errors like the following, reading through the source code maybe this should just be info instead of error?

agreed

Seeing a single occurrence of this failure:

This looks weird. Is there a chance a pod got deleted while the CNI plugin was still processing it?

Another thing I observed is two restarts of meshnet pods, one due to OOM and the other with an Error status. My concern is that the "missing" interfaces were stored in local memory and did not persist across the restart.

You're most likely right. Looks like if meshnetd is restarted it would have no prior knowledge of existing grpc wires. Should this somehow be persisted on disk or api-server? I guess the pcap.Handles still need to be properly cleaned up and re-opened on pod restart.
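
A hypothetical sketch of the on-disk option, assuming a hostPath such as /var/lib/meshnet for the state file; the record fields and names are illustrative, and the pcap handles would still need to be re-opened after reload:

```go
// Hypothetical sketch (not meshnet-cni code) of one option discussed here:
// persisting the daemon's wire map to a file on the host so a restarted
// meshnetd can rebuild its in-memory state.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// wireRecord is an illustrative, minimal record of one grpc-wire.
type wireRecord struct {
	WireID    int64  `json:"wireID"`
	LocalPod  string `json:"localPod"`
	LocalIntf string `json:"localIntf"`
	PeerNode  string `json:"peerNode"`
}

const stateFile = "/var/lib/meshnet/grpcwires.json" // assumed hostPath location

func saveWires(wires []wireRecord) error {
	if err := os.MkdirAll(filepath.Dir(stateFile), 0o700); err != nil {
		return err
	}
	data, err := json.MarshalIndent(wires, "", "  ")
	if err != nil {
		return err
	}
	// Atomic enough for a sketch: write a temp file, then rename into place.
	tmp := stateFile + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, stateFile)
}

func loadWires() ([]wireRecord, error) {
	data, err := os.ReadFile(stateFile)
	if os.IsNotExist(err) {
		return nil, nil // first start: nothing to reconcile
	}
	if err != nil {
		return nil, err
	}
	var wires []wireRecord
	return wires, json.Unmarshal(data, &wires)
}

func main() {
	// Example: record one wire, then reload as a restarted daemon would.
	if err := saveWires([]wireRecord{{WireID: 77, LocalPod: "bx07iad60", LocalIntf: "eth6", PeerNode: "worker-4"}}); err != nil {
		log.Fatal(err)
	}
	wires, err := loadWires()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("reloaded %d wire(s)\n", len(wires))
}
```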

alexmasi commented 1 year ago

I added one more change to downgrade the log from error to info. I think this should be merged as is.

I spoke with Keysight; they have a fix-race branch on their fork which, combined with an increase to the meshnet pod's memory resource limit (200Mi to 1000Mi), let me get the topology working.

Hopefully they can get that merged into this repo.

I will open a separate issue as an FR to add proper reconciliation for grpc-wires (whether that be through the topology CRD or one of the methods you mention).

alexmasi commented 1 year ago

@kingshukdev for visibility

alexmasi commented 1 year ago

Closing in favor of reconciliation FR: https://github.com/networkop/meshnet-cni/issues/57