networkop / meshnet-cni

a (K8s) CNI plugin to create arbitrary virtual network topologies
BSD 3-Clause "New" or "Revised" License

Bug: Meshnet `cmdDel` is not idempotent, leading to state changes when CNI ADD and CNI DEL commands overlap #70

Closed Cerebus closed 1 year ago

Cerebus commented 1 year ago

I need help debugging this, b/c I really don't know what's going on.

When I create pods connected with meshnet, sometimes meshnet doesn't add the declared interfaces. This can happen at deployment or redeployment of a static pod. I set up my static pods with init containers that wait on the expected interfaces to be added, so it's easy to see when this happens--the Pod gets stuck in Init:0.

What I need to be able to do is reliably delete and restore static Pods from a topology. I do this by deleting the Pod, and /most/ of the time this works flawlessly--but sometimes it doesn't. It's difficult to reproduce this reliably. When it happens, I can sometimes recover by deleting the entire topology (all pods and topos) and re-deploying it. Sometimes the condition persists and I have to rollout restart meshnet. And sometimes I have to reboot the node or cluster to recover.

Looking at the meshnet log, when this happens meshnet thinks the other pod isn't up. As an example, at startup lf1 and lf5 link fine:

time="2023-03-22T15:27:06Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-83b86931-410c-4f80-9846-bf9539cee642 and intfName: eth4, IP:192.168.0.6/31"
time="2023-03-22T15:27:06Z" level=info msg="Pod lf1 is retrieving peer pod lf5 information from meshnet daemon"
time="2023-03-22T15:27:06Z" level=info msg="Is peer pod lf5 alive?: true"
time="2023-03-22T15:27:06Z" level=info msg="Peer pod lf5 is alive"
time="2023-03-22T15:27:06Z" level=info msg="lf1 and lf5 are on the same host"

Then I delete pod lf1, wait a bit, then re-deploy lf1 from the same resource declaration. Then I get:

time="2023-03-22T15:39:06Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-fff6bab6-5bb0-e511-f423-282bde768df2 and intfName: eth4, IP:192.168.0.6/31"
time="2023-03-22T15:39:06Z" level=info msg="Pod lf1 is retrieving peer pod lf5 information from meshnet daemon"
time="2023-03-22T15:39:06Z" level=info msg="Is peer pod lf5 alive?: false"
time="2023-03-22T15:39:06Z" level=info msg="Peer pod lf5 isn't alive yet, continuing"

meshnetd's log looks normal:

2023-03-22T10:39:06-05:00   time="2023-03-22T15:39:06Z" level=info msg="Retrieving lf5's metadata from K8s..." daemon=meshnetd
2023-03-22T10:39:06-05:00   time="2023-03-22T15:39:06Z" level=info msg="Reading pod lf5 from K8s" daemon=meshnetd
2023-03-22T10:39:06-05:00   time="2023-03-22T15:39:06Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=Get grpc.service=meshnet.v1beta1.Local grpc.start_time="2023-03-22T15:39:06Z" grpc.time_ms=8.889 peer.address="[::1]:46588" span.kind=server system=grpc
2023-03-22T10:39:06-05:00   time="2023-03-22T15:39:06Z" level=info msg="Skipping of pod lf5 by pod lf1" daemon=meshnetd
2023-03-22T10:39:06-05:00   time="2023-03-22T15:39:06Z" level=info msg="Reading pod lf1 from K8s" daemon=meshnetd
2023-03-22T10:39:06-05:00   time="2023-03-22T15:39:06Z" level=info msg="Update pod status lf1 from K8s" daemon=meshnetd
2023-03-22T10:39:06-05:00   time="2023-03-22T15:39:06Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=Skip grpc.service=meshnet.v1beta1.Local grpc.start_time="2023-03-22T15:39:06Z" grpc.time_ms=27.629 peer.address="[::1]:46588" span.kind=server system=grpc

I've tried /also/ deleting the lf1 and/or lf5 topo, but that doesn't seem to have an effect.

If I recover and redeploy and eventually recreate this condition, the affected pods or interfaces will change even with the same topology and pod declarations.

Am I bumping up against some resource limit somewhere?

networkop commented 1 year ago

This looks strange. Technically, as I understand it, the liveness of a pod is determined by its srcIP and netNS fields: https://github.com/networkop/meshnet-cni/blob/master/plugin/meshnet.go#L235
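
Roughly, that check boils down to both status fields being non-empty, something like this (a minimal sketch; the Go names are assumptions based on the fields visible in the CRs, not copied from the linked code):

// isAlive mirrors the idea behind the liveness check: a peer counts as
// alive only once CNI ADD has recorded both its source IP and its netns.
func isAlive(srcIP, netNS string) bool {
	return srcIP != "" && netNS != ""
}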

So I think the key is to understand why lf1 thinks that lf5 is not alive. Can you get the full output of lf5's custom resource during the failure condition? I'm curious whether these fields get rewritten/deleted for some reason.

Cerebus commented 1 year ago

Ok, here goes. Same emulation topology, different nodes this time. lf4 is stuck, so I deleted it and recreated it to capture the relevant meshnet-cni.log:

time="2023-03-28T20:03:02Z" level=info msg="INTER_NODE_LINK_TYPE: GRPC"
time="2023-03-28T20:03:02Z" level=info msg="Parsing cni .conf file"
time="2023-03-28T20:03:02Z" level=info msg="Parsing CNI_ARGS environment variable"
time="2023-03-28T20:03:02Z" level=info msg="Processing ADD POD lf4 in namespace default"
time="2023-03-28T20:03:02Z" level=info msg="Attempting to connect to local meshnet daemon"
time="2023-03-28T20:03:02Z" level=info msg="Retrieving local pod information from meshnet daemon"
time="2023-03-28T20:03:02Z" level=info msg="Found iface eth0 for address 192.168.5.15"
time="2023-03-28T20:03:02Z" level=info msg="VxLan route is via 192.168.5.15@eth0"
time="2023-03-28T20:03:02Z" level=info msg="Setting pod alive status on meshnet daemon"
time="2023-03-28T20:03:02Z" level=info msg="Starting to traverse all links"
time="2023-03-28T20:03:02Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-c0744d0d-3cf5-a1d6-541b-5f7b14884f8e and intfName: eth1, IP:192.168.0.5/31"
time="2023-03-28T20:03:02Z" level=info msg="Pod lf4 is retrieving peer pod lf1 information from meshnet daemon"
time="2023-03-28T20:03:02Z" level=info msg="Is peer pod lf1 alive?: false"
time="2023-03-28T20:03:02Z" level=info msg="Peer pod lf1 isn't alive yet, continuing"
time="2023-03-28T20:03:02Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-c0744d0d-3cf5-a1d6-541b-5f7b14884f8e and intfName: eth2, IP:192.168.0.25/31"
time="2023-03-28T20:03:02Z" level=info msg="Pod lf4 is retrieving peer pod lf3 information from meshnet daemon"
time="2023-03-28T20:03:02Z" level=info msg="Is peer pod lf3 alive?: false"
time="2023-03-28T20:03:02Z" level=info msg="Peer pod lf3 isn't alive yet, continuing"
time="2023-03-28T20:03:02Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-c0744d0d-3cf5-a1d6-541b-5f7b14884f8e and intfName: eth3, IP:192.168.0.26/31"
time="2023-03-28T20:03:02Z" level=info msg="Pod lf4 is retrieving peer pod lf5 information from meshnet daemon"
time="2023-03-28T20:03:02Z" level=info msg="Is peer pod lf5 alive?: false"
time="2023-03-28T20:03:02Z" level=info msg="Peer pod lf5 isn't alive yet, continuing"
time="2023-03-28T20:03:03Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-c0744d0d-3cf5-a1d6-541b-5f7b14884f8e and intfName: eth4, IP:192.168.0.28/31"
time="2023-03-28T20:03:03Z" level=info msg="Pod lf4 is retrieving peer pod lf4s information from meshnet daemon"
time="2023-03-28T20:03:03Z" level=info msg="Is peer pod lf4s alive?: false"
time="2023-03-28T20:03:03Z" level=info msg="Peer pod lf4s isn't alive yet, continuing"
time="2023-03-28T20:03:03Z" level=info msg="meshnet cni call successful"

Pods lf1 and lf5 are alive, but lf3 is also stuck:

 > k get pods lf1 lf3 lf5
NAME   READY   STATUS     RESTARTS   AGE
lf1    2/2     Running    0          6m24s
lf3    0/2     Init:0/1   0          6m23s
lf5    2/2     Running    0          6m22s

Topos for lf1 and lf5:

apiVersion: v1
items:
- apiVersion: networkop.co.uk/v1beta1
  kind: Topology
  metadata:
    creationTimestamp: "2023-03-28T19:59:43Z"
    generation: 1
    labels:
      app: mimesis-emulation-default
    name: lf1
    namespace: default
    resourceVersion: "552754"
    uid: 70e32a73-a6db-4b71-b582-658c6648404f
  spec:
    links:
    - local_intf: eth1
      local_ip: 192.168.0.0/31
      peer_intf: eth1
      peer_ip: 192.168.0.1/31
      peer_pod: lf2
      uid: 1
    - local_intf: eth2
      local_ip: 192.168.0.2/31
      peer_intf: eth1
      peer_ip: 192.168.0.3/31
      peer_pod: lf3
      uid: 2
    - local_intf: eth3
      local_ip: 192.168.0.4/31
      peer_intf: eth1
      peer_ip: 192.168.0.5/31
      peer_pod: lf4
      uid: 3
    - local_intf: eth4
      local_ip: 192.168.0.6/31
      peer_intf: eth1
      peer_ip: 192.168.0.7/31
      peer_pod: lf5
      uid: 4
    - local_intf: eth5
      local_ip: 192.168.0.8/31
      peer_intf: eth1
      peer_ip: 192.168.0.9/31
      peer_pod: lf6
      uid: 5
    - local_intf: eth6
      local_ip: 192.168.0.10/31
      peer_intf: eth1
      peer_ip: 192.168.0.11/31
      peer_pod: lf7
      uid: 6
    - local_intf: eth7
      local_ip: 192.168.0.12/31
      peer_intf: eth1
      peer_ip: 192.168.0.13/31
      peer_pod: lf8
      uid: 7
    - local_intf: eth8
      local_ip: 192.168.0.14/31
      peer_intf: eth1
      peer_ip: 192.168.0.15/31
      peer_pod: lf9
      uid: 8
    - local_intf: eth9
      local_ip: 192.168.0.16/31
      peer_intf: eth1
      peer_ip: 192.168.0.17/31
      peer_pod: lf10
      uid: 9
    - local_intf: eth10
      local_ip: 192.168.0.18/31
      peer_intf: eth1
      peer_ip: 192.168.0.19/31
      peer_pod: lf11
      uid: 10
  status:
    net_ns: ""
    skipped:
    - lf10
    - lf10
    - lf11
    - lf11
    - lf4
    - lf4
    - lf8
    - lf8
    - lf4
    - lf4
    - lf4
    src_ip: ""
- apiVersion: networkop.co.uk/v1beta1
  kind: Topology
  metadata:
    creationTimestamp: "2023-03-28T19:59:44Z"
    generation: 1
    labels:
      app: mimesis-emulation-default
    name: lf5
    namespace: default
    resourceVersion: "552756"
    uid: 1c47d794-42ca-4d7c-bcf2-2a4b17cb99df
  spec:
    links:
    - local_intf: eth1
      local_ip: 192.168.0.7/31
      peer_intf: eth4
      peer_ip: 192.168.0.6/31
      peer_pod: lf1
      uid: 4
    - local_intf: eth2
      local_ip: 192.168.0.27/31
      peer_intf: eth3
      peer_ip: 192.168.0.26/31
      peer_pod: lf4
      uid: 14
    - local_intf: eth3
      local_ip: 192.168.0.30/31
      peer_intf: eth2
      peer_ip: 192.168.0.31/31
      peer_pod: lf6
      uid: 16
    - local_intf: eth4
      local_ip: 192.168.0.32/31
      peer_intf: eth1
      peer_ip: 192.168.0.33/31
      peer_pod: lf5s
      uid: 17
  status:
    net_ns: ""
    skipped:
    - lf5s
    - lf5s
    - lf1
    - lf1
    - lf4
    - lf4
    - lf4
    - lf4
    - lf4
    src_ip: ""
kind: List
metadata:
  resourceVersion: ""

networkop commented 1 year ago

This doesn't look healthy at all. Both lf1 and lf5 have net_ns: "" and src_ip: "", which explains why meshnet doesn't see them as alive, e.g. "Is peer pod lf1 alive?: false". Do you have any idea why that might be? You mention it only happens sometimes... can you describe a bit more what "sometimes" means? Maybe what happens right before this condition is triggered?

networkop commented 1 year ago

oh, actually, I just read your original message. So this only happens after you've deleted a pod and redeployed it.

networkop commented 1 year ago

oh, I might actually know what happens here. When a pod is deleted, meshnet cleans up net_ns and src_ip, but it doesn't clean up skipped. I guess I never used meshnet like this; I always did a full cleanup and redeploy. I think it might be an easy fix - just clean up skipped somewhere around here: https://github.com/networkop/meshnet-cni/blob/master/plugin/meshnet.go#L443
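
Something along these lines (a rough sketch of the idea only; the struct is illustrative, not meshnet's actual CRD types):

// On CNI DEL, reset the skipped list together with the other status
// fields instead of leaving stale entries behind.
type topologyStatus struct {
	SrcIP   string
	NetNS   string
	Skipped []string
}

func resetStatus(s *topologyStatus) {
	s.SrcIP = ""
	s.NetNS = ""
	s.Skipped = nil // this is the cleanup that's currently missing
}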

Cerebus commented 1 year ago

I'll give it a try and submit a PR if it helps.

[ETA] It happens /most often/ with "restarting" pods, but it also happens on a clean deployment - usually after I've been mucking about for a while.

networkop commented 1 year ago

ok, sounds good, let me know if it works. Just out of curiosity, why keep the topology CRs (as opposed to a clean redeploy each time)?
I'm thinking there may be another bug here as well. When podA and podB come up at different times and podA marks podB as skipped, podB is never removed from podA's skipped list once it eventually comes up and sets up the link. I'm not sure this results in an actual bug, but the logic seems a bit flawed. Anyhow, try adding the cleanup; I think it should fix at least some of the issues you're seeing.
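
For that second point, what I have in mind is something like this (illustrative only, not actual meshnet code): once the reverse link to a peer is set up, drop that peer from our own skipped list so entries don't accumulate.

// removeSkipped returns the skipped list without the given peer.
func removeSkipped(skipped []string, peer string) []string {
	out := skipped[:0]
	for _, s := range skipped {
		if s != peer {
			out = append(out, s)
		}
	}
	return out
}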

Cerebus commented 1 year ago

I tried deleting the topos with the pods, but it didn't make a difference.

FWIW, what I'm doing here is an outage event. I could exec into the containers and just down the links, but deleting the pods is quicker and makes it clearer that the pod is down when I'm analyzing the collected data (I'm running the PLG stack with SNMP-exporter so I get a full range of instrumentation). If you have advice, I'm open.

networkop commented 1 year ago

I can't think of anything obvious other than that skipped cleanup. Ideally, I'd like to get a scaled-down local repro - do you think that's possible?

Cerebus commented 1 year ago

I'm still planning to take a crack at this, but I wanted to point out that I can see this issue happen on a full, clean deployment. Some pods will not get all the expected interfaces. When this happens, it seems to be [sometimes] sticky; IOW, if I undeploy and redeploy, the same pods will get stuck [or the whole deployment succeeds]. That makes me think the state issue is in the CNI interface or in meshnetd, not the resources.

kingshukdev commented 1 year ago

@Cerebus can you please share your CR YAML containing the pods and links? Since it's happening repeatedly at your end, I should be able to recreate it at my end and provide insight (if I can find any).

manomugdha commented 1 year ago

Hi @Cerebus, along with @kingshukdev's comment, it would be good if you could provide the steps you follow to delete and restore the pod, and the time gap (if any) between the delete and the restore.

networkop commented 1 year ago

there is no state anywhere other than in the CRs. If you do a clean re-deploy, i.e. delete all existing CRs and recreate them, it should not fail in the same way. Ideally, like others have mentioned, if you can share a set of steps to reproduce, this would help move things along. Another option is to wait for https://github.com/networkop/meshnet-cni/pull/73 to be merged, as it fixes a few cleanup and sync issues, especially for GRPC wires.

Cerebus commented 1 year ago

If I had a reliable reproduction case, I'd probably have either a detailed report or a fix long before now. Mainly I've just suffered with it, but I'm nearing a point where I need to start sharing my platform so it's bubbled back near the top of my fix list.

This condition seems to occur totally at random. I've seen it occur on single-node and multi-node clusters, immediately after a cluster reboot, on an initial topology deployment (or complete re-deployment from scratch), after re-applying a Pod that had been deleted from a running topology, in deployments of 2 Pods or hundreds of Pods, and with (complete or partial) delete/create events separated by seconds, minutes, or hours. All I can say is it /happens/.

You sense my frustration. :(

What I also don't get is why it seems to only happen to me.

It could be that y'all just don't notice. For easy detection, here's what I do in an initContainer (yeah, the shell script is hacky, but it's somewhat limited by ash, and I'll eventually get around to rewriting it to use /sys/class/net). IFACES is a comma-separated list of the interfaces in the Pod's Topology, injected into the Pod. In my platform, I set this from my SDK's resource generation via the Downward API, but (a) I can't share that stuff, and (b) it's overkill for just detecting this condition. The main reason I do this is that a quick-starting Pod (or a large deployment) can end up with workloads starting before the Pod has all its interfaces added, which can cause failures I'd like to avoid. I basically stole this from (OK, was inspired by) Arista cEOS's startup script.

#!/bin/sh

# Abort without doing anything if env vars are unset.
: ${IFACES?}

ifacecount() {
    # Count links, not counting lo and container default eth0
    NUMIFACES=$(($(ip link | wc -l) / 2 - 2))
}

ifacecount

# count expected addresses from downward API (count commas and add 1)
EXPECTEDIFACES=$(($(echo ${IFACES} | tr -dc ',' | awk '{ print length; }') + 1))

while [ "${NUMIFACES}" -lt "$EXPECTEDIFACES" ]; do
    echo "waiting; expected: ${EXPECTEDIFACES} found: ${NUMIFACES}"
    sleep 1
    ifacecount
done

This will stick your pods in Init:0 if meshnet fails to add all the interfaces. After that, all I can say is the larger the deployment the more likely it is to happen. But I can't tell you how to /make/ it happen.

I see that #73 merged, so I'll redeploy meshnet and see what happens.

Cerebus commented 1 year ago

OK, I have something odd here; I'm not sure it's reproducible, but I may be onto something.

From the above state, I delete the pod and everything seems OK:

time="2023-04-10T14:18:47Z" level=info msg="INTER_NODE_LINK_TYPE: GRPC"
time="2023-04-10T14:18:47Z" level=info msg="Processing DEL request: lf5"
time="2023-04-10T14:18:47Z" level=info msg="Parsing cni .conf file"
time="2023-04-10T14:18:47Z" level=info msg="Del: Retrieving pod's metadata from meshnet daemon"
time="2023-04-10T14:18:47Z" level=info msg="Del: Retrieving pod's (lf5@default) metadata from meshnet daemon"
time="2023-04-10T14:18:47Z" level=info msg="Del: Topology data still exists in CRs, cleaning up it's status"
time="2023-04-10T14:18:47Z" level=info msg="Del: Iterating over each link for clean-up"
time="2023-04-10T14:18:47Z" level=info msg="Del: Pod lf5 is retrieving peer pod lf1 information from meshnet daemon"
time="2023-04-10T14:18:47Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-f799323a-961b-53cd-3ffc-4889931296a4 and intfName: eth1, IP:192.168.0.7/31"
time="2023-04-10T14:18:47Z" level=info msg="Del: Removing link eth1"
time="2023-04-10T14:18:47Z" level=info msg="Del: Setting skip-reverse flag on peer lf1"
time="2023-04-10T14:18:47Z" level=info msg="Del: Pod lf5 is retrieving peer pod lf4 information from meshnet daemon"
time="2023-04-10T14:18:47Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-f799323a-961b-53cd-3ffc-4889931296a4 and intfName: eth2, IP:192.168.0.27/31"
time="2023-04-10T14:18:47Z" level=info msg="Del: Removing link eth2"
time="2023-04-10T14:18:47Z" level=info msg="Del: Setting skip-reverse flag on peer lf4"
time="2023-04-10T14:18:47Z" level=info msg="Del: Pod lf5 is retrieving peer pod lf6 information from meshnet daemon"
time="2023-04-10T14:18:48Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-f799323a-961b-53cd-3ffc-4889931296a4 and intfName: eth3, IP:192.168.0.30/31"
time="2023-04-10T14:18:48Z" level=info msg="Del: Removing link eth3"
time="2023-04-10T14:18:48Z" level=info msg="Del: Setting skip-reverse flag on peer lf6"
time="2023-04-10T14:18:48Z" level=info msg="Del: Pod lf5 is retrieving peer pod lf5s information from meshnet daemon"
time="2023-04-10T14:18:49Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-f799323a-961b-53cd-3ffc-4889931296a4 and intfName: eth4, IP:192.168.0.32/31"
time="2023-04-10T14:18:49Z" level=info msg="Del: Removing link eth4"
time="2023-04-10T14:18:49Z" level=info msg="Del: Setting skip-reverse flag on peer lf5s"
time="2023-04-10T14:18:49Z" level=info msg="meshnet cni call successful"

Then I recreate that pod before the second (spurious?) delete process happens:

time="2023-04-10T14:19:07Z" level=info msg="INTER_NODE_LINK_TYPE: GRPC"
time="2023-04-10T14:19:07Z" level=info msg="Parsing cni .conf file"
time="2023-04-10T14:19:07Z" level=info msg="Parsing CNI_ARGS environment variable"
time="2023-04-10T14:19:07Z" level=info msg="Processing ADD POD lf5 in namespace default"
time="2023-04-10T14:19:07Z" level=info msg="Attempting to connect to local meshnet daemon"
time="2023-04-10T14:19:07Z" level=info msg="Add: Retrieving local pod information from meshnet daemon"
time="2023-04-10T14:19:07Z" level=info msg="Found iface eth0 for address 192.168.5.15"
time="2023-04-10T14:19:07Z" level=info msg="Add: Setting pod alive status on meshnet daemon"
time="2023-04-10T14:19:07Z" level=info msg="Add: Starting to traverse all links"
time="2023-04-10T14:19:07Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-da3acdf3-9bf2-7825-12b8-f3dba621a467 and intfName: eth1, IP:192.168.0.7/31"
time="2023-04-10T14:19:07Z" level=info msg="Add: Pod lf5 is retrieving peer pod lf1 information from meshnet daemon"
time="2023-04-10T14:19:07Z" level=info msg="Add: Is peer pod lf1 alive?: true"
time="2023-04-10T14:19:07Z" level=info msg="Add: Peer pod lf1 is alive"
time="2023-04-10T14:19:07Z" level=info msg="Add: lf5 and lf1 are on the same host"
time="2023-04-10T14:19:07Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-ab74ec1f-6a3e-3847-4283-c29e9d0550ef and intfName: eth4, IP:192.168.0.6/31"
time="2023-04-10T14:19:07Z" level=info msg="Does the link already exist? Local:false, Peer:false"
time="2023-04-10T14:19:07Z" level=info msg="Neither link exists. Checking if we've been skipped"
time="2023-04-10T14:19:07Z" level=info msg="Have we been skipped by our peer lf1? true"
time="2023-04-10T14:19:07Z" level=info msg="DO we have a higher priority? true"
time="2023-04-10T14:19:07Z" level=info msg="Peer POD has skipped us or we have a higher priority"
time="2023-04-10T14:19:07Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-da3acdf3-9bf2-7825-12b8-f3dba621a467 and intfName: eth2, IP:192.168.0.27/31"
time="2023-04-10T14:19:07Z" level=info msg="Add: Pod lf5 is retrieving peer pod lf4 information from meshnet daemon"
time="2023-04-10T14:19:07Z" level=info msg="Add: Is peer pod lf4 alive?: true"
time="2023-04-10T14:19:07Z" level=info msg="Add: Peer pod lf4 is alive"
time="2023-04-10T14:19:07Z" level=info msg="Add: lf5 and lf4 are on the same host"
time="2023-04-10T14:19:07Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-2cf3d065-486f-809b-dec6-ac5cdf7ad748 and intfName: eth3, IP:192.168.0.26/31"
time="2023-04-10T14:19:07Z" level=info msg="Does the link already exist? Local:false, Peer:false"
time="2023-04-10T14:19:07Z" level=info msg="Neither link exists. Checking if we've been skipped"
time="2023-04-10T14:19:07Z" level=info msg="Have we been skipped by our peer lf4? true"
time="2023-04-10T14:19:07Z" level=info msg="DO we have a higher priority? true"
time="2023-04-10T14:19:07Z" level=info msg="Peer POD has skipped us or we have a higher priority"
time="2023-04-10T14:19:08Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-da3acdf3-9bf2-7825-12b8-f3dba621a467 and intfName: eth3, IP:192.168.0.30/31"
time="2023-04-10T14:19:08Z" level=info msg="Add: Pod lf5 is retrieving peer pod lf6 information from meshnet daemon"
time="2023-04-10T14:19:08Z" level=info msg="Add: Is peer pod lf6 alive?: true"
time="2023-04-10T14:19:08Z" level=info msg="Add: Peer pod lf6 is alive"
time="2023-04-10T14:19:08Z" level=info msg="Add: lf5 and lf6 are on the same host"
time="2023-04-10T14:19:08Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-7c8c1dbd-260e-0a30-ada7-1affffd88598 and intfName: eth2, IP:192.168.0.31/31"
time="2023-04-10T14:19:08Z" level=info msg="Does the link already exist? Local:false, Peer:false"
time="2023-04-10T14:19:08Z" level=info msg="Neither link exists. Checking if we've been skipped"
time="2023-04-10T14:19:08Z" level=info msg="Have we been skipped by our peer lf6? true"
time="2023-04-10T14:19:08Z" level=info msg="DO we have a higher priority? false"
time="2023-04-10T14:19:08Z" level=info msg="Peer POD has skipped us or we have a higher priority"
time="2023-04-10T14:19:08Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-da3acdf3-9bf2-7825-12b8-f3dba621a467 and intfName: eth4, IP:192.168.0.32/31"
time="2023-04-10T14:19:08Z" level=info msg="Add: Pod lf5 is retrieving peer pod lf5s information from meshnet daemon"
time="2023-04-10T14:19:08Z" level=info msg="Add: Is peer pod lf5s alive?: true"
time="2023-04-10T14:19:08Z" level=info msg="Add: Peer pod lf5s is alive"
time="2023-04-10T14:19:08Z" level=info msg="Add: lf5 and lf5s are on the same host"
time="2023-04-10T14:19:08Z" level=info msg="Creating Veth struct with NetNS:/var/run/netns/cni-2c404847-7efa-1abb-126e-1439630f254f and intfName: eth1, IP:192.168.0.33/31"
time="2023-04-10T14:19:08Z" level=info msg="Does the link already exist? Local:false, Peer:false"
time="2023-04-10T14:19:08Z" level=info msg="Neither link exists. Checking if we've been skipped"
time="2023-04-10T14:19:08Z" level=info msg="Have we been skipped by our peer lf5s? true"
time="2023-04-10T14:19:08Z" level=info msg="DO we have a higher priority? false"
time="2023-04-10T14:19:08Z" level=info msg="Peer POD has skipped us or we have a higher priority"
time="2023-04-10T14:19:08Z" level=info msg="meshnet cni call successful"

Then, the second "delete" event occurs:

time="2023-04-10T14:19:26Z" level=info msg="INTER_NODE_LINK_TYPE: GRPC"
time="2023-04-10T14:19:26Z" level=info msg="Processing DEL request: lf5"
time="2023-04-10T14:19:26Z" level=info msg="Parsing cni .conf file"
time="2023-04-10T14:19:26Z" level=info msg="Del: Retrieving pod's metadata from meshnet daemon"
time="2023-04-10T14:19:26Z" level=info msg="Del: Retrieving pod's (lf5@default) metadata from meshnet daemon"
time="2023-04-10T14:19:26Z" level=info msg="Del: Topology data still exists in CRs, cleaning up it's status"
time="2023-04-10T14:19:26Z" level=info msg="Del: Iterating over each link for clean-up"
time="2023-04-10T14:19:26Z" level=info msg="Del: Pod lf5 is retrieving peer pod lf1 information from meshnet daemon"
time="2023-04-10T14:19:26Z" level=info msg="Creating Veth struct with NetNS: and intfName: eth1, IP:192.168.0.7/31"
time="2023-04-10T14:19:26Z" level=info msg="Del: Removing link eth1"
time="2023-04-10T14:19:26Z" level=error msg="Del: Error removing Veth link eth1 (veth) on pod lf5: failed to remove link \"eth1\" in \"/proc/9400/task/9400/ns/net\": operation not supported"
time="2023-04-10T14:19:26Z" level=info msg="Del: Setting skip-reverse flag on peer lf1"
time="2023-04-10T14:19:26Z" level=info msg="Del: Pod lf5 is retrieving peer pod lf4 information from meshnet daemon"
time="2023-04-10T14:19:27Z" level=info msg="Creating Veth struct with NetNS: and intfName: eth2, IP:192.168.0.27/31"
time="2023-04-10T14:19:27Z" level=info msg="Del: Removing link eth2"
time="2023-04-10T14:19:27Z" level=error msg="Del: Error removing Veth link eth2 (veth) on pod lf5: failed to lookup \"eth2\" in \"/proc/9400/task/9400/ns/net\": Link not found"
time="2023-04-10T14:19:27Z" level=info msg="Del: Setting skip-reverse flag on peer lf4"
time="2023-04-10T14:19:27Z" level=info msg="Del: Pod lf5 is retrieving peer pod lf6 information from meshnet daemon"
time="2023-04-10T14:19:27Z" level=info msg="Creating Veth struct with NetNS: and intfName: eth3, IP:192.168.0.30/31"
time="2023-04-10T14:19:27Z" level=info msg="Del: Removing link eth3"
time="2023-04-10T14:19:27Z" level=error msg="Del: Error removing Veth link eth3 (veth) on pod lf5: failed to lookup \"eth3\" in \"/proc/9400/task/9400/ns/net\": Link not found"
time="2023-04-10T14:19:27Z" level=info msg="Del: Setting skip-reverse flag on peer lf6"
time="2023-04-10T14:19:28Z" level=info msg="Del: Pod lf5 is retrieving peer pod lf5s information from meshnet daemon"
time="2023-04-10T14:19:28Z" level=info msg="Creating Veth struct with NetNS: and intfName: eth4, IP:192.168.0.32/31"
time="2023-04-10T14:19:28Z" level=info msg="Del: Removing link eth4"
time="2023-04-10T14:19:28Z" level=error msg="Del: Error removing Veth link eth4 (veth) on pod lf5: failed to lookup \"eth4\" in \"/proc/9400/task/9400/ns/net\": Link not found"
time="2023-04-10T14:19:28Z" level=info msg="Del: Setting skip-reverse flag on peer lf5s"
time="2023-04-10T14:19:29Z" level=info msg="meshnet cni call successful"
time="2023-04-10T14:19:29Z" level=info msg="INTER_NODE_LINK_TYPE: GRPC"
time="2023-04-10T14:19:29Z" level=info msg="Processing DEL request: lf5"
time="2023-04-10T14:19:29Z" level=info msg="Parsing cni .conf file"
time="2023-04-10T14:19:29Z" level=info msg="Del: Retrieving pod's metadata from meshnet daemon"
time="2023-04-10T14:19:29Z" level=info msg="Del: Retrieving pod's (lf5@default) metadata from meshnet daemon"
time="2023-04-10T14:19:29Z" level=info msg="Del: Topology data still exists in CRs, cleaning up it's status"
time="2023-04-10T14:19:30Z" level=info msg="Del: Iterating over each link for clean-up"
time="2023-04-10T14:19:30Z" level=info msg="Del: Pod lf5 is retrieving peer pod lf1 information from meshnet daemon"
time="2023-04-10T14:19:30Z" level=info msg="MakeGRPCChanDown: dialing remote node-->lf1@192.168.5.15:51111"
time="2023-04-10T14:19:30Z" level=info msg="Creating Veth struct with NetNS: and intfName: eth1, IP:192.168.0.7/31"
time="2023-04-10T14:19:30Z" level=info msg="Del: Removing link eth1"
time="2023-04-10T14:19:30Z" level=error msg="Del: Error removing Veth link eth1 (grpc) on pod lf5: failed to remove link \"eth1\" in \"/proc/9568/task/9568/ns/net\": operation not supported"
time="2023-04-10T14:19:30Z" level=info msg="Del: Setting skip-reverse flag on peer lf1"
time="2023-04-10T14:19:31Z" level=info msg="Del: Pod lf5 is retrieving peer pod lf4 information from meshnet daemon"
time="2023-04-10T14:19:31Z" level=info msg="MakeGRPCChanDown: dialing remote node-->lf4@192.168.5.15:51111"
time="2023-04-10T14:19:31Z" level=info msg="Creating Veth struct with NetNS: and intfName: eth2, IP:192.168.0.27/31"
time="2023-04-10T14:19:31Z" level=info msg="Del: Removing link eth2"
time="2023-04-10T14:19:31Z" level=error msg="Del: Error removing Veth link eth2 (grpc) on pod lf5: failed to lookup \"eth2\" in \"/proc/9568/task/9568/ns/net\": Link not found"
time="2023-04-10T14:19:31Z" level=info msg="Del: Setting skip-reverse flag on peer lf4"
time="2023-04-10T14:19:32Z" level=info msg="Del: Pod lf5 is retrieving peer pod lf6 information from meshnet daemon"
time="2023-04-10T14:19:32Z" level=info msg="MakeGRPCChanDown: dialing remote node-->lf6@192.168.5.15:51111"
time="2023-04-10T14:19:32Z" level=info msg="Creating Veth struct with NetNS: and intfName: eth3, IP:192.168.0.30/31"
time="2023-04-10T14:19:32Z" level=info msg="Del: Removing link eth3"
time="2023-04-10T14:19:32Z" level=error msg="Del: Error removing Veth link eth3 (grpc) on pod lf5: failed to lookup \"eth3\" in \"/proc/9568/task/9568/ns/net\": Link not found"
time="2023-04-10T14:19:32Z" level=info msg="Del: Setting skip-reverse flag on peer lf6"
time="2023-04-10T14:19:33Z" level=info msg="Del: Pod lf5 is retrieving peer pod lf5s information from meshnet daemon"
time="2023-04-10T14:19:33Z" level=info msg="MakeGRPCChanDown: dialing remote node-->lf5s@192.168.5.15:51111"
time="2023-04-10T14:19:33Z" level=info msg="Creating Veth struct with NetNS: and intfName: eth4, IP:192.168.0.32/31"
time="2023-04-10T14:19:33Z" level=info msg="Del: Removing link eth4"
time="2023-04-10T14:19:33Z" level=error msg="Del: Error removing Veth link eth4 (grpc) on pod lf5: failed to lookup \"eth4\" in \"/proc/9568/task/9568/ns/net\": Link not found"
time="2023-04-10T14:19:33Z" level=info msg="Del: Setting skip-reverse flag on peer lf5s"
time="2023-04-10T14:19:34Z" level=info msg="meshnet cni call successful"

But the pod is still running and has all the right ifaces:

> k get pod lf5
NAME   READY   STATUS    RESTARTS   AGE
lf5    2/2     Running   0          4m7s

~/Documents/projects/fizzy/meshnet-cni master*
> k exec -c tc lf5 -- ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0@if1500: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
    link/ether 9e:fa:ff:df:ed:c3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
1502: eth1@if1501: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc netem state UP mode DEFAULT group default qlen 1000
    link/ether c6:88:83:c2:95:c0 brd ff:ff:ff:ff:ff:ff link-netnsid 1
1504: eth2@if1503: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc netem state UP mode DEFAULT group default qlen 1000
    link/ether 92:1c:c9:09:81:34 brd ff:ff:ff:ff:ff:ff link-netnsid 2
1506: eth3@if1505: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc netem state UP mode DEFAULT group default qlen 1000
    link/ether 2a:c4:3f:a4:16:3a brd ff:ff:ff:ff:ff:ff link-netnsid 3
1508: eth4@if1507: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc netem state UP mode DEFAULT group default qlen 1000
    link/ether ee:49:98:1b:26:40 brd ff:ff:ff:ff:ff:ff link-netnsid 4

And, to cap it all off, Topology lf5 status has been wiped:

apiVersion: networkop.co.uk/v1beta1
kind: Topology
metadata:
  creationTimestamp: "2023-04-10T14:12:14Z"
  generation: 1
  labels:
    app: mimesis-emulation-default
  name: lf5
  namespace: default
  resourceVersion: "814247"
  uid: c0b785db-6a10-410d-9aa6-b17993c75270
spec:
  links:
  - local_intf: eth1
    local_ip: 192.168.0.7/31
    peer_intf: eth4
    peer_ip: 192.168.0.6/31
    peer_pod: lf1
    uid: 4
  - local_intf: eth2
    local_ip: 192.168.0.27/31
    peer_intf: eth3
    peer_ip: 192.168.0.26/31
    peer_pod: lf4
    uid: 14
  - local_intf: eth3
    local_ip: 192.168.0.30/31
    peer_intf: eth2
    peer_ip: 192.168.0.31/31
    peer_pod: lf6
    uid: 16
  - local_intf: eth4
    local_ip: 192.168.0.32/31
    peer_intf: eth1
    peer_ip: 192.168.0.33/31
    peer_pod: lf5s
    uid: 17
status:
  net_ns: ""
  skipped: []
  src_ip: ""

From which I predict that if I delete a neighbor and recreate it, it'll get stuck b/c lf5 "does not exist". And indeed, 'tis so, as the new lf1 pod is missing eth4:

> k get pods lf1
NAME   READY   STATUS     RESTARTS   AGE
lf1    0/2     Init:0/1   0          19s

~/Documents/projects/fizzy/meshnet-cni master*
> k exec -c miminit lf1 -- ip link
[...]
1515: eth3@if1514: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 96:4c:f7:18:b3:09 brd ff:ff:ff:ff:ff:ff link-netnsid 3
1517: eth5@if1516: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether c6:ee:0c:06:ff:7a brd ff:ff:ff:ff:ff:ff link-netnsid 4
[...]

ETA: Which further implies that I can recover by deleting the /pair/ of problem pods and redeploying them, and indeed, that works.

It's possible that when I see this condition on a full start, it's b/c it came right after a full teardown, so this double call of CNI DEL hadn't completed yet.

Cerebus commented 1 year ago

Is this relevant: https://github.com/kubernetes/kubernetes/issues/20379#issuecomment-255272531

networkop commented 1 year ago

It looks like you're onto something. I'm AFK for the rest of the week but it'd be nice to confirm where that second DELETE call originates from. Can you capture the logs from the API server at the same time to confirm the assumption from the k/k issue you linked?

Cerebus commented 1 year ago

A couple of things--

IMHO this is a meshnet bug; DEL commands are not idempotent. Pod networking is tracked by the tuple (namespace, name), but this isn't unique over time. As a result, interleaving CNI ADD and DEL commands leads to errors when late DEL commands erase state stored by earlier ADD commands.

One way to resolve this is to track status by CONTAINER_ID (i.e., the sandbox). Conveniently, this is a CNI argument (the following is my own debugging statement, don't look for it anywhere):

time="2023-04-10T18:56:32Z" level=info msg="DEL request arguments" args="&{ContainerID:2444f14ea74d6a97a1532899c8edea6164b56993b79d5adbfc70db4af942605e Netns:/var/run/netns/cni-ba59b41e-41a3-0eda-d69a-19c420a1b3a7 IfName:eth0 Args:K8S_POD_INFRA_CONTAINER_ID=2444f14ea74d6a97a1532899c8edea6164b56993b79d5adbfc70db4af942605e;K8S_POD_UID=f18e557f-30fb-449d-a45e-8364d4e7b07d;IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=pod3 Path:/usr/libexec/cni StdinData:[123 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 48 34 44 34 100 110 115 34 58 123 125 44 34 105 112 97 109 34 58 123 125 44 34 110 97 109 101 34 58 34 99 98 114 48 34 44 34 116 121 112 101 34 58 34 109 101 115 104 110 101 116 34 125]}"

I'll work on this unless someone has a better idea. I'd like to do it without breaking changes, but just looking at the status subresource CRD, that may not be possible.
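
Conceptually, I'm after something like this (illustrative only; the Topology status has no container ID field today, which is the potentially breaking part):

// On ADD, record the sandbox/container ID in the status; a late DEL that
// carries a different (older) ID is then treated as a no-op.
type status struct {
	ContainerID string
	SrcIP       string
	NetNS       string
	Skipped     []string
}

func handleDel(s *status, containerID string) {
	if s.ContainerID != containerID {
		return // a newer ADD owns this topology; ignore the stale DEL
	}
	*s = status{}
}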

kingshukdev commented 1 year ago

@Cerebus Yes, I was also working on handling multiple deletes. This is common K8s behavior but was not handled in meshnet. PR #74 looks good to me; I will stop on my side and wait for #74 to come in.

We still need to handle interleaving of ADD and DEL: if a DEL comes in while an ADD is in motion, meshnet will be in trouble. I will work on that one.
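
If it turns out the runtime does not serialize ADD/DEL for the same pod, one option is to serialize them in the daemon with a keyed lock; a rough sketch (needs `import "sync"`, and is not current meshnet code):

// lockForPod returns a per-pod mutex so overlapping ADD/DEL handlers for
// the same pod run one at a time.
var (
	mu       sync.Mutex
	podLocks = map[string]*sync.Mutex{}
)

func lockForPod(key string) *sync.Mutex {
	mu.Lock()
	defer mu.Unlock()
	if _, ok := podLocks[key]; !ok {
		podLocks[key] = &sync.Mutex{}
	}
	return podLocks[key]
}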

Cerebus commented 1 year ago

Wouldn't kubelet serialize the operations in that case?

kingshukdev commented 1 year ago

That's very reasonable and that's how it should be, but I have not been able to find definitive documentation on it yet.

sar772004 commented 1 year ago

I still have this issue even with the fix for #74: certain veth pairs are not being created. I have still not seen this issue with VXLAN links. @Cerebus did your issue get solved?

cc: @networkop @kingshukdev

What I notice is this error, followed by a delete of one of the pods (dut-c); after that, all the other pods are missing their links to dut-c:

time="2023-04-26T21:33:42-04:00" level=info msg="Creating Veth struct with NetNS:/proc/19196/ns/net and intfName: eth5, IP:" time="2023-04-26T21:33:42-04:00" level=info msg="Does the link already exist? Local:false, Peer:false" time="2023-04-26T21:33:42-04:00" level=info msg="Neither link exists. Checking if we've been skipped" time="2023-04-26T21:33:43-04:00" level=info msg="Have we been skipped by our peer test? true" time="2023-04-26T21:33:43-04:00" level=info msg="DO we have a higher priority? false" time="2023-04-26T21:33:43-04:00" level=info msg="Peer POD has skipped us or we have a higher priority" time="2023-04-26T21:33:43-04:00" level=error msg="Error when creating a new VEth pair with koko: failed to rename link koko1662009491 -> e1-21: file exists" time="2023-04-26T21:33:43-04:00" level=info msg="MY VETH STRUCT: (api.VEth)(0xc0003b2120)({\n NsName: (string) (len=18) \"/proc/20429/ns/net\",\n LinkName: (string) (len=5) \"e1-21\",\n IPAddr: ([]net.IPNet) ,\n MirrorEgress: (string) \"\",\n MirrorIngress: (string) \"\"\n})\n" time="2023-04-26T21:33:43-04:00" level=info msg="PEER STRUCT: (api.VEth)(0xc000538000)({\n NsName: (string) (len=18) \"/proc/19196/ns/net\",\n LinkName: (string) (len=4) \"eth5\",\n IPAddr: ([]net.IPNet) ,\n MirrorEgress: (string) \"\",\n MirrorIngress: (string) \"\"\n})\n" time="2023-04-26T21:33:43-04:00" level=error msg="failed to run meshnet cni: "

Then there is a DEL for dut-c:

time="2023-04-26T21:33:43-04:00" level=info msg="Processing DEL request: dut-c" time="2023-04-26T21:33:43-04:00" level=info msg="DEL request arguments" args="&{ContainerID:d6984a3b01f1af3084952383391dbb9eadce711399f646d0c3b8b97d19e78f25 Netns:/proc/20429/ns/net IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=dut-c;K8S_POD_INFRA_CONTAINER_ID=d6984a3b01f1af3084952383391dbb9eadce711399f646d0c3b8b97d19e78f25 Path:/opt/cni/bin StdinData:[123 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 48 34 44 34 100 110 115 34 58 123 125 44 34 105 112 97 109 34 58 123 125 44 34 110 97 109 101 34 58 34 109 117 108 116 117 115 45 99 110 105 45 110 101 116 119 111 114 107 34 44 34 116 121 112 101 34 58 34 109 101 115 104 110 101 116 34 125]}" time="2023-04-26T21:33:43-04:00" level=info msg="Parsing cni .conf file" time="2023-04-26T21:33:43-04:00" level=info msg="Del: Retrieving pod's (dut-c@default) metadata from meshnet daemon" time="2023-04-26T21:33:43-04:00" level=info msg="Del: Pod default:dut-c is a duplicate; skipping" time="2023-04-26T21:33:43-04:00" level=info msg="meshnet cni call successful" time="2023-04-26T21:33:44-04:00" level=info msg="INTER_NODE_LINK_TYPE: VXLAN" time="2023-04-26T21:33:44-04:00" level=info msg="Processing DEL request: dut-c" time="2023-04-26T21:33:44-04:00" level=info msg="DEL request arguments" args="&{ContainerID:d6984a3b01f1af3084952383391dbb9eadce711399f646d0c3b8b97d19e78f25 Netns: IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=dut-c;K8S_POD_INFRA_CONTAINER_ID=d6984a3b01f1af3084952383391dbb9eadce711399f646d0c3b8b97d19e78f25 Path:/opt/cni/bin StdinData:[123 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 48 34 44 34 100 110 115 34 58 123 125 44 34 105 112 97 109 34 58 123 125 44 34 110 97 109 101 34 58 34 109 117 108 116 117 115 45 99 110 105 45 110 101 116 119 111 114 107 34 44 34 116 121 112 101 34 58 34 109 101 115 104 110 101 116 34 125]}" time="2023-04-26T21:33:44-04:00" level=info msg="Parsing cni .conf file" time="2023-04-26T21:33:44-04:00" level=info msg="Del: Retrieving pod's (dut-c@default) metadata from meshnet daemon" time="2023-04-26T21:33:44-04:00" level=info msg="Del: Pod default:dut-c is a duplicate; skipping" time="2023-04-26T21:33:44-04:00" level=info msg="meshnet cni call successful" time="2023-04-26T21:33:44-04:00" level=info msg="INTER_NODE_LINK_TYPE: VXLAN" time="2023-04-26T21:33:44-04:00" level=info msg="Parsing cni .conf file" time="2023-04-26T21:33:44-04:00" level=info msg="Parsing CNI_ARGS environment variable" time="2023-04-26T21:33:44-04:00" level=info msg="Processing ADD POD dut-c in namespace default" time="2023-04-26T21:33:44-04:00" level=info msg="Attempting to connect to local meshnet daemon" time="2023-04-26T21:33:44-04:00" level=info msg="Add: Retrieving local pod information from meshnet daemon" time="2023-04-26T21:33:44-04:00" level=info msg="Add: Setting pod alive status on meshnet daemon" time="2023-04-26T21:33:44-04:00" level=info msg="Add: Starting to traverse all links"

kingshukdev commented 1 year ago

The portion of the log given here shows all the 'Del' requests being treated as duplicates, which looks a little odd. Maybe the log is incomplete. I can try this at my end using a single-node cluster with multiple interconnected pods (so that all links are veth pairs) - will that be close to your setup?

sar772004 commented 1 year ago

> The portion of the log given here shows all the 'Del' requests being treated as duplicates, which looks a little odd. Maybe the log is incomplete. I can try this at my end using a single-node cluster with multiple interconnected pods (so that all links are veth pairs) - will that be close to your setup?

That is right, I have a lot of connections and 8 pods interconnected. Also, I noticed that the logs in /var/log/cni.log do not report the local pod and peer pod info in every log message during this veth-create ADD, which would make it easier to debug the sequence; I'm trying to collect some logs with that added, especially to find out why the veth endpoint already existed on dut-c. Also, I'm not sure we should fail and start the DEL process if the veth endpoint exists; can't we just continue?

The return I mean is this one in meshnet.go:

log.Infof("PEER STRUCT: %+v", spew.Sdump(peerVeth))
return err

sar772004 commented 1 year ago

After adding some extra logging to meshnet.go around veth creation in the cmdAdd function, I can see there is a possibility of a race condition when one end of the link, pod X, has entered cmdAdd and, in parallel, a cmdAdd request comes in from the other end of the link, pod Y. Because of this OR condition, https://github.com/networkop/meshnet-cni/blob/master/plugin/meshnet.go#L307, it's possible that both try to add the link: X because it has the higher priority, Y because it still sees isSkipped set to true. But the interface has already been created, so whichever comes last fails the koko.makeVeth call and returns an error, which causes other cascading issues where links to pod Y end up missing on other pods.
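
One way to tolerate the race instead of failing (a sketch of the "just continue" idea only; the helper names are placeholders, not meshnet's real API):

// createOrTolerate treats a failed veth creation as benign if the local
// interface turns out to exist, i.e. the peer's ADD won the race.
func createOrTolerate(makeVeth func() error, linkExists func() bool) error {
	if err := makeVeth(); err != nil {
		if linkExists() {
			return nil // peer already created the pair; not an error
		}
		return err
	}
	return nil
}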

sar772004 commented 1 year ago

@networkop @kingshukdev @Cerebus Please review the changes in this pull request; it fixed my issue caused by this race condition: https://github.com/networkop/meshnet-cni/pull/76