k8snetworkplumbingwg / ovs-cni

Open vSwitch CNI plugin
Apache License 2.0

CNI failed to retrieve network namespace path #106

Closed GeorgFoc closed 4 years ago

GeorgFoc commented 4 years ago

Hi, I think we may have found a bug, or perhaps I made a mistake in the configuration; I would be thankful for help either way.

In general, ovs-cni works great. But if I hard-kill a Kubernetes node (or reboot it without draining it first) while pods using ovs-cni are running on it, those pods do not restart properly.

I started the samplepod from the documentation and then rebooted the Ubuntu system that runs the Kubernetes node (a plain reboot, no drain beforehand). Before the reboot the pod was running fine. After the reboot, kubectl get pods says: samplepod-1 0/1 Error 0 8m1s <none>
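
For reference, a minimal sketch of the reproduction sequence (the manifest file name is an assumption; the pod itself is the samplepod from the ovs-cni documentation):

# Deploy the sample pod from the ovs-cni documentation (file name assumed).
kubectl apply -f samplepod.yaml
kubectl get pods          # samplepod-1 is Running

# Hard-restart the node hosting the pod, without draining it first.
sudo reboot

# After the node comes back, the pod never recovers.
kubectl get pods          # samplepod-1 0/1 Error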

When I describe the Pod I get: Normal SandboxChanged 8s (x14 over 2m41s) kubelet, HOSTNAME Pod sandbox changed, it will be killed and re-created.

And the kubelet log says:

Feb 18 11:34:35 HOSTNAME kubelet[1219]: 2020/02/18 11:34:35 CNI DEL was called for container ID: bc79277db8b567a2864807a9d310eb240bffed01d840df0135ea34198ffd017b, network namespace , interface name
Feb 18 11:34:35 HOSTNAME kubelet[1219]: panic: This should never happen, if it does, it means caller does not pass container network namespace as a parameter and therefore OVS port cleanup will not
Feb 18 11:34:35 HOSTNAME kubelet[1219]: goroutine 1 [running, locked to thread]:
Feb 18 11:34:35 HOSTNAME kubelet[1219]: github.com/kubevirt/ovs-cni/pkg/plugin.CmdDel(0xc0000f2620, 0xc000014500, 0x5)
Feb 18 11:34:35 HOSTNAME kubelet[1219]:         /home/travis/gopath/src/github.com/kubevirt/ovs-cni/pkg/plugin/plugin.go:276 +0x792
Feb 18 11:34:35 HOSTNAME kubelet[1219]: github.com/containernetworking/cni/pkg/skel.(*dispatcher).checkVersionAndCall(0xc00009bed8, 0xc0000f2620, 0x71a080, 0xc0000cc4e0, 0x6d0d30, 0x0, 0xc00009bec0)
Feb 18 11:34:35 HOSTNAME kubelet[1219]:         /home/travis/gopath/src/github.com/kubevirt/ovs-cni/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:185 +0x259
Feb 18 11:34:35 HOSTNAME kubelet[1219]: github.com/containernetworking/cni/pkg/skel.(*dispatcher).pluginMain(0xc00009bed8, 0x6d0d20, 0x6d0d28, 0x6d0d30, 0x71a080, 0xc0000cc4e0, 0xc00001c060, 0x25, 0
Feb 18 11:34:35 HOSTNAME kubelet[1219]:         /home/travis/gopath/src/github.com/kubevirt/ovs-cni/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:251 +0x3d4
Feb 18 11:34:35 HOSTNAME kubelet[1219]: github.com/containernetworking/cni/pkg/skel.PluginMainWithError(...)
Feb 18 11:34:35 HOSTNAME kubelet[1219]:         /home/travis/gopath/src/github.com/kubevirt/ovs-cni/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:286
Feb 18 11:34:35 HOSTNAME kubelet[1219]: github.com/containernetworking/cni/pkg/skel.PluginMain(0x6d0d20, 0x6d0d28, 0x6d0d30, 0x71a080, 0xc0000cc4e0, 0xc00001c060, 0x25)
Feb 18 11:34:35 HOSTNAME kubelet[1219]:         /home/travis/gopath/src/github.com/kubevirt/ovs-cni/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:301 +0x129
Feb 18 11:34:35 HOSTNAME kubelet[1219]: main.main()
Feb 18 11:34:35 HOSTNAME kubelet[1219]:         /home/travis/gopath/src/github.com/kubevirt/ovs-cni/cmd/plugin/main.go:26 +0x11c
Feb 18 11:34:35 HOSTNAME kubelet[1219]: E0218 11:34:35.943468    1219 cni.go:379] Error deleting loadbalancer_samplepod-1/bc79277db8b567a2864807a9d310eb240bffed01d840df0135ea34198ffd017b from networ
Feb 18 11:34:35 HOSTNAME kubelet[1219]: E0218 11:34:35.944454    1219 remote_runtime.go:128] StopPodSandbox "bc79277db8b567a2864807a9d310eb240bffed01d840df0135ea34198ffd017b" from runtime service fa
Feb 18 11:34:35 HOSTNAME kubelet[1219]: E0218 11:34:35.944527    1219 kuberuntime_manager.go:878] Failed to stop sandbox {"docker" "bc79277db8b567a2864807a9d310eb240bffed01d840df0135ea34198ffd017b"}
Feb 18 11:34:35 HOSTNAME kubelet[1219]: E0218 11:34:35.944618    1219 kuberuntime_manager.go:658] killPodWithSyncResult failed: failed to "KillPodSandbox" for "08155d12-cdcd-4c2d-a7cc-ad8f8f14761e"
Feb 18 11:34:35 HOSTNAME kubelet[1219]: E0218 11:34:35.944663    1219 pod_workers.go:191] Error syncing pod 08155d12-cdcd-4c2d-a7cc-ad8f8f14761e ("samplepod-1_loadbalancer(08155d12-cdcd-4c2d-a7cc-ad
Feb 18 11:34:38 HOSTNAME kubelet[1219]: E0218 11:34:38.621370    1219 summary_sys_containers.go:47] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stat
Feb 18 11:34:38 HOSTNAME kubelet[1219]: W0218 11:34:38.854130    1219 cni.go:328] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "c12faf953
Feb 18 11:34:38 HOSTNAME kubelet[1219]: 2020/02/18 11:34:38 CNI DEL was called for container ID: c12faf95367d396da18c3b90b1099ebce6b372bc073622bba1d1b88bb621e2c6, network namespace , interface name
Feb 18 11:34:38 HOSTNAME kubelet[1219]: panic: This should never happen, if it does, it means caller does not pass container network namespace as a parameter and therefore OVS port cleanup will not
Feb 18 11:34:38 HOSTNAME kubelet[1219]: goroutine 1 [running, locked to thread]:
Feb 18 11:34:38 HOSTNAME kubelet[1219]: github.com/kubevirt/ovs-cni/pkg/plugin.CmdDel(0xc000106690, 0xc0000ea4a8, 0x5)
Feb 18 11:34:38 HOSTNAME kubelet[1219]:         /home/travis/gopath/src/github.com/kubevirt/ovs-cni/pkg/plugin/plugin.go:276 +0x792
Feb 18 11:34:38 HOSTNAME kubelet[1219]: github.com/containernetworking/cni/pkg/skel.(*dispatcher).checkVersionAndCall(0xc0000cbed8, 0xc000106690, 0x71a080, 0xc0000d04e0, 0x6d0d30, 0x0, 0xc0000cbec0)
Feb 18 11:34:38 HOSTNAME kubelet[1219]:         /home/travis/gopath/src/github.com/kubevirt/ovs-cni/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:185 +0x259
Feb 18 11:34:38 HOSTNAME kubelet[1219]: github.com/containernetworking/cni/pkg/skel.(*dispatcher).pluginMain(0xc0000cbed8, 0x6d0d20, 0x6d0d28, 0x6d0d30, 0x71a080, 0xc0000d04e0, 0xc00012a000, 0x25, 0
Feb 18 11:34:38 HOSTNAME kubelet[1219]:         /home/travis/gopath/src/github.com/kubevirt/ovs-cni/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:251 +0x3d4
Feb 18 11:34:38 HOSTNAME kubelet[1219]: github.com/containernetworking/cni/pkg/skel.PluginMainWithError(...)
Feb 18 11:34:38 HOSTNAME kubelet[1219]:         /home/travis/gopath/src/github.com/kubevirt/ovs-cni/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:286
Feb 18 11:34:38 HOSTNAME kubelet[1219]: github.com/containernetworking/cni/pkg/skel.PluginMain(0x6d0d20, 0x6d0d28, 0x6d0d30, 0x71a080, 0xc0000d04e0, 0xc00012a000, 0x25)
Feb 18 11:34:38 HOSTNAME kubelet[1219]:         /home/travis/gopath/src/github.com/kubevirt/ovs-cni/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:301 +0x129
Feb 18 11:34:38 HOSTNAME kubelet[1219]: main.main()
Feb 18 11:34:38 HOSTNAME kubelet[1219]:         /home/travis/gopath/src/github.com/kubevirt/ovs-cni/cmd/plugin/main.go:26 +0x11c
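
For context, here is a minimal sketch (not the actual ovs-cni source) of the kind of guard behind the panic at pkg/plugin/plugin.go:276: after the ungraceful reboot the terminated container's network namespace no longer exists, so the runtime calls DEL with an empty namespace path and the plugin bails out instead of cleaning up its OVS port.

package plugin

import "github.com/containernetworking/cni/pkg/skel"

// cmdDel sketches the guard that refuses to proceed when no network
// namespace is supplied on DEL; without it the plugin cannot locate the
// port it created for the container.
func cmdDel(args *skel.CmdArgs) error {
	if args.Netns == "" {
		// After a hard node reboot the container's netns is gone, so the
		// runtime passes an empty path and this guard fires.
		panic("caller did not pass the container network namespace, OVS port cleanup cannot proceed")
	}
	// ... normal lookup and deletion of the OVS port would follow here ...
	return nil
}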

The Open vSwitch log after the reboot says:

...
     Port "vethb0617177"
            Interface "vethb0617177"
                error: "could not open network device vethb0617177 (No such device)"
        Port "veth5f25970b"
            Interface "veth5f25970b"
                error: "could not open network device veth5f25970b (No such device)"
        Port "veth62e7faab"
            Interface "veth62e7faab"
                error: "could not open network device veth62e7faab (No such device)"
        Port "vethc5110e9d"
            Interface "vethc5110e9d"
                error: "could not open network device vethc5110e9d (No such device)"
        Port "vethef8643f7"
            Interface "vethef8643f7"
                error: "could not open network device vethef8643f7 (No such device)"
        Port "veth12efe3d7"
            Interface "veth12efe3d7"
                error: "could not open network device veth12efe3d7 (No such device)"

Here is some information that could help to reproduce this behaviour:
Kubernetes on a bare-metal server
Kubernetes version: v1.16.3
Multus-CNI version: v3.4

Here is the deployment YAML:

---
apiVersion: v1
kind: Namespace
metadata:
  name: cluster-network-addons
  labels:
    name: cluster-network-addons

---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: networkaddonsconfigs.networkaddonsoperator.network.kubevirt.io
spec:
  group: networkaddonsoperator.network.kubevirt.io
  names:
    kind: NetworkAddonsConfig
    listKind: NetworkAddonsConfigList
    plural: networkaddonsconfigs
    singular: networkaddonsconfig
  scope: Cluster
  subresources:
    status: {}
  validation:
    openAPIV3Schema:
      properties:
        apiVersion:
          type: string
        kind:
          type: string
        metadata:
          type: object
        spec:
          type: object
        status:
          type: object
  version: v1alpha1
  versions:
  - name: v1alpha1
    served: true
    storage: true
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    kubevirt.io: ""
  name: cluster-network-addons-operator
  namespace: cluster-network-addons
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    name: cluster-network-addons-operator
  name: cluster-network-addons-operator
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - privileged
  resources:
  - securitycontextconstraints
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - operator.openshift.io
  resources:
  - networks
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - networkaddonsoperator.network.kubevirt.io
  resources:
  - networkaddonsconfigs
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - '*'
  resources:
  - '*'
  verbs:
  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    kubevirt.io: ""
  name: cluster-network-addons-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-network-addons-operator
subjects:
  - kind: ServiceAccount
    name: cluster-network-addons-operator
    namespace: cluster-network-addons
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    name: cluster-network-addons-operator
  name: cluster-network-addons-operator
  namespace: cluster-network-addons
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - configmaps
  verbs:
  - get
  - list
  - watch
  - create
  - patch
  - update
  - delete
- apiGroups:
  - apps
  resources:
  - deployments
  - replicasets
  verbs:
  - get
  - list
  - watch
  - create
  - patch
  - update
  - delete

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    kubevirt.io: ""
  name: cluster-network-addons-operator
  namespace: cluster-network-addons
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-network-addons-operator
subjects:
  - kind: ServiceAccount
    name: cluster-network-addons-operator

---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    networkaddonsoperator.network.kubevirt.io/version: 0.23.0
  name: cluster-network-addons-operator
  namespace: cluster-network-addons
spec:
  replicas: 1
  selector:
    matchLabels:
      name: cluster-network-addons-operator
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        name: cluster-network-addons-operator
    spec:
      containers:
      - env:
        - name: MULTUS_IMAGE
          value: quay.io/kubevirt/cluster-network-addon-multus:v3.2.0-1.gitbf61002
        - name: LINUX_BRIDGE_IMAGE
          value: quay.io/kubevirt/cni-default-plugins:v0.8.1
        - name: LINUX_BRIDGE_MARKER_IMAGE
          value: quay.io/kubevirt/bridge-marker:0.2.0
        - name: NMSTATE_HANDLER_IMAGE
          value: quay.io/nmstate/kubernetes-nmstate-handler:v0.12.0
        - name: OVS_CNI_IMAGE
          value: quay.io/kubevirt/ovs-cni-plugin:v0.8.0
        - name: OVS_MARKER_IMAGE
          value: quay.io/kubevirt/ovs-cni-marker:v0.8.0
        - name: KUBEMACPOOL_IMAGE
          value: quay.io/kubevirt/kubemacpool:v0.8.0
        - name: OPERATOR_IMAGE
          value: quay.io/kubevirt/cluster-network-addons-operator:0.23.0
        - name: OPERATOR_NAME
          value: cluster-network-addons-operator
        - name: OPERATOR_VERSION
          value: 0.23.0
        - name: OPERATOR_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: OPERAND_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: WATCH_NAMESPACE
        image: quay.io/kubevirt/cluster-network-addons-operator:0.23.0
        imagePullPolicy: Always
        name: cluster-network-addons-operator
        resources: {}
      serviceAccountName: cluster-network-addons-operator

Thanks for your help, and let me know if you need more information.

phoracek commented 4 years ago

Hello @GeorgFoc, thanks for opening the issue!

The OVS database is persistent across reboots, and we also save metadata about pod connections to it (https://github.com/kubevirt/ovs-cni/blob/d61bd2b26f6c1df8ed38a4572b2ab4adaf48245d/pkg/ovsdb/ovsdb.go#L298). So when a new CNI attachment is executed for the same pod after a reboot, the stale metadata from the previous run is picked up and that breaks the code.

The fix would be to add cleanup to ADD: if there is a leftover port from a previous run, we clean it up before creating the new one. Alternatively, we could fix the code that fails because of the metadata collision. A sketch of the first option follows.
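
A minimal sketch of the cleanup-on-ADD idea, under the assumption that leftover ports can be found by the container ID recorded in their external_ids (the key name contID is illustrative, not the actual ovs-cni schema). CmdAdd would call this before creating the new port:

package plugin

import (
	"fmt"
	"os/exec"
	"strings"
)

// removeStalePorts deletes any OVS port that still references the given
// container ID from a previous run, so a fresh CNI ADD after a node reboot
// starts from a clean state.
func removeStalePorts(bridge, containerID string) error {
	// Find ports whose external_ids reference the container (key name assumed).
	out, err := exec.Command("ovs-vsctl", "--bare", "--columns=name",
		"find", "Port", fmt.Sprintf("external_ids:contID=%s", containerID)).Output()
	if err != nil {
		return fmt.Errorf("looking up stale ports: %v", err)
	}
	for _, port := range strings.Fields(string(out)) {
		// --if-exists keeps the delete idempotent if the port is already gone.
		if err := exec.Command("ovs-vsctl", "--if-exists", "del-port", bridge, port).Run(); err != nil {
			return fmt.Errorf("deleting stale port %s: %v", port, err)
		}
	}
	return nil
}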

I don't have capacity to fix this at the moment, would you mind looking into it and proposing a PR?

GeorgFoc commented 4 years ago

Hi @phoracek, thanks for the quick answer. I will hand the issue to our Go developers to look into it. We will hopefully open a PR soon.

phoracek commented 4 years ago

Resolved: #109

@GeorgFoc @fhofherr I released a new version of OVS CNI which includes the fix: https://github.com/kubevirt/ovs-cni/releases/tag/v0.11.0
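
For clusters deployed with the operator manifest above, picking up the fix should roughly amount to pointing the operator at the new plugin image, along these lines (a hedged example; whether the operator version above accepts this tag is an assumption):

        - name: OVS_CNI_IMAGE
          value: quay.io/kubevirt/ovs-cni-plugin:v0.11.0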

phoracek commented 4 years ago

Closing since there was no recent activity and this seems to be resolved.