k8snetworkplumbingwg / sriov-network-device-plugin

SRIOV network device plugin for Kubernetes

CDI: two SriovNetworkNodePolicy configs (8 VFs) but only four deviceNodes #576

Closed: cyclinder closed this issue 1 week ago

cyclinder commented 1 month ago

What happened?

I have two SriovNetworkNodePolicy configs; see below:

root@controller-node-1:/home/cyclinder/sriov# kubectl get sriovnetworknodepolicies.sriovnetwork.openshift.io -A -o wide
NAMESPACE     NAME      AGE
kube-system   policy1   43m
kube-system   policy2   7m44s
root@controller-node-1:/home/cyclinder/sriov# kubectl get sriovnetworknodestates.sriovnetwork.openshift.io -n kube-system -o yaml
apiVersion: v1
items:
- apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetworkNodeState
  metadata:
    annotations:
      sriovnetwork.openshift.io/current-state: Idle
      sriovnetwork.openshift.io/desired-state: Idle
    creationTimestamp: "2024-07-16T03:50:05Z"
    generation: 5
    name: worker-node-1
    namespace: kube-system
    ownerReferences:
    - apiVersion: sriovnetwork.openshift.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: SriovOperatorConfig
      name: default
      uid: 2427ae73-ef95-4f57-aa85-c681ff9a48bb
    resourceVersion: "40147316"
    uid: 67f59ed6-85a3-4913-a1bf-3697dd008310
  spec:
    interfaces:
    - name: enp4s0f0np0
      numVfs: 4
      pciAddress: "0000:04:00.0"
      vfGroups:
      - isRdma: true
        policyName: policy1
        resourceName: rdma_resource
        vfRange: 0-3
    - name: enp4s0f1np1
      numVfs: 4
      pciAddress: "0000:04:00.1"
      vfGroups:
      - isRdma: true
        policyName: policy2
        resourceName: rdma_resource1
        vfRange: 0-3
  status:
    interfaces:
    - Vfs:
      - deviceID: "1018"
        driver: mlx5_core
        mac: e6:bc:60:22:14:6c
        mtu: 1500
        name: enp4s0f0v0
        pciAddress: "0000:04:00.2"
        vendor: 15b3
        vfID: 0
      - deviceID: "1018"
        driver: mlx5_core
        mac: a2:d7:89:ad:5d:b7
        mtu: 1500
        name: enp4s0f0v1
        pciAddress: "0000:04:00.3"
        vendor: 15b3
        vfID: 1
      - deviceID: "1018"
        driver: mlx5_core
        mac: d2:0b:3f:c9:ab:a4
        mtu: 1500
        name: enp4s0f0v2
        pciAddress: "0000:04:00.4"
        vendor: 15b3
        vfID: 2
      - deviceID: "1018"
        driver: mlx5_core
        mac: 4e:37:ab:b2:68:d7
        mtu: 1500
        name: enp4s0f0v3
        pciAddress: "0000:04:00.5"
        vendor: 15b3
        vfID: 3
      deviceID: "1017"
      driver: mlx5_core
      eSwitchMode: legacy
      linkSpeed: 25000 Mb/s
      linkType: ETH
      mac: 04:3f:72:d0:d2:b2
      mtu: 1500
      name: enp4s0f0np0
      numVfs: 4
      pciAddress: "0000:04:00.0"
      totalvfs: 4
      vendor: 15b3
    - Vfs:
      - deviceID: "1018"
        driver: mlx5_core
        mac: 3e:3a:7f:af:11:99
        mtu: 1500
        name: enp4s0f1v0
        pciAddress: "0000:04:00.6"
        vendor: 15b3
        vfID: 0
      - deviceID: "1018"
        driver: mlx5_core
        mac: 6e:c1:0e:52:ea:d8
        mtu: 1500
        name: enp4s0f1v1
        pciAddress: "0000:04:00.7"
        vendor: 15b3
        vfID: 1
      - deviceID: "1018"
        driver: mlx5_core
        mac: 8e:c8:1d:fc:69:0d
        mtu: 1500
        name: enp4s0f1v2
        pciAddress: "0000:04:01.0"
        vendor: 15b3
        vfID: 2
      - deviceID: "1018"
        driver: mlx5_core
        mac: 52:4c:5c:b1:1d:44
        mtu: 1500
        name: enp4s0f1v3
        pciAddress: "0000:04:01.1"
        vendor: 15b3
        vfID: 3
      deviceID: "1017"
      driver: mlx5_core
      eSwitchMode: legacy
      linkSpeed: 10000 Mb/s
      linkType: ETH
      mac: 04:3f:72:d0:d2:b3
      mtu: 1500
      name: enp4s0f1np1
      numVfs: 4
      pciAddress: "0000:04:00.1"
      totalvfs: 4
      vendor: 15b3
    syncStatus: Succeeded
kind: List
metadata:
  resourceVersion: ""
root@worker-node-1:~# cat /var/run/cdi/sriov-dp-spidernet.io.yaml
cdiVersion: 0.5.0
containerEdits: {}
devices:
- containerEdits:
    deviceNodes:
    - hostPath: /dev/infiniband/issm6
      path: /dev/infiniband/issm6
      permissions: rw
    - hostPath: /dev/infiniband/umad6
      path: /dev/infiniband/umad6
      permissions: rw
    - hostPath: /dev/infiniband/uverbs6
      path: /dev/infiniband/uverbs6
      permissions: rw
    - hostPath: /dev/infiniband/rdma_cm
      path: /dev/infiniband/rdma_cm
      permissions: rw
  name: "0000:04:00.6"
- containerEdits:
    deviceNodes:
    - hostPath: /dev/infiniband/issm7
      path: /dev/infiniband/issm7
      permissions: rw
    - hostPath: /dev/infiniband/umad7
      path: /dev/infiniband/umad7
      permissions: rw
    - hostPath: /dev/infiniband/uverbs7
      path: /dev/infiniband/uverbs7
      permissions: rw
    - hostPath: /dev/infiniband/rdma_cm
      path: /dev/infiniband/rdma_cm
      permissions: rw
  name: "0000:04:00.7"
- containerEdits:
    deviceNodes:
    - hostPath: /dev/infiniband/issm8
      path: /dev/infiniband/issm8
      permissions: rw
    - hostPath: /dev/infiniband/umad8
      path: /dev/infiniband/umad8
      permissions: rw
    - hostPath: /dev/infiniband/uverbs8
      path: /dev/infiniband/uverbs8
      permissions: rw
    - hostPath: /dev/infiniband/rdma_cm
      path: /dev/infiniband/rdma_cm
      permissions: rw
  name: "0000:04:01.0"
- containerEdits:
    deviceNodes:
    - hostPath: /dev/infiniband/issm9
      path: /dev/infiniband/issm9
      permissions: rw
    - hostPath: /dev/infiniband/umad9
      path: /dev/infiniband/umad9
      permissions: rw
    - hostPath: /dev/infiniband/uverbs9
      path: /dev/infiniband/uverbs9
      permissions: rw
    - hostPath: /dev/infiniband/rdma_cm
      path: /dev/infiniband/rdma_cm
      permissions: rw
  name: "0000:04:01.1"
kind: spidernet.io/net-pci

What did you expect to happen?

What are the minimal steps needed to reproduce the bug?

Anything else we need to know?

https://github.com/k8snetworkplumbingwg/sriov-network-operator/issues/735

Component Versions

Please fill in the below table with the version numbers of components used.

Component Version
SR-IOV Network Device Plugin
SR-IOV CNI Plugin
Multus
Kubernetes
OS

Config Files

Config file locations may be config dependent.

Device pool config file location (Try '/etc/pcidp/config.json')
Multus config (Try '/etc/cni/multus/net.d')
CNI config (Try '/etc/cni/net.d/')
Kubernetes deployment type ( Bare Metal, Kubeadm etc.)
Kubeconfig file
SR-IOV Network Custom Resource Definition

Logs

SR-IOV Network Device Plugin Logs (use kubectl logs $PODNAME)
Multus logs (If enabled. Try '/var/log/multus.log' )
Kubelet logs (journalctl -u kubelet)
adrianchiris commented 1 month ago

Please see the discussion in [1].

[1] https://github.com/k8snetworkplumbingwg/sriov-network-operator/issues/735

souleb commented 1 month ago

When use-cdi is enabled, a cdiSpec is created for each resourcePool on every call to ListAndWatch of the gRPC server. The cdiSpec is then written to DefaultDynamicDir+cdiSpecPrefix+resourcePrefix, which expands to /var/run/cdi/sriov-dp-nvidia.com.yaml. The function that writes the cdiSpec does an atomic write, i.e. it writes to a temp file and then renames it to the target name. This conflicts with our desire to write all specs to the same file.
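
To illustrate the clobbering, here is a minimal Go sketch of the pattern described above. The helper names and the naming inputs (dynamicDir, specPrefix, resourcePrefix) are placeholders, not the plugin's actual identifiers: because the file name depends only on the resource prefix, every pool computes the same target path, and the rename step of the atomic write makes the last pool's spec win.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// specPath mirrors the behavior described above: the file name is derived only
// from the dynamic dir, spec prefix and resource prefix, so every resource
// pool maps to the same file (e.g. /var/run/cdi/sriov-dp-spidernet.io.yaml).
func specPath(dynamicDir, specPrefix, resourcePrefix string) string {
	return filepath.Join(dynamicDir, specPrefix+resourcePrefix+".yaml")
}

// writeSpecAtomic is the usual atomic-write pattern: write to a temp file in
// the same directory, then rename it over the target. Rename replaces the
// target, so a later pool's spec silently overwrites an earlier pool's spec.
func writeSpecAtomic(target string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(target), ".cdi-spec-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless once the rename has succeeded
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), target)
}

func main() {
	target := specPath("/var/run/cdi", "sriov-dp-", "spidernet.io")
	// Both pools compute the same target, so only the second spec survives.
	_ = writeSpecAtomic(target, []byte("# devices of rdma_resource\n"))
	_ = writeSpecAtomic(target, []byte("# devices of rdma_resource1\n"))
	fmt.Println("last write wins at", target)
}
```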

In order to fix this, we should either generate a unique file name for each resourcePool using GenerateNameForTransientSpec, or create a shared in-memory cache that handles writes to the single local file.
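
A minimal sketch of the first option, assuming a hypothetical naming scheme that folds the pool's resource name into the file name; the real fix might instead rely on GenerateNameForTransientSpec as suggested above.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// perPoolSpecPath folds the resource pool name into the spec file name so each
// pool writes its own CDI spec instead of racing over one shared file. The
// exact naming scheme here is only illustrative.
func perPoolSpecPath(dynamicDir, specPrefix, resourcePrefix, resourceName string) string {
	return filepath.Join(dynamicDir, fmt.Sprintf("%s%s-%s.yaml", specPrefix, resourcePrefix, resourceName))
}

func main() {
	fmt.Println(perPoolSpecPath("/var/run/cdi", "sriov-dp-", "spidernet.io", "rdma_resource"))
	fmt.Println(perPoolSpecPath("/var/run/cdi", "sriov-dp-", "spidernet.io", "rdma_resource1"))
	// Two distinct files, one per pool:
	//   /var/run/cdi/sriov-dp-spidernet.io-rdma_resource.yaml
	//   /var/run/cdi/sriov-dp-spidernet.io-rdma_resource1.yaml
}
```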

cyclinder commented 1 month ago

I'm interested in this; I can help fix it later.

adrianchiris commented 1 month ago

In order to fix this, we should either generate a unique file name for each resourcePool

What needs to be kept in mind is that the device plugin resource configuration may change; in that case, the "old" CDI files need to be deleted.
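
A hedged sketch of what such cleanup could look like, assuming the hypothetical per-pool naming from the sketch above; the real plugin would have to match whatever scheme it actually writes.

```go
package main

import (
	"os"
	"path/filepath"
	"strings"
)

// cleanupStaleSpecs removes per-pool CDI spec files whose resource pool is no
// longer present in the device plugin configuration. It assumes the
// illustrative naming scheme <specPrefix><resourcePrefix>-<resourceName>.yaml
// used above; adapt the matching to the real naming scheme.
func cleanupStaleSpecs(dynamicDir, specPrefix, resourcePrefix string, currentPools map[string]bool) error {
	entries, err := os.ReadDir(dynamicDir)
	if err != nil {
		return err
	}
	prefix := specPrefix + resourcePrefix + "-"
	for _, e := range entries {
		name := e.Name()
		if e.IsDir() || !strings.HasPrefix(name, prefix) || !strings.HasSuffix(name, ".yaml") {
			continue
		}
		pool := strings.TrimSuffix(strings.TrimPrefix(name, prefix), ".yaml")
		if currentPools[pool] {
			continue // pool still configured, keep its spec
		}
		if err := os.Remove(filepath.Join(dynamicDir, name)); err != nil && !os.IsNotExist(err) {
			return err
		}
	}
	return nil
}

func main() {
	// Example: only rdma_resource1 remains configured; the spec file written
	// earlier for the removed rdma_resource pool gets deleted.
	_ = cleanupStaleSpecs("/var/run/cdi", "sriov-dp-", "spidernet.io",
		map[string]bool{"rdma_resource1": true})
}
```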