k8snetworkplumbingwg / sriov-network-device-plugin

SRIOV network device plugin for Kubernetes
Apache License 2.0

Capacity and Allocatable numbers show wrong values if sriov-network-device-plugin restarts #565

Open jslouisyou opened 3 weeks ago

jslouisyou commented 3 weeks ago

What happened?

The node's Capacity and Allocatable numbers show wrong values when sriov-network-device-plugin is restarted while any pods have SR-IOV IB VFs attached.

What did you expect to happen?

openshift.io/gpu_mlnx_ib# should stay at 8 for all VF resource pools.

What are the minimal steps needed to reproduce the bug?

  1. Deploy sriov-network-operator version v1.2.0
  2. Create a Pod or Deployment
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sriov-testing-deployment-h100
    spec:
      replicas: 6
      selector:
        matchLabels:
          app: sriov-testing-h100
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: '[{"name": "sriov-gpu2-ib0", "interface": "net1"},
              {"name": "sriov-gpu2-ib1", "interface": "net2"}, {"name": "sriov-gpu2-ib2",
              "interface": "net3"}, {"name": "sriov-gpu2-ib3", "interface": "net4"}, {"name":
              "sriov-gpu2-ib4", "interface": "net5"}]'
          labels:
            app: sriov-testing-h100
          name: sriov-testing-pod
        spec:
          containers:
          - command:
            - sh
            - -c
            - sleep inf
            image: mellanox/tcpdump-rdma:latest
            imagePullPolicy: Always
            name: tcpdump-rdma
            resources:
              limits:
                openshift.io/gpu2_mlnx_ib0: "1"
                openshift.io/gpu2_mlnx_ib1: "1"
                openshift.io/gpu2_mlnx_ib2: "1"
                openshift.io/gpu2_mlnx_ib3: "1"
                openshift.io/gpu2_mlnx_ib4: "1"
              requests:
                openshift.io/gpu2_mlnx_ib0: "1"
                openshift.io/gpu2_mlnx_ib1: "1"
                openshift.io/gpu2_mlnx_ib2: "1"
                openshift.io/gpu2_mlnx_ib3: "1"
                openshift.io/gpu2_mlnx_ib4: "1"
            securityContext:
              capabilities:
                add:
                - IPC_LOCK
  3. Roll out a restart of the sriov-device-plugin daemonset
    k rollout restart -n sriov-network-operator daemonset.apps/sriov-device-plugin
  4. Check whether Capacity and Allocatable show full capacity or not (see the example below)
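One way to do the check in step 4 is to filter the node status for the SR-IOV resource pools. This is a minimal sketch that assumes jq is installed and that your resource names use the openshift.io/ prefix from the deployment above; substitute your node name.

    # Show only the openshift.io/* entries from the node's Capacity and Allocatable
    kubectl get node <node-name> -o json | jq '.status.capacity    | with_entries(select(.key | startswith("openshift.io/")))'
    kubectl get node <node-name> -o json | jq '.status.allocatable | with_entries(select(.key | startswith("openshift.io/")))'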

Anything else we need to know?

Several issues were already raised and commits were pushed, but it seems this issue has not been fixed yet. xref: https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/issues/276, https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/issues/521

After restarting sriov-device-plugin, kubelet reports that sriov-device-plugin pushed its state, as shown below:

kubelet[1223475]: I0605 16:48:30.294908 1223475 manager.go:229] "Device plugin connected" resourceName="openshift.io/gpu_mlnx_ib0"
kubelet[1223475]: I0605 16:48:30.295508 1223475 client.go:91] "State pushed for device plugin" resource="openshift.io/gpu_mlnx_ib0" resourceCapacity=2
kubelet[1223475]: I0605 16:48:30.295721 1223475 http2_client.go:959] "[transport] [client-transport 0xc004920000] Closing: connection error: desc = \"error reading from server: read unix @->/var/lib/kubelet/plugins_registry/openshift.io_gpu_mlnx_ib0.sock: use of closed network connection\"\n"
kubelet[1223475]: I0605 16:48:30.298096 1223475 manager.go:278] "Processed device updates for resource" resourceName="openshift.io/gpu_mlnx_ib0" totalCount=2 healthyCount=2
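For reference, entries like the ones above can be followed live with a filter such as the following (a sketch assuming a systemd-managed kubelet; the grep patterns are taken from the messages quoted above):

    # Follow kubelet logs, keeping only the device-plugin registration/state messages
    journalctl -u kubelet -f | grep -E 'Device plugin connected|State pushed for device plugin|Processed device updates'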

Even after changing the image version of all components to latest, this issue still occurs.

I'm using A100 and H100 nodes.

Component Versions

Please fill in the below table with the version numbers of components used.

Component | Version
SR-IOV Network Device Plugin | latest (I've also tested v3.5.1 and v3.7.0)
SR-IOV CNI Plugin | latest (I've also tested sriovCni: v2.6.3 and ibSriovCni: v1.0.2)
Multus | v3.8
Kubernetes | v1.21.6, v1.28.3
OS | Ubuntu 20.04, 22.04

Config Files

Config file locations may be config dependent.

Device pool config file location (Try '/etc/pcidp/config.json')
Multus config (Try '/etc/cni/multus/net.d')
CNI config (Try '/etc/cni/net.d/')
Kubernetes deployment type ( Bare Metal, Kubeadm etc.)
Kubeconfig file
SR-IOV Network Custom Resource Definition

Logs

SR-IOV Network Device Plugin Logs (use kubectl logs $PODNAME)
Multus logs (If enabled. Try '/var/log/multus.log' )
Kubelet logs (journalctl -u kubelet)
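A minimal collection sketch for the items above, assuming the device plugin pods run in the sriov-network-operator namespace used in the rollout command earlier (the label selector is an assumption and may differ in your deployment):

    # SR-IOV network device plugin logs (label selector is an assumption)
    kubectl logs -n sriov-network-operator -l app=sriov-device-plugin --all-containers
    # Kubelet logs around the restart
    journalctl -u kubelet --since "1 hour ago" > kubelet.log
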
SchSeba commented 2 weeks ago

Hi @jslouisyou, can you share the device plugin configmap please?

jslouisyou commented 2 weeks ago

Hi @SchSeba, I found there are 2 configmaps in the sriov-network-operator namespace, named device-plugin-config and supported-nic-ids. Here are the contents.

* `supported-nic-ids`

apiVersion: v1
data:
  Broadcom_bnxt_BCM57414_2x25G: 14e4 16d7 16dc
  Broadcom_bnxt_BCM75508_2x100G: 14e4 1750 1806
  Intel_i40e_10G_X710_SFP: 8086 1572 154c
  Intel_i40e_25G_SFP28: 8086 158b 154c
  Intel_i40e_40G_XL710_QSFP: 8086 1583 154c
  Intel_i40e_XXV710: 8086 158a 154c
  Intel_i40e_XXV710_N3000: 8086 0d58 154c
  Intel_ice_Columbiaville_E810: 8086 1591 1889
  Intel_ice_Columbiaville_E810-CQDA2_2CQDA2: 8086 1592 1889
  Intel_ice_Columbiaville_E810-XXVDA2: 8086 159b 1889
  Intel_ice_Columbiaville_E810-XXVDA4: 8086 1593 1889
  Nvidia_mlx5_ConnectX-4: 15b3 1013 1014
  Nvidia_mlx5_ConnectX-4LX: 15b3 1015 1016
  Nvidia_mlx5_ConnectX-5: 15b3 1017 1018
  Nvidia_mlx5_ConnectX-5_Ex: 15b3 1019 101a
  Nvidia_mlx5_ConnectX-6: 15b3 101b 101c
  Nvidia_mlx5_ConnectX-6_Dx: 15b3 101d 101e
  Nvidia_mlx5_ConnectX-7: 15b3 1021 101e
  Nvidia_mlx5_MT42822_BlueField-2_integrated_ConnectX-6_Dx: 15b3 a2d6 101e
  Qlogic_qede_QL45000_50G: 1077 1654 1664
  Red_Hat_Virtio_network_device: 1af4 1000 1000
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: sriov-network-operator
    meta.helm.sh/release-namespace: sriov-network-operator
  creationTimestamp: "2024-06-05T05:29:22Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: supported-nic-ids
  namespace: sriov-network-operator
  resourceVersion: "10770"
  uid: 15d5826e-2e56-4094-8a60-1567beda154b
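For completeness, both configmaps mentioned above can be dumped with commands like the following (the configmap names are taken from the comment above):

    # Dump the two operator configmaps referenced in this comment
    kubectl get configmap -n sriov-network-operator device-plugin-config -o yaml
    kubectl get configmap -n sriov-network-operator supported-nic-ids -o yaml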