k8snetworkplumbingwg / sriov-network-operator

Operator for provisioning and configuring SR-IOV CNI plugin and device plugin
Apache License 2.0

still need help install sriov-network-operator #672

Open hymgg opened 6 months ago

hymgg commented 6 months ago

Continuing from issue #584,

@adrianchiris Sorry for the late followup.

Installing with Helm was much easier than following the quick start steps. However, it only brought up the sriov-network-operator pod. According to the quick start guide, there should be a sriov-network-config-daemon too?

`
$ ls
Chart.yaml crds README.md templates values.yaml

$ helm3 install -n sriov-network-operator --create-namespace --wait sriov-network-operator ./

$ kubectl get all -n sriov-network-operator
NAME                                          READY   STATUS    RESTARTS   AGE
pod/sriov-network-operator-845dc5dffc-4hvsb   1/1     Running   0          20m

NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/sriov-network-operator   1/1     1            1           20m

NAME                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/sriov-network-operator-845dc5dffc   1         1         1       20m

$ kubectl logs deployment.apps/sriov-network-operator -n sriov-network-operator|tail -5
2024-03-29T05:02:53.668128868Z INFO controller/controller.go:119 default SriovOperatorConfig object not found, cannot reconcile SriovNetworkNodePolicies. Requeue. {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "ed902977-3a07-4cea-bb20-0cefbff5ea9e"}
2024-03-29T05:02:58.668612364Z INFO controller/controller.go:119 Reconciling {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "98591413-4718-4d3c-abaf-14d3dcf1c43c"}
2024-03-29T05:02:58.668676704Z INFO controller/controller.go:119 default SriovOperatorConfig object not found, cannot reconcile SriovNetworkNodePolicies. Requeue. {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "98591413-4718-4d3c-abaf-14d3dcf1c43c"}
2024-03-29T05:03:03.669236989Z INFO controller/controller.go:119 Reconciling {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "2a0835ad-a117-4caa-8ace-9afc525b6d70"}
2024-03-29T05:03:03.669309844Z INFO controller/controller.go:119 default SriovOperatorConfig object not found, cannot reconcile SriovNetworkNodePolicies. Requeue. {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "2a0835ad-a117-4caa-8ace-9afc525b6d70"}

Additional info, may not be relevant.

$ kubectl label ns sriov-network-operator pod-security.kubernetes.io/enforce=privileged
$ kubectl get node -l node-role.kubernetes.io/worker
NAME                                STATUS   ROLES    AGE    VERSION
mtx-dell4-bld01.dc1.matrixxsw.com   Ready    worker   264d   v1.26.6
mtx-dell4-bld02.dc1.matrixxsw.com   Ready    worker   264d   v1.26.6
mtx-dell4-bld03.dc1.matrixxsw.com   Ready    worker   264d   v1.26.6
`

Shall we / how do we get sriov-network-config-daemon installed? Thanks. -Jessica

Originally posted by @hymgg in https://github.com/k8snetworkplumbingwg/sriov-network-operator/issues/584#issuecomment-2026657454
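The repeated log message above says the operator cannot find a SriovOperatorConfig object named default. A minimal sketch of such an object follows, assuming the chart in use does not create it and the operator runs in the sriov-network-operator namespace. The spec values are illustrative assumptions and should be checked against the CRDs shipped in ./crds.

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  # The log message refers to the "default" object specifically.
  name: default
  # Assumption: same namespace the operator was installed into.
  namespace: sriov-network-operator
spec:
  # Illustrative values, not taken from the thread.
  enableInjector: false
  enableOperatorWebhook: false
  # Assumption: restrict the config daemon to worker nodes.
  configDaemonNodeSelector:
    node-role.kubernetes.io/worker: ""
  logLevel: 2
```

If this is the missing piece, applying it with kubectl apply -f should let the operator stop requeueing and bring up the sriov-network-config-daemon DaemonSet on the selected nodes.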

SchSeba commented 1 month ago

Hi @hymgg, can you please run lspci to find the virtual functions, then run lspci -vv -nn -mm -k -s <vf-pci-addr>? Also, can you check that you didn't disable the iavf kernel module with a blacklist or something like that?

hymgg commented 1 month ago

@SchSeba thanks for the followup, will reinstall the operator and check with lspci.

SchSeba commented 1 month ago

Great, I will wait for an update :)

hymgg commented 1 month ago

@SchSeba Found iavf in a blacklist.conf, talking to lab team about this.

`
grep iavf /etc/modprobe.d/*
/etc/modprobe.d/anaconda-blacklist.conf:blacklist iavf

lspci|grep "Virtual Function"
3b:0a.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
3b:0a.1 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
3b:0a.2 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
3b:0a.3 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
3b:0a.4 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
3b:0a.5 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
3b:0a.6 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
3b:0a.7 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)

lspci -vv -nn -mm -k -s 3b:0a.0
Slot: 3b:0a.0
Class: Ethernet controller [0200]
Vendor: Intel Corporation [8086]
Device: Ethernet Virtual Function 700 Series [154c]
SVendor: Intel Corporation [8086]
SDevice: Device [0000]
Rev: 02
Module: iavf
NUMANode: 0
IOMMUGroup: 152

lspci -vv -nn -mm -k -s 3b:0a.1
Slot: 3b:0a.1
Class: Ethernet controller [0200]
Vendor: Intel Corporation [8086]
Device: Ethernet Virtual Function 700 Series [154c]
SVendor: Intel Corporation [8086]
SDevice: Device [0000]
Rev: 02
Module: iavf
NUMANode: 0
IOMMUGroup: 153
`

hymgg commented 4 weeks ago

Removed iavf from the blacklist. After reapplying the SriovNetworkNodePolicy, the pods and node stay alive, and the node's allocatable resource list shows "openshift.io/ens1f1": "8", so it's good.

$ cat policy-ens1f1.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-ens1f1
  namespace: sriov-network-operator
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
    #feature.node.kubernetes.io/network-sriov.capable: "true"
  resourceName: ens1f1
  priority: 99
  #mtu: 9000
  numVfs: 8
  nicSelector:
      deviceID: "158a"
      rootDevices:
      - 0000:3b:00.1
      vendor: "8086"
  deviceType: netdevice

$ kubectl get node -l node-role.kubernetes.io/worker;kubectl --context dell4 get all -n sriov-network-operator
NAME                                STATUS   ROLES    AGE   VERSION
mtx-dell4-bld01.dc1.matrixxsw.com   Ready    worker   50d   v1.29.6
NAME                                          READY   STATUS    RESTARTS        AGE
pod/sriov-device-plugin-z7qxr                 1/1     Running   0               15s
pod/sriov-network-config-daemon-td8h8         1/1     Running   1 (7m25s ago)   4d20h
pod/sriov-network-operator-55dbb4c9df-q48f4   1/1     Running   0               4d20h

NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                            AGE
daemonset.apps/sriov-device-plugin           1         1         1       1            1           kubernetes.io/os=linux,node-role.kubernetes.io/worker=   19s
daemonset.apps/sriov-network-config-daemon   1         1         1       1            1           kubernetes.io/os=linux,node-role.kubernetes.io/worker=   4d20h

NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/sriov-network-operator   1/1     1            1           4d20h

NAME                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/sriov-network-operator-55dbb4c9df   1         1         1       4d20h

$ kubectl get no -o json | jq -r '[.items[] | {name:.metadata.name, allocable:.status.allocatable}]'
[
  {
    "name": "mtx-dell4-bld01.dc1.matrixxsw.com",
    "allocable": {
      "cpu": "64",
      "ephemeral-storage": "213255452729",
      "hugepages-1Gi": "0",
      "hugepages-2Mi": "0",
      "memory": "394187256Ki",
      "openshift.io/ens1f1": "8",
      "pods": "110"
    }
  },
...

Created a SriovNetwork named sriovnetwork-ens1f1 using host-local ipam and verified that a NetworkAttachmentDefinition with the same name was created automatically. Then I created a pod with the annotation k8s.v1.cni.cncf.io/networks: sriovnetwork-ens1f1, and the pod started OK too.
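The pod manifest itself is not shown in the thread. A minimal sketch of a test pod carrying that annotation might look like the following. The pod name, image, and command are hypothetical, and the explicit resource request is an assumption for clusters where no resource injector adds it automatically.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-sriov                      # hypothetical name
  namespace: sriov-network-operator     # same namespace as the NetworkAttachmentDefinition
  annotations:
    # Attach the secondary network created from the SriovNetwork.
    k8s.v1.cni.cncf.io/networks: sriovnetwork-ens1f1
spec:
  containers:
  - name: test
    image: busybox                      # hypothetical image
    command: ["sleep", "3600"]
    resources:
      # Assumption: request one VF from the pool advertised on the node
      # as openshift.io/ens1f1.
      requests:
        openshift.io/ens1f1: "1"
      limits:
        openshift.io/ens1f1: "1"
```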

$ cat sriovnetwork-ens1f1.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriovnetwork-ens1f1
  namespace: sriov-network-operator
spec:
  ipam: |
    {
      "type": "host-local",
      "subnet": "100.100.20.0/24",
      "rangeStart": "100.100.20.100",
      "rangeEnd": "100.100.20.200",
      "routes": [{
        "dst": "0.0.0.0/0"
      }],
      "gateway": "100.100.20.1"
    }
  vlan: 20
  resourceName: ens1f1

hymgg commented 4 weeks ago

Next, 2 questions:

1.) Do we support whereabouts ipam? Or what ipam should we use so pods on the same sriov network can talk to each other?

After the above success, I deleted the test pod and the SriovNetwork, changed its ipam from host-local to whereabouts, and recreated it. But the pod failed to create. Error from describe pod:

ERRORED: error configuring pod [sriov-network-operator/test1] networking: [sriov-network-operator/test1/44964362-090f-4ed3-aff6-21d42757a3aa:sriovnetwork-ens1f1]: error adding container to network "sriovnetwork-ens1f1": IPAM plugin returned missing IP config

2.) How do I create a SriovNetwork in a different namespace? I tried modifying the namespace in the above SriovNetwork yaml and applying it, but found nothing in the new ns.
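For readers with the same two questions, here is a sketch of how both changes might be expressed in one SriovNetwork. It is not a confirmed answer from the maintainers. It assumes that the spec field networkNamespace controls where the NetworkAttachmentDefinition is generated, and that whereabouts is installed on the nodes separately. The namespace my-app-ns is a hypothetical name.

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriovnetwork-ens1f1
  # The SriovNetwork object itself stays in the operator namespace.
  namespace: sriov-network-operator
spec:
  # Assumption: the NetworkAttachmentDefinition is generated in this
  # namespace, which should already exist. "my-app-ns" is hypothetical.
  networkNamespace: my-app-ns
  # Assumption: whereabouts is installed on the nodes. It takes "range"
  # (with optional range_start/range_end) where host-local takes "subnet".
  ipam: |
    {
      "type": "whereabouts",
      "range": "100.100.20.0/24",
      "range_start": "100.100.20.100",
      "range_end": "100.100.20.200",
      "gateway": "100.100.20.1"
    }
  vlan: 20
  resourceName: ens1f1
```

Pods in my-app-ns would then reference the network with the same k8s.v1.cni.cncf.io/networks annotation as before.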

Thanks. -Jessica

hymgg commented 2 weeks ago

@SchSeba Could you guide us on the 2 questions above?