Open hymgg opened 6 months ago
Hi @hymgg, can you please run lspci,
find the virtual functions, and run lspci -vv -nn -mm -k -s <vf-pci-addr>?
Also, can you check that you didn't disable the iavf kernel module with a blacklist or something like that?
@SchSeba thanks for the followup, will reinstall the operator and check with lspci.
great I will wait for an update :)
@SchSeba Found iavf in a blacklist.conf, talking to lab team about this.
`
/etc/modprobe.d/anaconda-blacklist.conf:blacklist iavf

3b:0a.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
3b:0a.1 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
3b:0a.2 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
3b:0a.3 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
3b:0a.4 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
3b:0a.5 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
3b:0a.6 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
3b:0a.7 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)

Slot: 3b:0a.0
Class: Ethernet controller [0200]
Vendor: Intel Corporation [8086]
Device: Ethernet Virtual Function 700 Series [154c]
SVendor: Intel Corporation [8086]
SDevice: Device [0000]
Rev: 02
Module: iavf
NUMANode: 0
IOMMUGroup: 152

Slot: 3b:0a.1
Class: Ethernet controller [0200]
Vendor: Intel Corporation [8086]
Device: Ethernet Virtual Function 700 Series [154c]
SVendor: Intel Corporation [8086]
SDevice: Device [0000]
Rev: 02
Module: iavf
NUMANode: 0
IOMMUGroup: 153
`
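The blacklist check suggested above can be sketched as a small shell helper. This is a minimal sketch, not part of the operator: the function name `find_blacklisted` is hypothetical, and it only assumes that blacklist entries live under /etc/modprobe.d as in the output above.

```shell
# List modprobe.d entries that blacklist a given kernel module.
# Usage: find_blacklisted MODULE [DIR]   (DIR defaults to /etc/modprobe.d)
# find_blacklisted is a hypothetical helper name used for illustration only.
find_blacklisted() {
    module="$1"
    dir="${2:-/etc/modprobe.d}"
    # Match uncommented "blacklist <module>" lines; output is file:line, as with grep -r.
    grep -rE "^[[:space:]]*blacklist[[:space:]]+${module}[[:space:]]*$" "$dir" 2>/dev/null
}

# Exit status 0 means at least one blacklist entry was found.
if find_blacklisted iavf; then
    echo "iavf is blacklisted; remove the entry, then load the driver with: modprobe iavf"
else
    echo "no blacklist entry found for iavf"
fi
```

If the helper prints a match such as the anaconda-blacklist.conf line above, the VF driver will not load on its own until that entry is removed.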
Removed iavf from the blacklist. After reapplying the SriovNetworkNodePolicy, the pods and node stay alive, and the node's allocatable resource list has "openshift.io/ens1f1": "8", so it's good.
$ cat policy-ens1f1.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-ens1f1
  namespace: sriov-network-operator
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
    #feature.node.kubernetes.io/network-sriov.capable: "true"
  resourceName: ens1f1
  priority: 99
  #mtu: 9000
  numVfs: 8
  nicSelector:
    deviceID: "158a"
    rootDevices:
    - 0000:3b:00.1
    vendor: "8086"
  deviceType: netdevice
$ kubectl get node -l node-role.kubernetes.io/worker;kubectl --context dell4 get all -n sriov-network-operator
NAME                                STATUS   ROLES    AGE   VERSION
mtx-dell4-bld01.dc1.matrixxsw.com   Ready    worker   50d   v1.29.6

NAME                                          READY   STATUS    RESTARTS        AGE
pod/sriov-device-plugin-z7qxr                 1/1     Running   0               15s
pod/sriov-network-config-daemon-td8h8         1/1     Running   1 (7m25s ago)   4d20h
pod/sriov-network-operator-55dbb4c9df-q48f4   1/1     Running   0               4d20h

NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                            AGE
daemonset.apps/sriov-device-plugin           1         1         1       1            1           kubernetes.io/os=linux,node-role.kubernetes.io/worker=   19s
daemonset.apps/sriov-network-config-daemon   1         1         1       1            1           kubernetes.io/os=linux,node-role.kubernetes.io/worker=   4d20h

NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/sriov-network-operator   1/1     1            1           4d20h

NAME                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/sriov-network-operator-55dbb4c9df   1         1         1       4d20h
$ kubectl get no -o json | jq -r '[.items[] | {name:.metadata.name, allocable:.status.allocatable}]'
[
  {
    "name": "mtx-dell4-bld01.dc1.matrixxsw.com",
    "allocable": {
      "cpu": "64",
      "ephemeral-storage": "213255452729",
      "hugepages-1Gi": "0",
      "hugepages-2Mi": "0",
      "memory": "394187256Ki",
      "openshift.io/ens1f1": "8",
      "pods": "110"
    }
  },
...
Created a SriovNetwork sriovnetwork-ens1f1 using host-local IPAM and verified that a NetworkAttachmentDefinition with the same name was created automatically. I then created a pod with the annotation k8s.v1.cni.cncf.io/networks: sriovnetwork-ens1f1, and the pod started OK too.
$ cat sriovnetwork-ens1f1.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriovnetwork-ens1f1
  namespace: sriov-network-operator
spec:
  ipam: |
    {
      "type": "host-local",
      "subnet": "100.100.20.0/24",
      "rangeStart": "100.100.20.100",
      "rangeEnd": "100.100.20.200",
      "routes": [{
        "dst": "0.0.0.0/0"
      }],
      "gateway": "100.100.20.1"
    }
  vlan: 20
  resourceName: ens1f1
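The test pod itself is not shown in the thread. A minimal sketch of what such a pod could look like follows; the pod name test1 and its namespace are taken from the error message quoted later in this thread, while the container name, the image, and the explicit resource request are assumptions. The request for openshift.io/ens1f1 is typically needed when the network resources injector webhook is not enabled; with the injector running it may be added automatically.

```shell
# Write a test pod manifest; apply it with: kubectl apply -f test1-pod.yaml
cat > test1-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: test1
  namespace: sriov-network-operator
  annotations:
    # Attach the secondary network defined by the SriovNetwork above.
    k8s.v1.cni.cncf.io/networks: sriovnetwork-ens1f1
spec:
  containers:
  - name: test                          # assumption: container name is illustrative
    image: registry.k8s.io/pause:3.9    # assumption: any image will do for an attach test
    resources:
      # Extended resources must have equal requests and limits.
      requests:
        openshift.io/ens1f1: "1"
      limits:
        openshift.io/ens1f1: "1"
EOF
```

Once the pod is running, `kubectl describe pod test1 -n sriov-network-operator` should show the network status annotation for the attached VF.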
Next, 2 questions:
1.) Do we support whereabouts IPAM? Or what IPAM should we use so pods on the same SR-IOV network can talk to each other?
After the above success, I deleted the test pod and the SriovNetwork, changed its ipam from host-local to whereabouts, and recreated it, but the pod failed to create. Error from describe pod:
ERRORED: error configuring pod [sriov-network-operator/test1] networking: [sriov-network-operator/test1/44964362-090f-4ed3-aff6-21d42757a3aa:sriovnetwork-ens1f1]: error adding container to network "sriovnetwork-ens1f1": IPAM plugin returned missing IP config
2.) How do I create a SriovNetwork in a different namespace? I tried modifying the namespace in the above SriovNetwork yaml and applying it, but found nothing in the new ns.
Thanks. -Jessica
@SchSeba Could you guide us on the 2 questions above?
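On question 2, a likely explanation is that the operator only reconciles SriovNetwork objects in its own namespace, and the target namespace for the generated NetworkAttachmentDefinition is set through the `spec.networkNamespace` field. A minimal sketch, where `my-app` is a hypothetical namespace assumed to exist already:

```shell
# Write a SriovNetwork whose NetworkAttachmentDefinition lands in another namespace;
# apply it with: kubectl apply -f sriovnetwork-ens1f1-other-ns.yaml
cat > sriovnetwork-ens1f1-other-ns.yaml <<'EOF'
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriovnetwork-ens1f1
  namespace: sriov-network-operator   # the CR itself stays in the operator namespace
spec:
  networkNamespace: my-app            # hypothetical: where the NetworkAttachmentDefinition is created
  ipam: |
    {
      "type": "host-local",
      "subnet": "100.100.20.0/24",
      "rangeStart": "100.100.20.100",
      "rangeEnd": "100.100.20.200",
      "gateway": "100.100.20.1"
    }
  vlan: 20
  resourceName: ens1f1
EOF
```

After applying, `kubectl get network-attachment-definitions -n my-app` should list sriovnetwork-ens1f1, and pods in that namespace can reference it in their annotation.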
Continuing from issue #584,
@adrianchiris Sorry for the late followup.
Installing with helm was much easier than following the quick start steps. However, it only brought up the sriov-network-operator pod. According to the quick start guide, there should be a sriov-network-config-daemon too?
`
$ ls
Chart.yaml crds README.md templates values.yaml
$ helm3 install -n sriov-network-operator --create-namespace --wait sriov-network-operator ./
$ kubectl get all -n sriov-network-operator
NAME                                          READY   STATUS    RESTARTS   AGE
pod/sriov-network-operator-845dc5dffc-4hvsb   1/1     Running   0          20m

NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/sriov-network-operator   1/1     1            1           20m

NAME                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/sriov-network-operator-845dc5dffc   1         1         1       20m
$ kubectl logs deployment.apps/sriov-network-operator -n sriov-network-operator|tail -5
2024-03-29T05:02:53.668128868Z INFO controller/controller.go:119 default SriovOperatorConfig object not found, cannot reconcile SriovNetworkNodePolicies. Requeue. {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "ed902977-3a07-4cea-bb20-0cefbff5ea9e"}
2024-03-29T05:02:58.668612364Z INFO controller/controller.go:119 Reconciling {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "98591413-4718-4d3c-abaf-14d3dcf1c43c"}
2024-03-29T05:02:58.668676704Z INFO controller/controller.go:119 default SriovOperatorConfig object not found, cannot reconcile SriovNetworkNodePolicies. Requeue. {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "98591413-4718-4d3c-abaf-14d3dcf1c43c"}
2024-03-29T05:03:03.669236989Z INFO controller/controller.go:119 Reconciling {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "2a0835ad-a117-4caa-8ace-9afc525b6d70"}
2024-03-29T05:03:03.669309844Z INFO controller/controller.go:119 default SriovOperatorConfig object not found, cannot reconcile SriovNetworkNodePolicies. Requeue. {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "2a0835ad-a117-4caa-8ace-9afc525b6d70"}
Additional info, may not be relevant.
$ kubectl label ns sriov-network-operator pod-security.kubernetes.io/enforce=privileged
$ kubectl get node -l node-role.kubernetes.io/worker
NAME                                STATUS   ROLES    AGE    VERSION
mtx-dell4-bld01.dc1.matrixxsw.com   Ready    worker   264d   v1.26.6
mtx-dell4-bld02.dc1.matrixxsw.com   Ready    worker   264d   v1.26.6
mtx-dell4-bld03.dc1.matrixxsw.com   Ready    worker   264d   v1.26.6
`
Shall we / how do we get sriov-network-config-daemon installed? Thanks. -Jessica
Originally posted by @hymgg in https://github.com/k8snetworkplumbingwg/sriov-network-operator/issues/584#issuecomment-2026657454
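The operator log above says the default SriovOperatorConfig object was not found, which suggests the config daemon is not deployed until that object exists. A minimal sketch of creating it follows; the spec values are assumptions modeled on common defaults and should be checked against the chart version in use, since some chart versions may create this object themselves.

```shell
# Write a default SriovOperatorConfig;
# apply it with: kubectl apply -f sriov-operator-config.yaml
cat > sriov-operator-config.yaml <<'EOF'
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default                        # the log above looks for an object named "default"
  namespace: sriov-network-operator
spec:
  # Assumed values: webhooks left disabled, daemon scheduled on worker nodes.
  enableInjector: false
  enableOperatorWebhook: false
  configDaemonNodeSelector:
    node-role.kubernetes.io/worker: ""
  logLevel: 2
EOF
```

After applying, `kubectl get all -n sriov-network-operator` should eventually show a sriov-network-config-daemon daemonset and its pods on the selected nodes.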