k8snetworkplumbingwg / sriov-network-device-plugin

SRIOV network device plugin for Kubernetes
Apache License 2.0

Can't detect/add Mellanox ConnectX-6 VFs via the plugin on my Openshift (on Openstack) installation #572

Open nmcconom opened 4 months ago

nmcconom commented 4 months ago

What happened?

I have configured the plugin to look for the Mellanox ConnectX-6 VFs on my nodes. They are present and appear to be detected on the node, but they are never added to the resource pools.

What did you expect to happen?

VFs pulled into the respective pools so they can be used in my pods

What are the minimal steps needed to reproduce the bug?

Make Mellanox ConnectX-6 VFs available on one or more of your Openshift nodes and configure the plugin to find them.

Anything else we need to know?

lspci output from node

05:00.0 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
    Subsystem: Mellanox Technologies Device [15b3:0012]
    Physical Slot: 0-4
    Flags: bus master, fast devsel, latency 0
    Memory at fba00000 (64-bit, prefetchable) [size=1M]
    Capabilities: [60] Express Endpoint, MSI 00
    Capabilities: [9c] MSI-X: Enable+ Count=12 Masked-
    Capabilities: [100] Vendor Specific Information: ID=0000 Rev=0 Len=00c <?>
    Kernel driver in use: mlx5_core
    Kernel modules: mlx5_core

06:00.0 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
    Subsystem: Mellanox Technologies Device [15b3:0012]
    Physical Slot: 0-5
    Flags: bus master, fast devsel, latency 0
    Memory at fb800000 (64-bit, prefetchable) [size=1M]
    Capabilities: [60] Express Endpoint, MSI 00
    Capabilities: [9c] MSI-X: Enable+ Count=12 Masked-
    Capabilities: [100] Vendor Specific Information: ID=0000 Rev=0 Len=00c <?>
    Kernel driver in use: mlx5_core
    Kernel modules: mlx5_core

Component Versions

Please fill in the below table with the version numbers of components used.

| Component | Version |
|---|---|
| SR-IOV Network Device Plugin | 3.7.0 |
| SR-IOV CNI Plugin | Openshift 4.12.42 |
| Multus | Openshift 4.12.42 |
| Kubernetes | 1.25 |
| OS | Openshift 4.12/RHCOS 8.6 |

Config Files

ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
        "resourceList": [
            {
                "resourceName": "sriov_client_side",
                "resourcePrefix": "mellanox",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["netdevice"],
                    "pciAddresses": ["0000:00:05.0"]
                }
            },
            {
                "resourceName": "sriov_server_side",
                "resourcePrefix": "mellanox",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["netdevice"],
                    "pciAddresses": ["0000:00:06.0"]
                }
            }
        ]
    }

Device pool config file location (Try '/etc/pcidp/config.json')
Multus config (Try '/etc/cni/multus/net.d')

{"cniVersion":"0.4.0","name":"ovn-kubernetes","type":"ovn-k8s-cni-overlay","ipam":{},"dns":{},"logFile":"/var/log/ovn-kubernetes/ovn-k8s-cni-overlay.log","logLevel":"4","logfile-maxsize":100,"logfile-maxbackups":5,"logfile-maxage":5}

CNI config (Try '/etc/cni/net.d/')

{
    "cniVersion": "0.3.1",
    "name": "multus-cni-network",
    "type": "multus",
    "namespaceIsolation": true,
    "globalNamespaces": "default,openshift-multus,openshift-sriov-network-operator",
    "logLevel": "verbose",
    "binDir": "/opt/multus/bin",
    "readinessindicatorfile": "/var/run/multus/cni/net.d/10-ovn-kubernetes.conf",
    "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig",
    "delegates": [
        {"cniVersion":"0.4.0","name":"ovn-kubernetes","type":"ovn-k8s-cni-overlay","ipam":{},"dns":{},"logFile":"/var/log/ovn-kubernetes/ovn-k8s-cni-overlay.log","logLevel":"4","logfile-maxsize":100,"logfile-maxbackups":5,"logfile-maxage":5}
    ]
}

Kubernetes deployment type ( Bare Metal, Kubeadm etc.)

Openshift 4.12.42

Kubeconfig file
SR-IOV Network Custom Resource Definition

Logs

SR-IOV Network Device Plugin Logs (use kubectl logs $PODNAME)

I0710 11:28:06.727499 1 manager.go:57] Using Kubelet Plugin Registry Mode
I0710 11:28:06.727846 1 main.go:46] resource manager reading configs
I0710 11:28:06.727909 1 manager.go:86] raw ResourceList: {
"resourceList": [
    {
        "resourceName": "sriov_client_side",
        "resourcePrefix": "mellanox",
        "selectors": {
            "vendors": ["15b3"],
            "devices": ["101e"],
            "drivers": ["netdevice"],
            "pciAddresses": ["0000:00:05.0"]
        }
    },
    {
        "resourceName": "sriov_server_side",
        "resourcePrefix": "mellanox",
        "selectors": {
            "vendors": ["15b3"],
            "devices": ["101e"],
            "drivers": ["netdevice"],
            "pciAddresses": ["0000:00:06.0"]
        }
    }
  ]
}
I0710 11:28:06.728042 1 factory.go:211] *types.NetDeviceSelectors for resource sriov_client_side is [0xc00042a900]
I0710 11:28:06.728085 1 factory.go:211] *types.NetDeviceSelectors for resource sriov_server_side is [0xc00042ac60]
I0710 11:28:06.728092 1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix:mellanox ResourceName:sriov_client_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc000400eb8 AdditionalInfo:map[] SelectorObjs:[0xc00042a900]} {ResourcePrefix:mellanox ResourceName:sriov_server_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc000400ed0 AdditionalInfo:map[] SelectorObjs:[0xc00042ac60]}]
I0710 11:28:06.728152 1 manager.go:217] validating resource name "mellanox/sriov_client_side"
I0710 11:28:06.728203 1 manager.go:217] validating resource name "mellanox/sriov_server_side"
I0710 11:28:06.728210 1 main.go:62] Discovering host devices
I0710 11:28:06.845790 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:03:00.0 02 Red Hat, Inc. Virtio 1.0 network device
I0710 11:28:06.845883 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0710 11:28:06.845894 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:06:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0710 11:28:06.845901 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:03:00.0 02 Red Hat, Inc. Virtio 1.0 network device
I0710 11:28:06.845942 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0710 11:28:06.846313 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:06:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0710 11:28:06.846510 1 main.go:68] Initializing resource servers
I0710 11:28:06.846526 1 manager.go:117] number of config: 2
I0710 11:28:06.846544 1 manager.go:121] Creating new ResourcePool: sriov_client_side
I0710 11:28:06.846548 1 manager.go:122] DeviceType: netDevice
I0710 11:28:06.847037 1 manager.go:138] initServers(): selector index 0 will register 0 devices
I0710 11:28:06.847055 1 manager.go:142] no devices in device pool, skipping creating resource server for sriov_client_side
I0710 11:28:06.847061 1 manager.go:121] Creating new ResourcePool: sriov_server_side
I0710 11:28:06.847066 1 manager.go:122] DeviceType: netDevice
I0710 11:28:06.847495 1 manager.go:138] initServers(): selector index 0 will register 0 devices
I0710 11:28:06.847512 1 manager.go:142] no devices in device pool, skipping creating resource server for sriov_server_side
I0710 11:28:06.847518 1 main.go:74] Starting all servers...
I0710 11:28:06.847523 1 main.go:79] All servers started.
I0710 11:28:06.847529 1 main.go:80] Listening for term signals

Multus logs (If enabled. Try '/var/log/multus.log' )

2024-07-10T11:04:25+00:00 [cnibincopy] Successfully moved files in /host/opt/cni/bin/upgrade_f3bb1262-de44-46c1-8d11-2b04b60ac649 to /host/opt/cni/bin/
2024-07-10T11:04:25+00:00 WARN: {unknown parameter "-"}
2024-07-10T11:04:25+00:00 Entrypoint skipped copying Multus binary.
2024-07-10T11:04:25+00:00 Generating Multus configuration file using files in /host/var/run/multus/cni/net.d...
2024-07-10T11:04:25+00:00 Attempting to find master plugin configuration, attempt 0
2024-07-10T11:04:29+00:00 Using MASTER_PLUGIN: 10-ovn-kubernetes.conf
2024-07-10T11:04:29+00:00 Nested capabilities string:
2024-07-10T11:04:29+00:00 Using /host/var/run/multus/cni/net.d/10-ovn-kubernetes.conf as a source to generate the Multus configuration
2024-07-10T11:04:29+00:00 Config file created @ /host/etc/cni/net.d/00-multus.conf
{
    "cniVersion": "0.3.1",
    "name": "multus-cni-network",
    "type": "multus",
    "namespaceIsolation": true,
    "globalNamespaces": "default,openshift-multus,openshift-sriov-network-operator",
    "logLevel": "verbose",
    "binDir": "/opt/multus/bin",
    "readinessindicatorfile": "/var/run/multus/cni/net.d/10-ovn-kubernetes.conf",
    "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig",
    "delegates": [
        {"cniVersion":"0.4.0","name":"ovn-kubernetes","type":"ovn-k8s-cni-overlay","ipam":{},"dns":{},"logFile":"/var/log/ovn-kubernetes/ovn-k8s-cni-overlay.log","logLevel":"4","logfile-maxsize":100,"logfile-maxbackups":5,"logfile-maxage":5}
    ]
}
2024-07-10T11:04:29+00:00 Entering watch loop...

Kubelet logs (journalctl -u kubelet)

SchSeba commented 4 months ago

The PCI address in the config is not right.

Your config has "pciAddresses": ["0000:00:06.0"], but the device discovered by the device plugin is 0000:06:00.0.
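
The mismatch is easier to spot once both notations are expanded to the full domain:bus:device.function form. Below is a minimal Python sketch, assuming the short `lspci` form simply omits domain `0000`; the helper name `normalize_pci_address` is hypothetical and not part of the plugin.

```python
import re

# Matches [domain:]bus:device.function, all hexadecimal, as printed by lspci.
_PCI_RE = re.compile(
    r"^(?:(?P<domain>[0-9a-fA-F]{4}):)?"
    r"(?P<bus>[0-9a-fA-F]{2}):"
    r"(?P<device>[0-9a-fA-F]{2})\."
    r"(?P<function>[0-7])$"
)


def normalize_pci_address(address: str) -> str:
    """Return the address in full DDDD:BB:DD.F form (hypothetical helper)."""
    match = _PCI_RE.match(address.strip())
    if match is None:
        raise ValueError(f"not a PCI address: {address!r}")
    # Assumption: an address without a domain belongs to domain 0000.
    domain = match.group("domain") or "0000"
    return "{}:{}:{}.{}".format(
        domain.lower(),
        match.group("bus").lower(),
        match.group("device").lower(),
        match.group("function"),
    )
```

Applied to the `lspci` entries in the report, `06:00.0` expands to bus 06, device 00, whereas `0000:00:06.0` in the ConfigMap names bus 00, device 06. The same swap affects the `0000:00:05.0` entry.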

nmcconom commented 4 months ago

Hi, I corrected that error in the ConfigMap, but the end result was the same: 0 devices were added.

See the updated log output below.

I0715 12:48:59.624977       1 manager.go:57] Using Kubelet Plugin Registry Mode
I0715 12:48:59.626222       1 main.go:46] resource manager reading configs
I0715 12:48:59.626341       1 manager.go:86] raw ResourceList: {
"resourceList": [
    {
        "resourceName": "sriov_client_side",
        "resourcePrefix": "mellanox",
        "selectors": {
            "vendors": ["15b3"],
            "devices": ["101e"],
            "drivers": ["netdevice"],
            "pciAddresses": ["0000:05:00.0"]
        }
    },
    {
        "resourceName": "sriov_server_side",
        "resourcePrefix": "mellanox",
        "selectors": {
            "vendors": ["15b3"],
            "devices": ["101e"],
            "drivers": ["netdevice"],
            "pciAddresses": ["0000:06:00.0"]
        }
    }
  ]
}
I0715 12:48:59.626637       1 factory.go:211] *types.NetDeviceSelectors for resource sriov_client_side is [0xc00017c240]
I0715 12:48:59.626668       1 factory.go:211] *types.NetDeviceSelectors for resource sriov_server_side is [0xc00017c5a0]
I0715 12:48:59.626675       1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix:mellanox ResourceName:sriov_client_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc00012c330 AdditionalInfo:map[] SelectorObjs:[0xc00017c240]} {ResourcePrefix:mellanox ResourceName:sriov_server_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc00012c348 AdditionalInfo:map[] SelectorObjs:[0xc00017c5a0]}]
I0715 12:48:59.626862       1 manager.go:217] validating resource name "mellanox/sriov_client_side"
I0715 12:48:59.626893       1 manager.go:217] validating resource name "mellanox/sriov_server_side"
I0715 12:48:59.627022       1 main.go:62] Discovering host devices
I0715 12:48:59.726479       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:03:00.0 02              Red Hat, Inc.           Virtio 1.0 network device               
I0715 12:48:59.726578       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.0 02              Mellanox Technolo...    ConnectX Family mlx5Gen Virtual Function
I0715 12:48:59.727010       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:06:00.0 02              Mellanox Technolo...    ConnectX Family mlx5Gen Virtual Function
I0715 12:48:59.727205       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:03:00.0   02              Red Hat, Inc.           Virtio 1.0 network device               
I0715 12:48:59.727231       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.0   02              Mellanox Technolo...    ConnectX Family mlx5Gen Virtual Function
I0715 12:48:59.727237       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:06:00.0   02              Mellanox Technolo...    ConnectX Family mlx5Gen Virtual Function
I0715 12:48:59.727250       1 main.go:68] Initializing resource servers
I0715 12:48:59.727256       1 manager.go:117] number of config: 2
I0715 12:48:59.727267       1 manager.go:121] Creating new ResourcePool: sriov_client_side
I0715 12:48:59.727273       1 manager.go:122] DeviceType: netDevice
I0715 12:48:59.727797       1 manager.go:138] initServers(): selector index 0 will register 0 devices
I0715 12:48:59.727813       1 manager.go:142] no devices in device pool, skipping creating resource server for sriov_client_side
I0715 12:48:59.727819       1 manager.go:121] Creating new ResourcePool: sriov_server_side
I0715 12:48:59.727824       1 manager.go:122] DeviceType: netDevice
I0715 12:48:59.756721       1 manager.go:138] initServers(): selector index 0 will register 0 devices
I0715 12:48:59.756744       1 manager.go:142] no devices in device pool, skipping creating resource server for sriov_server_side
I0715 12:48:59.756750       1 main.go:74] Starting all servers...
I0715 12:48:59.756757       1 main.go:79] All servers started.
I0715 12:48:59.756762       1 main.go:80] Listening for term signals
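
When comparing several attempts like this, it can help to pull just the per-pool device counts out of the plugin log. The following is a minimal Python sketch, assuming the log keeps the `Creating new ResourcePool:` and `will register N devices` wording shown above; `devices_per_pool` is a hypothetical helper, not part of the plugin.

```python
import re
from typing import Dict, Optional

POOL_RE = re.compile(r"Creating new ResourcePool: (\S+)")
COUNT_RE = re.compile(r"will register (\d+) devices")


def devices_per_pool(log_text: str) -> Dict[str, int]:
    """Map each resource pool name to the device count logged after it."""
    counts: Dict[str, int] = {}
    current_pool: Optional[str] = None
    for line in log_text.splitlines():
        pool = POOL_RE.search(line)
        if pool:
            current_pool = pool.group(1)
            continue
        count = COUNT_RE.search(line)
        if count and current_pool is not None:
            # A pool may log one line per selector index, so add them up.
            counts[current_pool] = counts.get(current_pool, 0) + int(count.group(1))
    return counts
```

Feeding it the output of `kubectl logs $PODNAME` gives one number per pool, which makes it quick to see whether a ConfigMap change had any effect.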
SchSeba commented 4 months ago

One more step for a virtual environment: can you remove

"vendors": ["15b3"],
            "devices": ["101e"],
            "drivers": ["netdevice"],

from the ConfigMap? Please leave only the pciAddresses selector.

nmcconom commented 4 months ago

I tried that, but unfortunately with the same end result.

Logs for that attempt are below.

I0717 13:12:36.794382 1 manager.go:57] Using Kubelet Plugin Registry Mode
I0717 13:12:36.794710 1 main.go:46] resource manager reading configs
I0717 13:12:36.794782 1 manager.go:86] raw ResourceList: {
"resourceList": [
{
"resourceName": "sriov_client_side",
"resourcePrefix": "mellanox",
"selectors": {
"pciAddresses": ["0000:05:00.0"]
}
},
{
"resourceName": "sriov_server_side",
"resourcePrefix": "mellanox",
"selectors": {
"pciAddresses": ["0000:06:00.0"]
}
}
]
}
I0717 13:12:36.794930 1 factory.go:211] *types.NetDeviceSelectors for resource sriov_client_side is [0xc0004070e0]
I0717 13:12:36.794955 1 factory.go:211] *types.NetDeviceSelectors for resource sriov_server_side is [0xc000407440]
I0717 13:12:36.794962 1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix:mellanox ResourceName:sriov_client_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc0003f2ed0 AdditionalInfo:map[] SelectorObjs:[0xc0004070e0]} {ResourcePrefix:mellanox ResourceName:sriov_server_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc0003f2ee8 AdditionalInfo:map[] SelectorObjs:[0xc000407440]}]
I0717 13:12:36.795051 1 manager.go:217] validating resource name "mellanox/sriov_client_side"
I0717 13:12:36.795075 1 manager.go:217] validating resource name "mellanox/sriov_server_side"
I0717 13:12:36.795081 1 main.go:62] Discovering host devices
I0717 13:12:36.876613 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:03:00.0 02 Red Hat, Inc. Virtio 1.0 network device
I0717 13:12:36.876675 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 13:12:36.876683 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:06:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 13:12:36.876690 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:03:00.0 02 Red Hat, Inc. Virtio 1.0 network device
I0717 13:12:36.876747 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 13:12:36.877186 1 utils.go:494] excluding interface enp5s0: default route found: {Ifindex: 3 Dst: <nil> Src: 172.26.13.75 Gw: 172.26.13.1 Flags: [] Table: 254 Realm: 0}
I0717 13:12:36.877254 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:06:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 13:12:36.877429 1 utils.go:494] excluding interface enp6s0: default route found: {Ifindex: 4 Dst: <nil> Src: 172.26.14.175 Gw: 172.26.14.1 Flags: [] Table: 254 Realm: 0}
I0717 13:12:36.877455 1 main.go:68] Initializing resource servers
I0717 13:12:36.877463 1 manager.go:117] number of config: 2
I0717 13:12:36.877469 1 manager.go:121] Creating new ResourcePool: sriov_client_side
I0717 13:12:36.877487 1 manager.go:122] DeviceType: netDevice
I0717 13:12:36.877633 1 manager.go:138] initServers(): selector index 0 will register 0 devices
I0717 13:12:36.877649 1 manager.go:142] no devices in device pool, skipping creating resource server for sriov_client_side
I0717 13:12:36.877655 1 manager.go:121] Creating new ResourcePool: sriov_server_side
I0717 13:12:36.877659 1 manager.go:122] DeviceType: netDevice
I0717 13:12:36.877749 1 manager.go:138] initServers(): selector index 0 will register 0 devices
I0717 13:12:36.877762 1 manager.go:142] no devices in device pool, skipping creating resource server for sriov_server_side
I0717 13:12:36.877766 1 main.go:74] Starting all servers...
I0717 13:12:36.877772 1 main.go:79] All servers started.
I0717 13:12:36.877777 1 main.go:80] Listening for term signals

nmcconom commented 4 months ago

I noticed the line below, so I brought the interface down before restarting the device plugin pod.

I0717 13:12:36.877186 1 utils.go:494] excluding interface enp5s0: default route found: {Ifindex: 3 Dst: <nil> Src: 172.26.13.75 Gw: 172.26.13.1 Flags: [] Table: 254 Realm: 0}

That seems to allow it to discover them OK.

I0717 14:14:23.049116 1 manager.go:57] Using Kubelet Plugin Registry Mode
I0717 14:14:23.050546 1 main.go:46] resource manager reading configs
I0717 14:14:23.050650 1 manager.go:86] raw ResourceList: {
"resourceList": [
{
"resourceName": "sriov_client_side",
"resourcePrefix": "mellanox",
"selectors": {
"pciAddresses": ["0000:05:00.0"]
}
},
{
"resourceName": "sriov_internet_side",
"resourcePrefix": "mellanox",
"selectors": {
"pciAddresses": ["0000:06:00.0"]
}
}
]
}
I0717 14:14:23.051034 1 factory.go:211] *types.NetDeviceSelectors for resource sriov_client_side is [0xc0001d8240]
I0717 14:14:23.051117 1 factory.go:211] *types.NetDeviceSelectors for resource sriov_internet_side is [0xc0000df440]
I0717 14:14:23.051970 1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix:mellanox ResourceName:sriov_client_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc00019a330 AdditionalInfo:map[] SelectorObjs:[0xc0001d8240]} {ResourcePrefix:mellanox ResourceName:sriov_internet_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc00019a348 AdditionalInfo:map[] SelectorObjs:[0xc0000df440]}]
I0717 14:14:23.052090 1 manager.go:217] validating resource name "mellanox/sriov_client_side"
I0717 14:14:23.052128 1 manager.go:217] validating resource name "mellanox/sriov_internet_side"
I0717 14:14:23.052139 1 main.go:62] Discovering host devices
I0717 14:14:23.136162 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:03:00.0 02 Red Hat, Inc. Virtio 1.0 network device
I0717 14:14:23.136287 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 14:14:23.136791 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:06:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 14:14:23.137021 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:03:00.0 02 Red Hat, Inc. Virtio 1.0 network device
I0717 14:14:23.137054 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 14:14:23.137062 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:06:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 14:14:23.137081 1 main.go:68] Initializing resource servers
I0717 14:14:23.137088 1 manager.go:117] number of config: 2
I0717 14:14:23.137101 1 manager.go:121] Creating new ResourcePool: sriov_client_side
I0717 14:14:23.137106 1 manager.go:122] DeviceType: netDevice
I0717 14:14:23.137648 1 manager.go:138] initServers(): selector index 0 will register 1 devices
I0717 14:14:23.137683 1 factory.go:124] device added: [identifier: 0000:05:00.0, vendor: 15b3, device: 101e, driver: mlx5_core]
I0717 14:14:23.137722 1 manager.go:156] New resource server is created for sriov_client_side ResourcePool
I0717 14:14:23.137731 1 manager.go:121] Creating new ResourcePool: sriov_internet_side
I0717 14:14:23.137736 1 manager.go:122] DeviceType: netDevice
I0717 14:14:23.138214 1 manager.go:138] initServers(): selector index 0 will register 1 devices
I0717 14:14:23.138237 1 factory.go:124] device added: [identifier: 0000:06:00.0, vendor: 15b3, device: 101e, driver: mlx5_core]
I0717 14:14:23.138253 1 manager.go:156] New resource server is created for sriov_internet_side ResourcePool
I0717 14:14:23.138260 1 main.go:74] Starting all servers...
I0717 14:14:23.138492 1 server.go:255] starting sriov_client_side device plugin endpoint at: mellanox_sriov_client_side.sock
I0717 14:14:23.139287 1 server.go:297] sriov_client_side device plugin endpoint started serving
I0717 14:14:23.139413 1 server.go:255] starting sriov_internet_side device plugin endpoint at: mellanox_sriov_internet_side.sock
I0717 14:14:23.139732 1 server.go:297] sriov_internet_side device plugin endpoint started serving
I0717 14:14:23.139752 1 main.go:79] All servers started.
I0717 14:14:23.139759 1 main.go:80] Listening for term signals
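
The `excluding interface ... default route found` line suggests the plugin skips any device whose interface carries a default route. To check for that on a node before restarting the plugin pod, something like the following could be used. This is a sketch, not the plugin's implementation: it only looks at IPv4 routes in the `/proc/net/route` text format, and `interfaces_with_default_route` is a hypothetical name.

```python
from typing import Set


def interfaces_with_default_route(route_table: str) -> Set[str]:
    """Return interface names that own an IPv4 default route.

    `route_table` is the text of /proc/net/route: a header line followed by
    whitespace-separated rows (Iface, Destination, Gateway, Flags, RefCnt,
    Use, Metric, Mask, ...), with addresses written as hexadecimal strings.
    """
    interfaces: Set[str] = set()
    lines = route_table.strip().splitlines()
    for row in lines[1:]:  # skip the header line
        fields = row.split()
        if len(fields) < 8:
            continue
        iface, destination, mask = fields[0], fields[1], fields[7]
        # A default route has an all-zero destination and an all-zero mask.
        if destination == "00000000" and mask == "00000000":
            interfaces.add(iface)
    return interfaces
```

On the node, passing in the contents of `/proc/net/route` should list the interfaces that would be excluded; given the two exclusion lines in the earlier log, `enp5s0` and `enp6s0` would be expected to show up there while they are up.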
nmcconom commented 4 months ago

Any idea why it didn't like the more specific filters? We were able to use these with our Intel based cards.

nmcconom commented 4 months ago

I added the vendors and devices attributes back in and that also worked, so it seems it didn't like the netdevice driver.

For background on why we had used that: we use vfio-pci for our Intel cards, and the Openshift documentation had pointed us at setting netdevice for Mellanox cards.

I0717 16:42:30.242700 1 manager.go:86] raw ResourceList: {
"resourceList": [
{
"resourceName": "sriov_client_side",
"resourcePrefix": "mellanox",
"selectors": {
"vendors": ["15b3"],
"devices": ["101e"],
"pciAddresses": ["0000:05:00.0"]
}
},
{
"resourceName": "sriov_internet_side",
"resourcePrefix": "mellanox",
"selectors": {
"vendors": ["15b3"],
"devices": ["101e"],
"pciAddresses": ["0000:06:00.0"]
}
}
]
}
I0717 16:42:30.242817 1 factory.go:211] *types.NetDeviceSelectors for resource sriov_client_side is [0xc00052a900]
I0717 16:42:30.242846 1 factory.go:211] *types.NetDeviceSelectors for resource sriov_internet_side is [0xc00052ac60]
I0717 16:42:30.242852 1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix:mellanox ResourceName:sriov_client_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc000500e88 AdditionalInfo:map[] SelectorObjs:[0xc00052a900]} {ResourcePrefix:mellanox ResourceName:sriov_internet_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc000500ea0 AdditionalInfo:map[] SelectorObjs:[0xc00052ac60]}]
I0717 16:42:30.242942 1 manager.go:217] validating resource name "mellanox/sriov_client_side"
I0717 16:42:30.242967 1 manager.go:217] validating resource name "mellanox/sriov_internet_side"
I0717 16:42:30.242973 1 main.go:62] Discovering host devices
I0717 16:42:30.320299 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:03:00.0 02 Red Hat, Inc. Virtio 1.0 network device
I0717 16:42:30.320385 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 16:42:30.320394 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:06:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 16:42:30.320403 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:03:00.0 02 Red Hat, Inc. Virtio 1.0 network device
I0717 16:42:30.320463 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 16:42:30.320866 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:06:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 16:42:30.321290 1 main.go:68] Initializing resource servers
I0717 16:42:30.321316 1 manager.go:117] number of config: 2
I0717 16:42:30.321338 1 manager.go:121] Creating new ResourcePool: sriov_client_side
I0717 16:42:30.321347 1 manager.go:122] DeviceType: netDevice
I0717 16:42:30.322346 1 manager.go:138] initServers(): selector index 0 will register 1 devices
I0717 16:42:30.322390 1 factory.go:124] device added: [identifier: 0000:05:00.0, vendor: 15b3, device: 101e, driver: mlx5_core]
I0717 16:42:30.322444 1 manager.go:156] New resource server is created for sriov_client_side ResourcePool
I0717 16:42:30.322460 1 manager.go:121] Creating new ResourcePool: sriov_internet_side
I0717 16:42:30.322464 1 manager.go:122] DeviceType: netDevice
I0717 16:42:30.322978 1 manager.go:138] initServers(): selector index 0 will register 1 devices
I0717 16:42:30.323000 1 factory.go:124] device added: [identifier: 0000:06:00.0, vendor: 15b3, device: 101e, driver: mlx5_core]
I0717 16:42:30.323027 1 manager.go:156] New resource server is created for sriov_internet_side ResourcePool
I0717 16:42:30.323035 1 main.go:74] Starting all servers...
I0717 16:42:30.323324 1 server.go:255] starting sriov_client_side device plugin endpoint at: mellanox_sriov_client_side.sock
I0717 16:42:30.324284 1 server.go:297] sriov_client_side device plugin endpoint started serving
I0717 16:42:30.324699 1 server.go:255] starting sriov_internet_side device plugin endpoint at: mellanox_sriov_internet_side.sock
I0717 16:42:30.325092 1 server.go:297] sriov_internet_side device plugin endpoint started serving
I0717 16:42:30.325115 1 main.go:79] All servers started.
I0717 16:42:30.325123 1 main.go:80] Listening for term signals
I0717 16:42:30.780189 1 server.go:117] Plugin: mellanox_sriov_client_side.sock gets registered successfully at Kubelet
I0717 16:42:30.780439 1 server.go:117] Plugin: mellanox_sriov_internet_side.sock gets registered successfully at Kubelet
I0717 16:42:30.780571 1 server.go:158] ListAndWatch(sriov_client_side) invoked
I0717 16:42:30.780621 1 server.go:171] ListAndWatch(sriov_client_side): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:05:00.0,Health:Healthy,Topology:nil,},},}
I0717 16:42:30.780561 1 server.go:158] ListAndWatch(sriov_internet_side) invoked
I0717 16:42:30.780719 1 server.go:171] ListAndWatch(sriov_internet_side): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:06:00.0,Health:Healthy,Topology:nil,},},}
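
The `device added` lines report `driver: mlx5_core` for both VFs, so a `drivers` selector of `netdevice` would not equal the driver name the plugin sees. The Python sketch below models that behaviour under two assumptions drawn only from the logs in this thread: each selector is compared against the matching device attribute, and all selectors present must match. The names `PciDevice`, `SELECTOR_ATTRS` and `matches` are hypothetical and do not come from the plugin's source.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PciDevice:
    address: str
    vendor: str
    device: str
    driver: str


# Assumed mapping from a config.json selector key to the device attribute
# it is compared with (inferred from the logs, not from the plugin's code).
SELECTOR_ATTRS = {
    "vendors": "vendor",
    "devices": "device",
    "drivers": "driver",
    "pciAddresses": "address",
}


def matches(device: PciDevice, selectors: Dict[str, List[str]]) -> bool:
    """True when the device satisfies every selector present (logical AND)."""
    for key, allowed in selectors.items():
        attr = SELECTOR_ATTRS[key]
        # An empty selector list is treated as "no constraint".
        if allowed and getattr(device, attr) not in allowed:
            return False
    return True
```

With a device described as `PciDevice("0000:05:00.0", "15b3", "101e", "mlx5_core")`, a selector set containing `"drivers": ["netdevice"]` is rejected under this model, while the same set without `drivers` is accepted, which lines up with the 0 and 1 device counts seen across the attempts.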
SchSeba commented 2 months ago

That is because, in this case, the device plugin runs on a VM where only the VFs exist (and not the whole PF), so it's not a netdevice.

Please check the shiftonstack documentation. The Openshift documentation is for bare metal, where the VFs for Mellanox devices should be netdevice.

SchSeba commented 2 months ago

Let me know if I can close this issue :)

nmcconom commented 2 months ago

Yes please go ahead and close.

Thanks for the help