Closed onedr0p closed 11 months ago
Same issue here
Thanks for reporting this. We'll look into it.
I wasn't able to reproduce this.
@mitchross or @onedr0p can you provide some details of the environment? For example versions for nfd, helm and k8s.
Having the same issue. Here are the details of my environment: nfd 0.14.1, k8s (k3s): v1.28.2+k3s1, deployed with flux, using: helm-controller:v0.36.1, image-automation-controller:v0.36.1, image-reflector-controller:v0.30.0, kustomize-controller:v1.1.0, notification-controller:v1.1.0 and source-controller:v1.1.1.
I'm using the same set of tools and versions as @arthurgeek
As the error message is about a label, I think it's caused by an empty "labels" key here: https://github.com/intel/helm-charts/blob/main/charts/gpu-device-plugin/templates/gpu.yaml#L61 (and in the two rules below it).
As I haven't been able to reproduce this, could any of you tweak a local copy of the chart, remove the empty "labels" keys, and try the upgrade again?
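For anyone trying this, here is a rough sketch of what one affected rule might look like once rendered with the empty "labels" key dropped. It is based on the `intel.gpu.generic.count.300` rule from the 0.28.0 render; the indentation is reconstructed and is an assumption, and this is not necessarily how the chart template itself should be changed.

```yaml
# Sketch only: one rendered NodeFeatureRule entry without the empty "labels" key.
# Indentation is reconstructed (an assumption), not copied from the chart.
- name: intel.gpu.generic.count.300
  # "labels: null" appeared here in the 0.28.0 render; only labelsTemplate is kept
  labelsTemplate: "gpu.intel.com/device-id.0300-{{ (index .pci.device 0).device }}.count={{ len .pci.device }}"
  matchFeatures:
    - feature: pci.device
      matchExpressions:
        class:
          op: In
          value:
            - '0300'
        vendor:
          op: In
          value:
            - '8086'
```

The same change would apply to the `intel.gpu.generic.deviceid` and `intel.gpu.generic.count.380` rules, which also render with `labels: null`.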
@tkatila I'll let others try this; unfortunately I don't have time early this week. But you're right, here's a kustomize build diff for versions 0.27.1 and 0.28.0:
--- kubernetes HelmRelease: kube-system/intel-device-plugin-gpu GpuDevicePlugin: kube-system/intel-gpu-plugin
+++ kubernetes HelmRelease: kube-system/intel-device-plugin-gpu GpuDevicePlugin: kube-system/intel-gpu-plugin
@@ -1,14 +1,13 @@
---
apiVersion: deviceplugin.intel.com/v1
kind: GpuDevicePlugin
metadata:
name: intel-gpu-plugin
spec:
- image: intel/intel-gpu-plugin:0.27.1
- initImage: intel/intel-gpu-initcontainer:0.27.1
+ image: intel/intel-gpu-plugin:0.28.0
logLevel: 2
sharedDevNum: 3
resourceManager: false
enableMonitoring: true
preferredAllocationPolicy: none
nodeSelector:
--- kubernetes HelmRelease: kube-system/intel-device-plugin-gpu NodeFeatureRule: kube-system/intel-gpu-platform-labeling
+++ kubernetes HelmRelease: kube-system/intel-device-plugin-gpu NodeFeatureRule: kube-system/intel-gpu-platform-labeling
@@ -0,0 +1,206 @@
+---
+apiVersion: nfd.k8s-sigs.io/v1alpha1
+kind: NodeFeatureRule
+metadata:
+ name: intel-gpu-platform-labeling
+spec:
+ rules:
+ - extendedResources:
+ gpu.intel.com/millicores: '@local.label.gpu.intel.com/millicores'
+ gpu.intel.com/memory.max: '@local.label.gpu.intel.com/memory.max'
+ gpu.intel.com/tiles: '@local.label.gpu.intel.com/tiles'
+ matchFeatures:
+ - feature: local.label
+ matchExpressions:
+ gpu.intel.com/millicores:
+ op: Exists
+ gpu.intel.com/memory.max:
+ op: Exists
+ gpu.intel.com/tiles:
+ op: Exists
+ name: intel.gpu.fractionalresources
+ - labels: null
+ labelsTemplate: |
+ {{ range .pci.device }}gpu.intel.com/device-id.{{ .class }}-{{ .device }}.present=true
+ {{ end }}
+ matchFeatures:
+ - feature: pci.device
+ matchExpressions:
+ class:
+ op: In
+ value:
+ - '0300'
+ - 0380
+ vendor:
+ op: In
+ value:
+ - '8086'
+ name: intel.gpu.generic.deviceid
+ - labels: null
+ labelsTemplate: gpu.intel.com/device-id.0300-{{ (index .pci.device 0).device }}.count={{
+ len .pci.device }}
+ matchFeatures:
+ - feature: pci.device
+ matchExpressions:
+ class:
+ op: In
+ value:
+ - '0300'
+ vendor:
+ op: In
+ value:
+ - '8086'
+ name: intel.gpu.generic.count.300
+ - labels: null
+ labelsTemplate: gpu.intel.com/device-id.0380-{{ (index .pci.device 0).device }}.count={{
+ len .pci.device }}
+ matchFeatures:
+ - feature: pci.device
+ matchExpressions:
+ class:
+ op: In
+ value:
+ - 0380
+ vendor:
+ op: In
+ value:
+ - '8086'
+ name: intel.gpu.generic.count.380
+ - labels:
+ gpu.intel.com/product: Max_1100
+ labelsTemplate: gpu.intel.com/device.count={{ len .pci.device }}
+ matchFeatures:
+ - feature: pci.device
+ matchExpressions:
+ class:
+ op: In
+ value:
+ - 0380
+ vendor:
+ op: In
+ value:
+ - '8086'
+ device:
+ op: In
+ value:
+ - 0bda
+ name: intel.gpu.max.1100
+ - labels:
+ gpu.intel.com/product: Max_1550
+ labelsTemplate: gpu.intel.com/device.count={{ len .pci.device }}
+ matchFeatures:
+ - feature: pci.device
+ matchExpressions:
+ class:
+ op: In
+ value:
+ - 0380
+ vendor:
+ op: In
+ value:
+ - '8086'
+ device:
+ op: In
+ value:
+ - 0bd5
+ name: intel.gpu.max.1550
+ - labels:
+ gpu.intel.com/family: Max_Series
+ matchFeatures:
+ - feature: pci.device
+ matchExpressions:
+ class:
+ op: In
+ value:
+ - 0380
+ vendor:
+ op: In
+ value:
+ - '8086'
+ device:
+ op: In
+ value:
+ - 0bda
+ - 0bd5
+ - 0bd9
+ - 0bdb
+ - 0bd7
+ - 0bd6
+ - 0bd0
+ name: intel.gpu.max.series
+ - labels:
+ gpu.intel.com/family: Flex_Series
+ gpu.intel.com/product: Flex_170
+ labelsTemplate: gpu.intel.com/device.count={{ len .pci.device }}
+ matchFeatures:
+ - feature: pci.device
+ matchExpressions:
+ class:
+ op: In
+ value:
+ - 0380
+ vendor:
+ op: In
+ value:
+ - '8086'
+ device:
+ op: In
+ value:
+ - 56c0
+ name: intel.gpu.flex.170
+ - labels:
+ gpu.intel.com/family: Flex_Series
+ gpu.intel.com/product: Flex_140
+ labelsTemplate: gpu.intel.com/device.count={{ len .pci.device }}
+ matchFeatures:
+ - feature: pci.device
+ matchExpressions:
+ class:
+ op: In
+ value:
+ - 0380
+ vendor:
+ op: In
+ value:
+ - '8086'
+ device:
+ op: In
+ value:
+ - 56c1
+ name: intel.gpu.flex.140
+ - labels:
+ gpu.intel.com/family: A_Series
+ matchFeatures:
+ - feature: pci.device
+ matchExpressions:
+ class:
+ op: In
+ value:
+ - '0300'
+ vendor:
+ op: In
+ value:
+ - '8086'
+ device:
+ op: In
+ value:
+ - 56a6
+ - 56a5
+ - 56a1
+ - 56a0
+ - '5694'
+ - '5693'
+ - '5692'
+ - '5691'
+ - '5690'
+ - 56b3
+ - 56b2
+ - 56a4
+ - 56a3
+ - '5697'
+ - '5696'
+ - '5695'
+ - 56b1
+ - 56b0
+ name: intel.gpu.a.series
+
@tkatila any plans for the new release, or any way we can test the PR code to ensure it works? I tried rolling back to 0.27 but it didn't work out.
I have this issue too. I had to roll back to 0.27.1 to get back up and running.
We haven't been able to reproduce this. @tkatila created a PR, so it would help if someone who sees the error could test it. If it fixes the error, we can get it released asap.
As a workaround, it's possible to install with the nodeFeatureRule boolean set to false.
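A minimal sketch of that workaround for a Flux setup like the ones described above, assuming the chart value is named `nodeFeatureRule` as stated and is set through the HelmRelease `values` block. Only the relevant fragment is shown; the rest of the HelmRelease stays as it is.

```yaml
# Fragment of the HelmRelease kube-system/intel-device-plugin-gpu (sketch, not a full manifest)
spec:
  values:
    # Assumption: this skips rendering the NodeFeatureRule that fails on upgrade
    nodeFeatureRule: false
```

With this set, the chart should not create the `intel-gpu-platform-labeling` NodeFeatureRule, so the labels and extended resources it would add to nodes will be missing until the flag is turned back on.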
thanks @mythi. do you maintain a helm chart repo for pull requests, or what is the best way to test that PR out? sorry, I'm new to k8s/helm world.
@arthurgeek the best way to help with testing is to git clone @tkatila's repo for that fix PR and helm package/install the gpu chart as a local package
hey, we merged a potential fix. please re-open this issue if the problem still appears with 0.28.1-helm.0.
Confirmed that helm chart is working, thanks!