intel / helm-charts

Apache License 2.0
Issue upgrading helm chart `intel-device-plugins-gpu` from `v0.27.1` to `v0.28.0` #41

Closed: onedr0p closed this issue 11 months ago

onedr0p commented 11 months ago

helm values

name: intel-device-plugin-gpu
sharedDevNum: 3
nodeFeatureRule: true

error

Invalid: NodeFeatureRule.nfd.k8s-sigs.io "intel-gpu-platform-labeling" is invalid: [spec.rules[1].labels: Invalid value: "null": spec.rules[1].labels in body must be of type object: "null", spec.rules[2].labels: Invalid value: "null": spec.rules[2].labels in body must be of type object: "null", spec.rules[3].labels: Invalid value: "null": spec.rules[3].labels in body must be of type object: "null"]
mitchross commented 11 months ago

Same issue here

tkatila commented 11 months ago

Thanks for reporting this. We'll look into it.

tkatila commented 11 months ago

I wasn't able to reproduce this.

@mitchross or @onedr0p, could you share some details about your environment? For example, the versions of NFD, Helm and k8s.

arthurgeek commented 11 months ago

Having the same issue. Here are the details of my environment: nfd 0.14.1, k8s (k3s) v1.28.2+k3s1, deployed with flux, using helm-controller:v0.36.1, image-automation-controller:v0.36.1, image-reflector-controller:v0.30.0, kustomize-controller:v1.1.0, notification-controller:v1.1.0 and source-controller:v1.1.1.

onedr0p commented 11 months ago

I'm using the same set of tools and versions as @arthurgeek

tkatila commented 11 months ago

Since the error message is about labels, I think the cause is the empty key here: https://github.com/intel/helm-charts/blob/main/charts/gpu-device-plugin/templates/gpu.yaml#L61 (and in the two other rules below it).

As I haven't been able to reproduce this, could any of you tweak a local copy of the chart to remove the empty "labels" keys and try the upgrade again?
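
For anyone trying this, here is a sketch of what one affected rule might look like with the empty key dropped. It is the rendered form of the `intel.gpu.generic.count.300` rule from the 0.28.0 output reported in this thread, minus `labels`. Treat it as an assumption about the shape of the change, not the actual fix; the chart template source may look different from this rendered output.

```yaml
# Sketch only: rendered NodeFeatureRule entry with the empty "labels" key removed.
# Values are taken from the 0.28.0 output in this thread; the real fix may differ.
- name: intel.gpu.generic.count.300
  # No "labels:" key here. A key with no value renders as "labels: null",
  # which the error above rejects because it is not of type object.
  labelsTemplate: 'gpu.intel.com/device-id.0300-{{ (index .pci.device 0).device }}.count={{ len .pci.device }}'
  matchFeatures:
  - feature: pci.device
    matchExpressions:
      class:
        op: In
        value:
        - '0300'
      vendor:
        op: In
        value:
        - '8086'
```

The same change would apply to the other rules named in the error message.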

arthurgeek commented 11 months ago

@tkatila I'll let others try this; unfortunately I don't have time early this week. But you're right, here's a kustomize build diff between versions 0.27.1 and 0.28.0:

--- kubernetes HelmRelease: kube-system/intel-device-plugin-gpu GpuDevicePlugin: kube-system/intel-gpu-plugin
+++ kubernetes HelmRelease: kube-system/intel-device-plugin-gpu GpuDevicePlugin: kube-system/intel-gpu-plugin
@@ -1,14 +1,13 @@
 ---
 apiVersion: deviceplugin.intel.com/v1
 kind: GpuDevicePlugin
 metadata:
   name: intel-gpu-plugin
 spec:
-  image: intel/intel-gpu-plugin:0.27.1
-  initImage: intel/intel-gpu-initcontainer:0.27.1
+  image: intel/intel-gpu-plugin:0.28.0
   logLevel: 2
   sharedDevNum: 3
   resourceManager: false
   enableMonitoring: true
   preferredAllocationPolicy: none
   nodeSelector:
--- kubernetes HelmRelease: kube-system/intel-device-plugin-gpu NodeFeatureRule: kube-system/intel-gpu-platform-labeling
+++ kubernetes HelmRelease: kube-system/intel-device-plugin-gpu NodeFeatureRule: kube-system/intel-gpu-platform-labeling
@@ -0,0 +1,206 @@
+---
+apiVersion: nfd.k8s-sigs.io/v1alpha1
+kind: NodeFeatureRule
+metadata:
+  name: intel-gpu-platform-labeling
+spec:
+  rules:
+  - extendedResources:
+      gpu.intel.com/millicores: '@local.label.gpu.intel.com/millicores'
+      gpu.intel.com/memory.max: '@local.label.gpu.intel.com/memory.max'
+      gpu.intel.com/tiles: '@local.label.gpu.intel.com/tiles'
+    matchFeatures:
+    - feature: local.label
+      matchExpressions:
+        gpu.intel.com/millicores:
+          op: Exists
+        gpu.intel.com/memory.max:
+          op: Exists
+        gpu.intel.com/tiles:
+          op: Exists
+    name: intel.gpu.fractionalresources
+  - labels: null
+    labelsTemplate: |
+      {{ range .pci.device }}gpu.intel.com/device-id.{{ .class }}-{{ .device }}.present=true
+      {{ end }}
+    matchFeatures:
+    - feature: pci.device
+      matchExpressions:
+        class:
+          op: In
+          value:
+          - '0300'
+          - 0380
+        vendor:
+          op: In
+          value:
+          - '8086'
+    name: intel.gpu.generic.deviceid
+  - labels: null
+    labelsTemplate: gpu.intel.com/device-id.0300-{{ (index .pci.device 0).device }}.count={{
+      len .pci.device }}
+    matchFeatures:
+    - feature: pci.device
+      matchExpressions:
+        class:
+          op: In
+          value:
+          - '0300'
+        vendor:
+          op: In
+          value:
+          - '8086'
+    name: intel.gpu.generic.count.300
+  - labels: null
+    labelsTemplate: gpu.intel.com/device-id.0380-{{ (index .pci.device 0).device }}.count={{
+      len .pci.device }}
+    matchFeatures:
+    - feature: pci.device
+      matchExpressions:
+        class:
+          op: In
+          value:
+          - 0380
+        vendor:
+          op: In
+          value:
+          - '8086'
+    name: intel.gpu.generic.count.380
+  - labels:
+      gpu.intel.com/product: Max_1100
+    labelsTemplate: gpu.intel.com/device.count={{ len .pci.device }}
+    matchFeatures:
+    - feature: pci.device
+      matchExpressions:
+        class:
+          op: In
+          value:
+          - 0380
+        vendor:
+          op: In
+          value:
+          - '8086'
+        device:
+          op: In
+          value:
+          - 0bda
+    name: intel.gpu.max.1100
+  - labels:
+      gpu.intel.com/product: Max_1550
+    labelsTemplate: gpu.intel.com/device.count={{ len .pci.device }}
+    matchFeatures:
+    - feature: pci.device
+      matchExpressions:
+        class:
+          op: In
+          value:
+          - 0380
+        vendor:
+          op: In
+          value:
+          - '8086'
+        device:
+          op: In
+          value:
+          - 0bd5
+    name: intel.gpu.max.1550
+  - labels:
+      gpu.intel.com/family: Max_Series
+    matchFeatures:
+    - feature: pci.device
+      matchExpressions:
+        class:
+          op: In
+          value:
+          - 0380
+        vendor:
+          op: In
+          value:
+          - '8086'
+        device:
+          op: In
+          value:
+          - 0bda
+          - 0bd5
+          - 0bd9
+          - 0bdb
+          - 0bd7
+          - 0bd6
+          - 0bd0
+    name: intel.gpu.max.series
+  - labels:
+      gpu.intel.com/family: Flex_Series
+      gpu.intel.com/product: Flex_170
+    labelsTemplate: gpu.intel.com/device.count={{ len .pci.device }}
+    matchFeatures:
+    - feature: pci.device
+      matchExpressions:
+        class:
+          op: In
+          value:
+          - 0380
+        vendor:
+          op: In
+          value:
+          - '8086'
+        device:
+          op: In
+          value:
+          - 56c0
+    name: intel.gpu.flex.170
+  - labels:
+      gpu.intel.com/family: Flex_Series
+      gpu.intel.com/product: Flex_140
+    labelsTemplate: gpu.intel.com/device.count={{ len .pci.device }}
+    matchFeatures:
+    - feature: pci.device
+      matchExpressions:
+        class:
+          op: In
+          value:
+          - 0380
+        vendor:
+          op: In
+          value:
+          - '8086'
+        device:
+          op: In
+          value:
+          - 56c1
+    name: intel.gpu.flex.140
+  - labels:
+      gpu.intel.com/family: A_Series
+    matchFeatures:
+    - feature: pci.device
+      matchExpressions:
+        class:
+          op: In
+          value:
+          - '0300'
+        vendor:
+          op: In
+          value:
+          - '8086'
+        device:
+          op: In
+          value:
+          - 56a6
+          - 56a5
+          - 56a1
+          - 56a0
+          - '5694'
+          - '5693'
+          - '5692'
+          - '5691'
+          - '5690'
+          - 56b3
+          - 56b2
+          - 56a4
+          - 56a3
+          - '5697'
+          - '5696'
+          - '5695'
+          - 56b1
+          - 56b0
+    name: intel.gpu.a.series
+
arthurgeek commented 11 months ago

@tkatila any plans for the new release, or any way we can test the PR code to ensure it works? I tried rolling back to 0.27 but it didn't work out.

binaryn3xus commented 11 months ago

I have this issue too. I had to roll back to 0.27.1 to get back up and running.

mythi commented 11 months ago

We haven't been able to reproduce this. @tkatila created a PR, so it would help if someone who sees the error could test it. If it fixes the error, we can get it released ASAP.

As a workaround, it's possible to install with nodeFeatureRule boolean set to false.
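
A minimal values sketch for that workaround, reusing the values from the original report with only `nodeFeatureRule` flipped. The assumption is that nothing else in your setup depends on the labels or extended resources that the NodeFeatureRule would otherwise create.

```yaml
# Workaround sketch: same values as in the original report,
# with the NodeFeatureRule disabled so the invalid object is not rendered.
name: intel-device-plugin-gpu
sharedDevNum: 3
nodeFeatureRule: false
```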

arthurgeek commented 11 months ago

Thanks @mythi. Do you maintain a helm chart repo for pull requests, or what is the best way to test that PR out? Sorry, I'm new to the k8s/helm world.

mythi commented 11 months ago

@arthurgeek the best way to help with testing is to git clone @tkatila's repo for that fix PR and helm package/install the gpu chart as a local package

mythi commented 11 months ago

hey, we merged a potential fix. please re-open this issue if the problem still appears with 0.28.1-helm.0.

onedr0p commented 11 months ago

Confirmed that the helm chart is working, thanks!