k8snetworkplumbingwg / sriov-network-operator

Operator for provisioning and configuring SR-IOV CNI plugin and device plugin
Apache License 2.0

sriov-network-metrics-exporter fails to deploy #766

Closed: ianb-mp closed 2 months ago

ianb-mp commented 2 months ago

I've enabled the new metricsExporter featureGate (https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/655); however, I see an error in the operator log: DaemonSet in version "v1" cannot be handled as a DaemonSet: json: cannot unmarshal bool into Go struct field PodSpec.spec.template.spec.nodeSelector of type string. Full error:

2024-08-28T00:52:03.971414176Z  ERROR   syncMetricsExporter     controllers/sriovoperatorconfig_controller.go:131       Couldn't sync metrics exporter objects  {"error": "failed to apply object &{map[apiVersion:apps/v1 kind:DaemonSet metadata:map[labels:map[app:sriov-network-metrics-exporter] name:sriov-network-metrics-exporter namespace:sriov-network-operator ownerReferences:[map[apiVersion:sriovnetwork.openshift.io/v1 blockOwnerDeletion:true controller:true kind:SriovOperatorConfig name:default uid:a131844f-34c3-4a41-a739-b1c51ff145d3]]] spec:map[selector:map[matchLabels:map[app:sriov-network-metrics-exporter]] template:map[metadata:map[labels:map[app:sriov-network-metrics-exporter]] spec:map[containers:[map[args:[--web.listen-address=127.0.0.1:9110 --path.kubecgroup=/sys/fs/cgroup --path.sysbuspci=/host/sys/bus/pci/devices/ --path.sysclassnet=/host/sys/class/net/ --path.cpucheckpoint=/host/cpu_manager_state --path.kubeletsocket=/host/kubelet.sock --collector.kubepoddevice=true --collector.vfstatspriority=netlink,sysfs] image:ghcr.io/k8snetworkplumbingwg/sriov-network-metrics-exporter:v1.1.0 imagePullPolicy:IfNotPresent name:metrics-exporter resources:map[requests:map[cpu:100m memory:100Mi]] securityContext:map[allowPrivilegeEscalation:false capabilities:map[drop:[ALL]] readOnlyRootFilesystem:true] volumeMounts:[map[mountPath:/host/kubelet.sock name:kubeletsocket] map[mountPath:/host/sys/bus/pci/devices name:sysbuspcidevices readOnly:true] map[mountPath:/host/sys/devices name:sysdevices readOnly:true] map[mountPath:/host/sys/class/net name:sysclassnet readOnly:true] map[mountPath:/host/cpu_manager_state name:cpucheckpoint readOnly:true]]] map[args:[--logtostderr --secure-listen-address=[$(HOST_IP)]:9110 --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 --upstream=http://127.0.0.1:9110/ 
--tls-private-key-file=/etc/metrics/tls.key --tls-cert-file=/etc/metrics/tls.crt] env:[map[name:HOST_IP valueFrom:map[fieldRef:map[fieldPath:status.hostIP]]]] image:gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0 imagePullPolicy:IfNotPresent name:kube-rbac-proxy ports:[map[containerPort:9110 name:https-metrics]] resources:map[requests:map[cpu:10m memory:20Mi]] volumeMounts:[map[mountPath:/etc/metrics name:metrics-certs readOnly:true]]]] hostNetwork:true nodeSelector:map[feature.node.kubernetes.io/network-sriov.capable:true] restartPolicy:Always serviceAccountName:metrics-exporter-sa volumes:[map[hostPath:map[path:/var/lib/kubelet/pod-resources/kubelet.sock type:Socket] name:kubeletsocket] map[hostPath:map[path:/var/lib/kubelet/cpu_manager_state type:File] name:cpucheckpoint] map[hostPath:map[path:/sys/class/net type:Directory] name:sysclassnet] map[hostPath:map[path:/sys/bus/pci/devices type:Directory] name:sysbuspcidevices] map[hostPath:map[path:/sys/devices type:Directory] name:sysdevices] map[name:metrics-certs secret:map[defaultMode:420 secretName:metrics-exporter-cert]]]]]]]} with err: could not create (apps/v1, Kind=DaemonSet) sriov-network-operator/sriov-network-metrics-exporter: DaemonSet in version \"v1\" cannot be handled as a DaemonSet: json: cannot unmarshal bool into Go struct field PodSpec.spec.template.spec.nodeSelector of type string"}

SriovOperatorConfig is:

apiVersion: v1
items:
- apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovOperatorConfig
  metadata:
    annotations:
      meta.helm.sh/release-name: sriov-network-operator
      meta.helm.sh/release-namespace: sriov-network-operator
    creationTimestamp: "2024-08-28T00:48:53Z"
    generation: 2
    labels:
      app.kubernetes.io/managed-by: Helm
    name: default
    namespace: sriov-network-operator
    resourceVersion: "20124753"
    uid: a131844f-34c3-4a41-a739-b1c51ff145d3
  spec:
    configDaemonNodeSelector:
      feature.node.kubernetes.io/network-sriov.capable: "true"
    configurationMode: daemon
    disableDrain: true
    enableInjector: false
    enableOperatorWebhook: false
    featureGates:
      metricsExporter: true
    logLevel: 1
kind: List
metadata:
  resourceVersion: ""

If I modify the node selector so the value is something other than "true" then the error goes away:

configDaemonNodeSelector:
  kubernetes.io/hostname: host1

However, I notice that the sriov-network-metrics-exporter pod then fails to start with the error:

Warning  FailedMount  52s (x11 over 7m3s)  kubelet            MountVolume.SetUp failed for volume "metrics-certs" : secret "metrics-exporter-cert" not found   

fyi @zeeke

ianb-mp commented 2 months ago

I realised the error about the missing secret metrics-exporter-cert occurs because the operator references that secret by default here. Is it mandatory to supply a certificate in this way? (I don't recall needing to do anything with certs when deploying the metrics exporter directly from the upstream repo's manifest.)

zeeke commented 2 months ago

Hi @ianb-mp, at the moment metrics are exported only over HTTPS, via a kube-rbac-proxy. If you are interested in making this optional and having the metrics available over plain HTTP, I can bring this topic to the next community meeting.

Regarding the error:

json: cannot unmarshal bool into Go struct field PodSpec.spec.template.spec.nodeSelector of type string

I confirm it is a bug; I will look for a fix.
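
For context, the decoding failure can be reproduced in isolation. The sketch below is a minimal illustration, not the operator's code: podSpec and decodeNodeSelector are hypothetical names, and the struct only mimics the map[string]string type of nodeSelector. It shows that a JSON boolean is rejected where a quoted string is accepted, which is consistent with the error disappearing once the selector value is something other than true.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podSpec is a hypothetical, trimmed-down stand-in for the Kubernetes PodSpec.
// Only nodeSelector is modelled, using the same map[string]string type.
type podSpec struct {
	NodeSelector map[string]string `json:"nodeSelector"`
}

// decodeNodeSelector tries to decode doc into podSpec and returns any
// decoding error.
func decodeNodeSelector(doc string) error {
	var spec podSpec
	return json.Unmarshal([]byte(doc), &spec)
}

func main() {
	const key = "feature.node.kubernetes.io/network-sriov.capable"

	// The value is a JSON boolean, as in the failing DaemonSet.
	asBool := fmt.Sprintf(`{"nodeSelector": {%q: true}}`, key)
	// The value is a JSON string, which is what nodeSelector expects.
	asString := fmt.Sprintf(`{"nodeSelector": {%q: "true"}}`, key)

	fmt.Println("bool value:  ", decodeNodeSelector(asBool))   // non-nil error
	fmt.Println("string value:", decodeNodeSelector(asString)) // <nil>
}
```

The exact wording of the error depends on the Go version and on the real struct path, so it will not match the operator log character for character. How the value loses its quotes before it reaches the API server is not established in this thread.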