hpe-storage / csi-driver

A Container Storage Interface (CSI) driver from HPE
https://scod.hpedev.io
Apache License 2.0

Talos Support #379

Open evilhamsterman opened 4 months ago

evilhamsterman commented 4 months ago

Talos is becoming more popular, but currently the csi-driver doesn't work with it. If we need to do manual configuration of things like iSCSI and multipath, we can do that by pushing values/files in the machine config. But the biggest hitch to me appears to be the requirement to create and mount /etc/hpe-storage on the host. That works on CoreOS but not on Talos, because basically the whole system is read-only.
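For context, the iSCSI tooling on Talos itself comes from a system extension rather than from files in the machine config; you typically bake it in with an Image Factory schematic along these lines (a sketch, using the extension name from the siderolabs/extensions repo discussed later in this thread):

```yaml
# Image Factory schematic (sketch) -- bakes the iSCSI tooling into the Talos image
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/iscsi-tools
```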

From what I can see that mount is needed to store a unique ID for the node. Couldn't you use the node's already existing unique ID and store node-specific data in ConfigMaps?
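To illustrate what I mean (names here are hypothetical, not anything the driver does today), the per-node state could live in something like:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hpe-node-state-worker-01   # hypothetical: one ConfigMap per Kubernetes node
  namespace: hpe-storage
data:
  node-id: "<node-uuid>"           # the identifier that today is written under /etc/hpe-storage on the host
```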

datamattsson commented 4 months ago

Hey, thanks for the interest! We've been kicking this around for a bit and I filed an internal JIRA to move the identifier to the Kubernetes control-plane instead. I've had some heated conversations with Andrew from the Talos project and I'm not 100% sure moving the identifier to Kubernetes will solve all our problems.

If you are an existing HPE customer or a prospect, you should work with your account team and mention this requirement. That is the fastest route.

evilhamsterman commented 4 months ago

I don't think moving the ID to the control plane would solve all the problems, but it's a start. Maybe at least make it possible to set the /etc/hpe-storage mount path so we can point it at Talos' ephemeral environment? It's possible with Kustomize, but that's an extra step. I do plan on talking with our account rep but wanted to get it on the board here.
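For reference, the Kustomize route I mean is roughly this (a sketch; the DaemonSet and volume names come from the rendered manifests, and the volume index is a placeholder that has to match the actual spec):

```yaml
# kustomization.yaml (sketch)
resources:
  - hpe-csi-k8s.yaml   # rendered chart manifests (assumed filename)
patches:
  - target:
      kind: DaemonSet
      name: hpe-csi-node
    patch: |-
      # repoint the etc-hpe-storage-dir hostPath somewhere writable on Talos
      - op: replace
        path: /spec/template/spec/volumes/6/hostPath/path   # index 6 is a placeholder
        value: /var/lib/hpe-storage
```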

datamattsson commented 4 months ago

Internal JIRA is CON-1838.

mikart143 commented 1 month ago

Hi, is there any news about support for Talos?

datamattsson commented 1 month ago

It did not make it into the 2.5.0 release. I was going to do some research on it but it got delayed.

evilhamsterman commented 1 month ago

I'm glad to hear that it is actively being pursued, at least. I will likely be deploying a new cluster in the relatively near future and it would be nice to be able to start with Talos.

evilhamsterman commented 2 weeks ago

I try not to be the one pinging for updates all the time. But I need to start deploying a bare metal Kubernetes cluster soon and I'm in a bit of a planning pickle. I'd really like to just start with Talos but can't because of the need to use Nimble for PVs. I can start with a kubeadm cluster and later migrate to Talos, but that would mean putting a bunch of effort into setting up deployment workflows that may just be abandoned shortly after. So I'm not sure how much effort I should invest in automation vs just rolling by hand for now, or using an alternative storage.

I can understand 2.5 is out of the picture; it looks like there are already betas for that. So is this planned to be included in 2.6, which based on previous release cadence we may see before EOY, or perhaps a 2.5.x release? Or is this planned for a longer timeframe, like next year? Just trying to get an idea to help with planning.

datamattsson commented 2 weeks ago

It's hard for me to gauge when we can get to a stage to support Talos and immutable nodes in general. It's very high on my list but I rarely get my way when large deals are on the table demanding feature X, Y and Z.

Also, full disclosure, we have not even scoped the next minor or patch release as we're neck-deep in stabilizing 2.5.0. I'll make a note and try to get it in for consideration in the next couple of releases.

If you want to email me directly at michael.mattsson at hpe.com with your company name and business relationship with HPE it will make it easier for me to talk to product management.

datamattsson commented 2 weeks ago

I don't have a Talos environment readily available, and skimming through the docs I realize I need firewall rules or to deploy a new deployment environment for Talos itself.

As a quick hack, can you tell me how far you get with this?

helm repo add datamattsson https://datamattsson.github.io/co-deployments/
helm repo update
helm install my-hpe-csi-driver -nhpe-storage datamattsson/hpe-csi-driver --version 2.5.0-talos --set disableNodeConfiguration=true
evilhamsterman commented 2 weeks ago

It looks like it is still mounting /etc/hpe-storage and causing failures due to the read-only filesystem.

Node YAML ```yaml apiVersion: v1 kind: Pod metadata: creationTimestamp: "2024-06-21T18:34:59Z" generateName: hpe-csi-node- labels: app: hpe-csi-node controller-revision-hash: 6cc9c89c6b pod-template-generation: "1" role: hpe-csi name: hpe-csi-node-tsvkz namespace: hpe-storage ownerReferences: - apiVersion: apps/v1 blockOwnerDeletion: true controller: true kind: DaemonSet name: hpe-csi-node uid: 280184d8-2211-44a8-9829-4d182242cb65 resourceVersion: "7099" uid: 29d72260-ce51-4e52-8050-f975d54eacbc spec: affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchFields: - key: metadata.name operator: In values: - talos-nvj-4af containers: - args: - --csi-address=$(ADDRESS) - --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH) - --v=5 env: - name: ADDRESS value: /csi/csi.sock - name: DRIVER_REG_SOCK_PATH value: /var/lib/kubelet/plugins/csi.hpe.com/csi.sock - name: KUBE_NODE_NAME valueFrom: fieldRef: apiVersion: v1 fieldPath: spec.nodeName image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.10.1 imagePullPolicy: IfNotPresent name: csi-node-driver-registrar resources: limits: cpu: "2" memory: 1Gi requests: cpu: 100m memory: 128Mi terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /csi name: plugin-dir - mountPath: /registration name: registration-dir - mountPath: /var/run/secrets/kubernetes.io/serviceaccount name: kube-api-access-lw749 readOnly: true - args: - --endpoint=$(CSI_ENDPOINT) - --node-service - --flavor=kubernetes - --node-monitor - --node-monitor-interval=30 env: - name: CSI_ENDPOINT value: unix:///csi/csi.sock - name: LOG_LEVEL value: info - name: NODE_NAME valueFrom: fieldRef: apiVersion: v1 fieldPath: spec.nodeName - name: DISABLE_NODE_CONFIGURATION value: "true" - name: KUBELET_ROOT_DIR value: /var/lib/kubelet image: quay.io/hpestorage/csi-driver:v2.5.0-beta imagePullPolicy: IfNotPresent name: hpe-csi-driver resources: limits: cpu: "2" memory: 1Gi requests: cpu: 100m memory: 128Mi securityContext: allowPrivilegeEscalation: true capabilities: add: - SYS_ADMIN privileged: true terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /csi name: plugin-dir - mountPath: /var/lib/kubelet mountPropagation: Bidirectional name: pods-mount-dir - mountPath: /host mountPropagation: Bidirectional name: root-dir - mountPath: /dev name: device-dir - mountPath: /var/log name: log-dir - mountPath: /etc/hpe-storage name: etc-hpe-storage-dir - mountPath: /etc/kubernetes name: etc-kubernetes - mountPath: /sys name: sys - mountPath: /run/systemd name: runsystemd - mountPath: /etc/systemd/system name: etcsystemd - mountPath: /opt/hpe-storage/nimbletune/config.json name: linux-config-file subPath: config.json - mountPath: /var/run/secrets/kubernetes.io/serviceaccount name: kube-api-access-lw749 readOnly: true dnsConfig: options: - name: ndots value: "1" dnsPolicy: ClusterFirstWithHostNet enableServiceLinks: true hostNetwork: true initContainers: - args: - --node-init - --endpoint=$(CSI_ENDPOINT) - --flavor=kubernetes env: - name: CSI_ENDPOINT value: unix:///csi/csi.sock image: quay.io/hpestorage/csi-driver:v2.5.0-beta imagePullPolicy: IfNotPresent name: hpe-csi-node-init resources: limits: cpu: "2" memory: 1Gi requests: cpu: 100m memory: 128Mi securityContext: allowPrivilegeEscalation: true capabilities: add: - SYS_ADMIN privileged: true terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /host 
mountPropagation: Bidirectional name: root-dir - mountPath: /dev name: device-dir - mountPath: /sys name: sys - mountPath: /etc/hpe-storage name: etc-hpe-storage-dir - mountPath: /run/systemd name: runsystemd - mountPath: /etc/systemd/system name: etcsystemd - mountPath: /csi name: plugin-dir - mountPath: /var/lib/kubelet name: pods-mount-dir - mountPath: /var/log name: log-dir - mountPath: /var/run/secrets/kubernetes.io/serviceaccount name: kube-api-access-lw749 readOnly: true nodeName: talos-nvj-4af preemptionPolicy: PreemptLowerPriority priority: 2000001000 priorityClassName: system-node-critical restartPolicy: Always schedulerName: default-scheduler securityContext: {} serviceAccount: hpe-csi-node-sa serviceAccountName: hpe-csi-node-sa terminationGracePeriodSeconds: 30 tolerations: - effect: NoSchedule key: csi.hpe.com/hpe-nfs operator: Exists - effect: NoExecute key: node.kubernetes.io/not-ready operator: Exists - effect: NoExecute key: node.kubernetes.io/unreachable operator: Exists - effect: NoSchedule key: node.kubernetes.io/disk-pressure operator: Exists - effect: NoSchedule key: node.kubernetes.io/memory-pressure operator: Exists - effect: NoSchedule key: node.kubernetes.io/pid-pressure operator: Exists - effect: NoSchedule key: node.kubernetes.io/unschedulable operator: Exists - effect: NoSchedule key: node.kubernetes.io/network-unavailable operator: Exists volumes: - hostPath: path: /var/lib/kubelet/plugins_registry type: Directory name: registration-dir - hostPath: path: /var/lib/kubelet/plugins/csi.hpe.com type: DirectoryOrCreate name: plugin-dir - hostPath: path: /var/lib/kubelet type: "" name: pods-mount-dir - hostPath: path: / type: "" name: root-dir - hostPath: path: /dev type: "" name: device-dir - hostPath: path: /var/log type: "" name: log-dir - hostPath: path: /etc/hpe-storage type: "" name: etc-hpe-storage-dir - hostPath: path: /etc/kubernetes type: "" name: etc-kubernetes - hostPath: path: /run/systemd type: "" name: runsystemd - hostPath: path: /etc/systemd/system type: "" name: etcsystemd - hostPath: path: /sys type: "" name: sys - configMap: defaultMode: 420 name: hpe-linux-config name: linux-config-file - name: kube-api-access-lw749 projected: defaultMode: 420 sources: - serviceAccountToken: expirationSeconds: 3607 path: token - configMap: items: - key: ca.crt path: ca.crt name: kube-root-ca.crt - downwardAPI: items: - fieldRef: apiVersion: v1 fieldPath: metadata.namespace path: namespace status: conditions: - lastProbeTime: null lastTransitionTime: "2024-06-21T18:35:00Z" status: "True" type: PodReadyToStartContainers - lastProbeTime: null lastTransitionTime: "2024-06-21T18:34:59Z" message: 'containers with incomplete status: [hpe-csi-node-init]' reason: ContainersNotInitialized status: "False" type: Initialized - lastProbeTime: null lastTransitionTime: "2024-06-21T18:34:59Z" message: 'containers with unready status: [csi-node-driver-registrar hpe-csi-driver]' reason: ContainersNotReady status: "False" type: Ready - lastProbeTime: null lastTransitionTime: "2024-06-21T18:34:59Z" message: 'containers with unready status: [csi-node-driver-registrar hpe-csi-driver]' reason: ContainersNotReady status: "False" type: ContainersReady - lastProbeTime: null lastTransitionTime: "2024-06-21T18:34:59Z" status: "True" type: PodScheduled containerStatuses: - image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.10.1 imageID: "" lastState: {} name: csi-node-driver-registrar ready: false restartCount: 0 started: false state: waiting: reason: PodInitializing - image: 
quay.io/hpestorage/csi-driver:v2.5.0-beta imageID: "" lastState: {} name: hpe-csi-driver ready: false restartCount: 0 started: false state: waiting: reason: PodInitializing hostIP: 10.100.155.236 hostIPs: - ip: 10.100.155.236 initContainerStatuses: - image: quay.io/hpestorage/csi-driver:v2.5.0-beta imageID: "" lastState: {} name: hpe-csi-node-init ready: false restartCount: 0 started: false state: waiting: message: 'failed to generate container "d1bfa53cdae544c0b62c5d36c001fc2f7270357ac5bcf01691257ae999dbc058" spec: failed to generate spec: failed to mkdir "/etc/hpe-storage": mkdir /etc/hpe-storage: read-only file system' reason: CreateContainerError phase: Pending podIP: 10.100.155.236 podIPs: - ip: 10.100.155.236 qosClass: Burstable startTime: "2024-06-21T18:34:59Z" ```
Controller YAML ```yaml apiVersion: v1 kind: Pod metadata: creationTimestamp: "2024-06-21T18:42:38Z" generateName: hpe-csi-controller-574bc6ccf9- labels: app: hpe-csi-controller pod-template-hash: 574bc6ccf9 role: hpe-csi name: hpe-csi-controller-574bc6ccf9-bzpb5 namespace: hpe-storage ownerReferences: - apiVersion: apps/v1 blockOwnerDeletion: true controller: true kind: ReplicaSet name: hpe-csi-controller-574bc6ccf9 uid: 5b8c38be-808a-4293-b80a-b7780843bc8b resourceVersion: "7487" uid: 727b54dc-e0b6-4281-b6a4-4cbd297a592f spec: containers: - args: - --csi-address=$(ADDRESS) - --v=5 - --extra-create-metadata - --timeout=30s - --worker-threads=16 - --feature-gates=Topology=true - --immediate-topology=false env: - name: ADDRESS value: /var/lib/csi/sockets/pluginproxy/csi.sock image: registry.k8s.io/sig-storage/csi-provisioner:v4.0.1 imagePullPolicy: IfNotPresent name: csi-provisioner resources: limits: cpu: "2" memory: 1Gi requests: cpu: 100m memory: 128Mi terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /var/lib/csi/sockets/pluginproxy name: socket-dir - mountPath: /var/run/secrets/kubernetes.io/serviceaccount name: kube-api-access-djh9g readOnly: true - args: - --v=5 - --csi-address=$(ADDRESS) env: - name: ADDRESS value: /var/lib/csi/sockets/pluginproxy/csi.sock image: registry.k8s.io/sig-storage/csi-attacher:v4.5.1 imagePullPolicy: IfNotPresent name: csi-attacher resources: limits: cpu: "2" memory: 1Gi requests: cpu: 100m memory: 128Mi terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /var/lib/csi/sockets/pluginproxy name: socket-dir - mountPath: /var/run/secrets/kubernetes.io/serviceaccount name: kube-api-access-djh9g readOnly: true - args: - --v=5 - --csi-address=$(ADDRESS) env: - name: ADDRESS value: /var/lib/csi/sockets/pluginproxy/csi.sock image: registry.k8s.io/sig-storage/csi-snapshotter:v7.0.2 imagePullPolicy: IfNotPresent name: csi-snapshotter resources: limits: cpu: "2" memory: 1Gi requests: cpu: 100m memory: 128Mi terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /var/lib/csi/sockets/pluginproxy/ name: socket-dir - mountPath: /var/run/secrets/kubernetes.io/serviceaccount name: kube-api-access-djh9g readOnly: true - args: - --csi-address=$(ADDRESS) - --v=5 env: - name: ADDRESS value: /var/lib/csi/sockets/pluginproxy/csi.sock image: registry.k8s.io/sig-storage/csi-resizer:v1.10.1 imagePullPolicy: IfNotPresent name: csi-resizer resources: limits: cpu: "2" memory: 1Gi requests: cpu: 100m memory: 128Mi terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /var/lib/csi/sockets/pluginproxy name: socket-dir - mountPath: /var/run/secrets/kubernetes.io/serviceaccount name: kube-api-access-djh9g readOnly: true - args: - --endpoint=$(CSI_ENDPOINT) - --flavor=kubernetes - --pod-monitor - --pod-monitor-interval=30 env: - name: CSI_ENDPOINT value: unix:///var/lib/csi/sockets/pluginproxy/csi.sock - name: LOG_LEVEL value: info image: quay.io/hpestorage/csi-driver:v2.5.0-beta imagePullPolicy: IfNotPresent name: hpe-csi-driver resources: limits: cpu: "2" memory: 1Gi requests: cpu: 100m memory: 128Mi terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /var/lib/csi/sockets/pluginproxy name: socket-dir - mountPath: /var/log name: log-dir - mountPath: /etc/kubernetes name: k8s - mountPath: /etc/hpe-storage name: hpeconfig - mountPath: /host 
name: root-dir - mountPath: /var/run/secrets/kubernetes.io/serviceaccount name: kube-api-access-djh9g readOnly: true - args: - --v=5 - --csi-address=$(ADDRESS) env: - name: ADDRESS value: /var/lib/csi/sockets/pluginproxy/csi-extensions.sock image: quay.io/hpestorage/volume-mutator:v1.3.6-beta imagePullPolicy: IfNotPresent name: csi-volume-mutator resources: limits: cpu: "2" memory: 1Gi requests: cpu: 100m memory: 128Mi terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /var/lib/csi/sockets/pluginproxy/ name: socket-dir - mountPath: /var/run/secrets/kubernetes.io/serviceaccount name: kube-api-access-djh9g readOnly: true - args: - --v=5 - --csi-address=$(ADDRESS) env: - name: ADDRESS value: /var/lib/csi/sockets/pluginproxy/csi-extensions.sock image: quay.io/hpestorage/volume-group-snapshotter:v1.0.6-beta imagePullPolicy: IfNotPresent name: csi-volume-group-snapshotter resources: limits: cpu: "2" memory: 1Gi requests: cpu: 100m memory: 128Mi terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /var/lib/csi/sockets/pluginproxy/ name: socket-dir - mountPath: /var/run/secrets/kubernetes.io/serviceaccount name: kube-api-access-djh9g readOnly: true - args: - --v=5 - --csi-address=$(ADDRESS) env: - name: ADDRESS value: /var/lib/csi/sockets/pluginproxy/csi-extensions.sock image: quay.io/hpestorage/volume-group-provisioner:v1.0.6-beta imagePullPolicy: IfNotPresent name: csi-volume-group-provisioner resources: limits: cpu: "2" memory: 1Gi requests: cpu: 100m memory: 128Mi terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /var/lib/csi/sockets/pluginproxy/ name: socket-dir - mountPath: /var/run/secrets/kubernetes.io/serviceaccount name: kube-api-access-djh9g readOnly: true - args: - --v=5 - --endpoint=$(CSI_ENDPOINT) env: - name: CSI_ENDPOINT value: unix:///var/lib/csi/sockets/pluginproxy/csi-extensions.sock - name: LOG_LEVEL value: info image: quay.io/hpestorage/csi-extensions:v1.2.7-beta imagePullPolicy: IfNotPresent name: csi-extensions resources: limits: cpu: "2" memory: 1Gi requests: cpu: 100m memory: 128Mi terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /var/lib/csi/sockets/pluginproxy/ name: socket-dir - mountPath: /var/run/secrets/kubernetes.io/serviceaccount name: kube-api-access-djh9g readOnly: true dnsConfig: options: - name: ndots value: "1" dnsPolicy: ClusterFirstWithHostNet enableServiceLinks: true hostNetwork: true nodeName: talos-nvj-4af preemptionPolicy: PreemptLowerPriority priority: 2000000000 priorityClassName: system-cluster-critical restartPolicy: Always schedulerName: default-scheduler securityContext: {} serviceAccount: hpe-csi-controller-sa serviceAccountName: hpe-csi-controller-sa terminationGracePeriodSeconds: 30 tolerations: - effect: NoExecute key: node.kubernetes.io/not-ready operator: Exists tolerationSeconds: 300 - effect: NoExecute key: node.kubernetes.io/unreachable operator: Exists tolerationSeconds: 300 volumes: - emptyDir: {} name: socket-dir - hostPath: path: /var/log type: "" name: log-dir - hostPath: path: /etc/kubernetes type: "" name: k8s - hostPath: path: /etc/hpe-storage type: "" name: hpeconfig - hostPath: path: / type: "" name: root-dir - name: kube-api-access-djh9g projected: defaultMode: 420 sources: - serviceAccountToken: expirationSeconds: 3607 path: token - configMap: items: - key: ca.crt path: ca.crt name: kube-root-ca.crt - downwardAPI: items: - 
fieldRef: apiVersion: v1 fieldPath: metadata.namespace path: namespace status: conditions: - lastProbeTime: null lastTransitionTime: "2024-06-21T18:42:41Z" status: "True" type: PodReadyToStartContainers - lastProbeTime: null lastTransitionTime: "2024-06-21T18:42:39Z" status: "True" type: Initialized - lastProbeTime: null lastTransitionTime: "2024-06-21T18:42:39Z" message: 'containers with unready status: [hpe-csi-driver]' reason: ContainersNotReady status: "False" type: Ready - lastProbeTime: null lastTransitionTime: "2024-06-21T18:42:39Z" message: 'containers with unready status: [hpe-csi-driver]' reason: ContainersNotReady status: "False" type: ContainersReady - lastProbeTime: null lastTransitionTime: "2024-06-21T18:42:39Z" status: "True" type: PodScheduled containerStatuses: - containerID: containerd://69d077a9414b9f622fffc550c68bf651c4ede0fc41ef85279347c363049f4f54 image: registry.k8s.io/sig-storage/csi-attacher:v4.5.1 imageID: registry.k8s.io/sig-storage/csi-attacher@sha256:9dcd469f02bbb7592ad61b0f848ec242f9ea2102187a0cd8407df33c2d633e9c lastState: terminated: containerID: containerd://dd726dccda8c6a3774e1e96060d9b1529dfebbed83667ea76e6fd85c0b995b0b exitCode: 1 finishedAt: "2024-06-21T18:43:41Z" reason: Error startedAt: "2024-06-21T18:43:10Z" name: csi-attacher ready: true restartCount: 2 started: true state: running: startedAt: "2024-06-21T18:43:56Z" - containerID: containerd://ab469a79652fd7d894e15f93528d2d92a03fa867c80f818c2575eee3ce530652 image: quay.io/hpestorage/csi-extensions:v1.2.7-beta imageID: quay.io/hpestorage/csi-extensions@sha256:106637da1dad32a0ffda17f3110f5d396cc6b03ed2af63b4c5260c8ed02b1314 lastState: {} name: csi-extensions ready: true restartCount: 0 started: true state: running: startedAt: "2024-06-21T18:42:40Z" - containerID: containerd://b0a981a751c5143813a4db0b53bb2c2243312136e543a4d65b952dc61b84f5c1 image: registry.k8s.io/sig-storage/csi-provisioner:v4.0.1 imageID: registry.k8s.io/sig-storage/csi-provisioner@sha256:bf5a235b67d8aea00f5b8ec24d384a2480e1017d5458d8a63b361e9eeb1608a9 lastState: terminated: containerID: containerd://443bb47ec3cb0e9093342377d75f7812422e7f62bf4d0ce5d22757c42052dc15 exitCode: 1 finishedAt: "2024-06-21T18:43:40Z" reason: Error startedAt: "2024-06-21T18:43:10Z" name: csi-provisioner ready: true restartCount: 2 started: true state: running: startedAt: "2024-06-21T18:43:56Z" - containerID: containerd://bbaafc9a87726f13ca40e7b7f1973e4473dd8ce94a78c9a67ce05b7205e88553 image: registry.k8s.io/sig-storage/csi-resizer:v1.10.1 imageID: registry.k8s.io/sig-storage/csi-resizer@sha256:4ecda2818f6d88a8f217babd459fdac31588f85581aa95ac7092bb0471ff8541 lastState: terminated: containerID: containerd://9d299091598a0a53213d7a92321f4ef5fc9fff1d1f88beba87b62fe35c7b7639 exitCode: 1 finishedAt: "2024-06-21T18:43:41Z" reason: Error startedAt: "2024-06-21T18:43:11Z" name: csi-resizer ready: true restartCount: 2 started: true state: running: startedAt: "2024-06-21T18:43:56Z" - containerID: containerd://fe753c0bc4d762a861c59f5d557c4152e6bf85bb5495fb336e3e8a8ce57bf5e4 image: registry.k8s.io/sig-storage/csi-snapshotter:v7.0.2 imageID: registry.k8s.io/sig-storage/csi-snapshotter@sha256:c4b6b02737bc24906fcce57fe6626d1a36cb2b91baa971af2a5e5a919093c34e lastState: terminated: containerID: containerd://ec7b4e064f648cfd70c882b81a601db820d1eaf483f30867bcaaf93347d26879 exitCode: 1 finishedAt: "2024-06-21T18:43:41Z" reason: Error startedAt: "2024-06-21T18:43:11Z" name: csi-snapshotter ready: true restartCount: 2 started: true state: running: startedAt: "2024-06-21T18:43:56Z" - 
containerID: containerd://1eb157a1a0fe1ebf3bc26ac8a6d7ee1a729fc1e1f7b04edc78664ea1294ceff0 image: quay.io/hpestorage/volume-group-provisioner:v1.0.6-beta imageID: quay.io/hpestorage/volume-group-provisioner@sha256:8d1ee0f752271148c019bc6ff2db53fdbfb56dfce3ede2e8f1549952becfeb05 lastState: {} name: csi-volume-group-provisioner ready: true restartCount: 0 started: true state: running: startedAt: "2024-06-21T18:42:40Z" - containerID: containerd://42fec0266f3669000d461c690cc2c0fd74e7d8a5c0f0093a5b591c82fc3b6612 image: quay.io/hpestorage/volume-group-snapshotter:v1.0.6-beta imageID: quay.io/hpestorage/volume-group-snapshotter@sha256:9be38de0f93f6b4ce7d0456eaabf5da3890b094a89a7b811852d31fbaf76c79c lastState: {} name: csi-volume-group-snapshotter ready: true restartCount: 0 started: true state: running: startedAt: "2024-06-21T18:42:40Z" - containerID: containerd://ae0dce20062d444aa8a124fe753bcc200c1b8008a3a4ef800e7b4500fc73b861 image: quay.io/hpestorage/volume-mutator:v1.3.6-beta imageID: quay.io/hpestorage/volume-mutator@sha256:247153bb789805c272b76fd8018ccd0f8bf4eabded5d4baf362d8a2c162b8672 lastState: {} name: csi-volume-mutator ready: true restartCount: 0 started: true state: running: startedAt: "2024-06-21T18:42:40Z" - image: quay.io/hpestorage/csi-driver:v2.5.0-beta imageID: "" lastState: {} name: hpe-csi-driver ready: false restartCount: 0 started: false state: waiting: message: 'failed to generate container "ee110797b0f68f31aa64c448b04f663590359bc4181a08be4f764f4dd599941f" spec: failed to generate spec: failed to mkdir "/etc/hpe-storage": mkdir /etc/hpe-storage: read-only file system' reason: CreateContainerError hostIP: 10.100.155.236 hostIPs: - ip: 10.100.155.236 phase: Pending podIP: 10.100.155.236 podIPs: - ip: 10.100.155.236 qosClass: Burstable startTime: "2024-06-21T18:42:39Z" ```
datamattsson commented 2 weeks ago

Ok, I had a brain fart, try now.

helm uninstall my-hpe-csi-driver -nhpe-storage
helm repo update
helm install my-hpe-csi-driver -nhpe-storage datamattsson/hpe-csi-driver --version 2.5.0-talos2 --set disableNodeConfiguration=true
evilhamsterman commented 2 weeks ago

Getting closer: the controller started fine, but the hpe-csi-node DaemonSet pod is still trying to mount /etc/systemd/system

Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  2m2s                default-scheduler  Successfully assigned hpe-storage/hpe-csi-node-qv9xk to talos-nvj-4af
  Warning  Failed     2m2s                kubelet            Error: failed to generate container "28b5218a6cea8f05806ec4210312762aa45cc1a851befe51d3e231bb6ff95fa2" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
  Warning  Failed     2m2s                kubelet            Error: failed to generate container "5f4e20edc65d2a0990d99c0b5da15cf61f3c0273d577f1bacacbbcc49bf77ff5" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
  Warning  Failed     108s                kubelet            Error: failed to generate container "1a59382b30bcc28fca08f6b48cf9ccce5adee2d003634ab00a59c9d470ad0a3c" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
  Warning  Failed     97s                 kubelet            Error: failed to generate container "bdcdb9ac2dac778320a6f1fccfa7e0198ceb9f62cce3ab03ca59b7f061442133" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
  Warning  Failed     85s                 kubelet            Error: failed to generate container "97701cc024c101137235529d83b03f1461e1dd97e48c543ac5d72474362e739d" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
  Warning  Failed     74s                 kubelet            Error: failed to generate container "3176d754668a42fc845d93ef4ca8b116bd59f67ec35983626e9901f70099b219" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
  Warning  Failed     61s                 kubelet            Error: failed to generate container "e0d1cee086f4f574cf0e9eee92da6ba94dbaa359990e92068ec6926dd8e16d03" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
  Warning  Failed     46s                 kubelet            Error: failed to generate container "f6ab37bf1edc712984ff69f9f5da848a5eb6e4cf1bec0efa8cc697cc4f776e8b" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
  Warning  Failed     34s                 kubelet            Error: failed to generate container "c641f39c98980fccbac986b9c4bf7d35b2b226fc70fc12e71c54dc50b672bd77" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
  Normal   Pulled     7s (x11 over 2m2s)  kubelet            Container image "quay.io/hpestorage/csi-driver:v2.5.0-beta" already present on machine
  Warning  Failed     7s (x2 over 21s)    kubelet            (combined from similar events): Error: failed to generate container "6d45577bdd7ca1971a3eba9b3c110ea41001ed5d08cdfb91792fac458da31a37" spec: failed to generate spec: failed to mkdir "/etc/systemd/system": mkdir /etc/systemd: read-only file system
evilhamsterman commented 2 weeks ago

I did ensure disableNodeConfiguration is set

❯ helm get values my-hpe-csi-driver
USER-SUPPLIED VALUES:
disableNodeConfiguration: true
datamattsson commented 2 weeks ago

Ok, here's the next one: 2.5.0-talos3.

helm uninstall my-hpe-csi-driver -nhpe-storage
helm repo update
helm install my-hpe-csi-driver -nhpe-storage datamattsson/hpe-csi-driver --version 2.5.0-talos3 --set disableNodeConfiguration=true
evilhamsterman commented 2 weeks ago

The pod starts but the initContainer immediately crashes

hpe-csi-node-init + '[' --endpoint=unix:///csi/csi.sock = --node-init ']'
hpe-csi-node-init + for arg in "$@"
hpe-csi-node-init + '[' --flavor=kubernetes = --node-service ']'
hpe-csi-node-init + '[' --flavor=kubernetes = --node-init ']'
hpe-csi-node-init + disableNodeConformance=
hpe-csi-node-init + disableNodeConfiguration=
hpe-csi-node-init + '[' true = true ']'
hpe-csi-node-init + '[' '' = true ']'
hpe-csi-node-init + '[' '' = true ']'
hpe-csi-node-init + '[' '' '!=' true ']'
hpe-csi-node-init + cp -f /opt/hpe-storage/lib/hpe-storage-node.service /etc/systemd/system/hpe-storage-node.service
hpe-csi-node-init + cp -f /opt/hpe-storage/lib/hpe-storage-node.sh /etc/hpe-storage/hpe-storage-node.sh
hpe-csi-node-init cp: cannot create regular file '/etc/hpe-storage/hpe-storage-node.sh': No such file or directory
evilhamsterman commented 2 weeks ago

It looks like the DISABLE_NODE_CONFIGURATION environment variable is not getting set on the initContainer:

spec:
  initContainers:
  - args:
    - --node-init
    - --endpoint=$(CSI_ENDPOINT)
    - --flavor=kubernetes
    env:
    - name: CSI_ENDPOINT
      value: unix:///csi/csi.sock
    image: quay.io/hpestorage/csi-driver:v2.5.0-beta
    imagePullPolicy: IfNotPresent
    name: hpe-csi-node-init
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        add:
        - SYS_ADMIN
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /host
      mountPropagation: Bidirectional
      name: root-dir
    - mountPath: /dev
      name: device-dir
    - mountPath: /sys
      name: sys
    - mountPath: /run/systemd
      name: runsystemd
    - mountPath: /csi
      name: plugin-dir
    - mountPath: /var/lib/kubelet
      name: pods-mount-dir
    - mountPath: /var/log
      name: log-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-xsr7w
      readOnly: true
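For reference, the entry that seems to be missing (mirroring what the main hpe-csi-driver container already has) would be something like this in the initContainer's env list:

```yaml
    - name: DISABLE_NODE_CONFIGURATION
      value: "true"
```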
datamattsson commented 2 weeks ago

This is very interesting, I think you just uncovered a different bug altogether. =)

datamattsson commented 2 weeks ago

Ok, talos4 has been published.

helm uninstall my-hpe-csi-driver -nhpe-storage
helm repo update
helm install my-hpe-csi-driver -nhpe-storage datamattsson/hpe-csi-driver --version 2.5.0-talos4 --set disableNodeConfiguration=true
evilhamsterman commented 2 weeks ago

I edited the DS to add the environment variable and used your latest update. The initContainer succeeds now, but then I think we get to the meat of the situation: the csi-node-driver-registrar starts crashing and the hpe-csi-driver container complains it can't find initiators. It looks like part of the problem is on the Talos side: their iscsi-tools extension doesn't appear to include the multipath command (https://github.com/siderolabs/extensions/issues/134). democratic-csi claims it's not needed, but I'm not an expert in iSCSI so I can't say how true that is: https://github.com/democratic-csi/democratic-csi/pull/225#issuecomment-1478699681

Container logs ``` hpe-csi-driver + '[' --endpoint=unix:///csi/csi.sock = --node-service ']' hpe-csi-driver + '[' --endpoint=unix:///csi/csi.sock = --node-init ']' csi-node-driver-registrar I0621 21:31:10.477623 1 main.go:157] Attempting to open a gRPC connection with: "/csi/csi.sock" csi-node-driver-registrar I0621 21:31:10.477723 1 connection.go:215] Connecting to unix:///csi/csi.sock csi-node-driver-registrar I0621 21:31:10.478469 1 main.go:164] Calling CSI driver to discover driver name hpe-csi-driver + for arg in "$@" hpe-csi-driver + '[' --node-service = --node-service ']' hpe-csi-driver + nodeService=true hpe-csi-driver + '[' --node-service = --node-init ']' hpe-csi-driver + for arg in "$@" hpe-csi-node-init + for arg in "$@" hpe-csi-node-init + '[' --endpoint=unix:///csi/csi.sock = --node-service ']' hpe-csi-node-init + '[' --endpoint=unix:///csi/csi.sock = --node-init ']' hpe-csi-node-init + for arg in "$@" hpe-csi-node-init + '[' --flavor=kubernetes = --node-service ']' hpe-csi-node-init + '[' --flavor=kubernetes = --node-init ']' hpe-csi-node-init + disableNodeConformance= hpe-csi-node-init + disableNodeConfiguration=true hpe-csi-node-init + '[' true = true ']' hpe-csi-node-init + '[' '' = true ']' csi-node-driver-registrar I0621 21:31:10.478557 1 connection.go:244] GRPC call: /csi.v1.Identity/GetPluginInfo csi-node-driver-registrar I0621 21:31:10.478586 1 connection.go:245] GRPC request: {} csi-node-driver-registrar I0621 21:31:10.481052 1 connection.go:251] GRPC response: {"name":"csi.hpe.com","vendor_version":"1.3"} csi-node-driver-registrar I0621 21:31:10.481063 1 connection.go:252] GRPC error: hpe-csi-node-init + '[' true = true ']' hpe-csi-node-init + echo 'Node configuration is disabled' hpe-csi-node-init + disableConformanceCheck=true hpe-csi-node-init + '[' true '!=' true ']' hpe-csi-node-init + exec /bin/csi-driver --node-init --endpoint=unix:///csi/csi.sock --flavor=kubernetes hpe-csi-node-init Node configuration is disabled hpe-csi-node-init time="2024-06-21T21:31:06Z" level=info msg="Initialized logging." 
alsoLogToStderr=true logFileLocation=/var/log/hpe-csi-controller.log logLevel=info hpe-csi-node-init time="2024-06-21T21:31:06Z" level=info msg="**********************************************" file="csi-driver.go:56" hpe-csi-driver + '[' --flavor=kubernetes = --node-service ']' hpe-csi-node-init time="2024-06-21T21:31:06Z" level=info msg="*************** HPE CSI DRIVER ***************" file="csi-driver.go:57" hpe-csi-driver + '[' --flavor=kubernetes = --node-init ']' hpe-csi-node-init time="2024-06-21T21:31:06Z" level=info msg="**********************************************" file="csi-driver.go:58" hpe-csi-driver + for arg in "$@" hpe-csi-node-init time="2024-06-21T21:31:06Z" level=info msg=">>>>> CMDLINE Exec, args: ]" file="csi-driver.go:60" hpe-csi-driver + '[' --node-monitor = --node-service ']' hpe-csi-node-init W0621 21:31:06.910459 1 reflector.go:424] hpe-csi-driver/pkg/flavor/kubernetes/flavor.go:145: failed to list *v1.VolumeSnapshot: volumesnapshots.snapshot.storage.k8s.io is forbidden: User "system:serviceaccount:hpe-storage:hpe-csi-node-sa" cannot list resource "volumesnapshots" in API group "snapshot.storage.k8s.io" at the cluster scope hpe-csi-driver + '[' --node-monitor = --node-init ']' hpe-csi-driver + for arg in "$@" hpe-csi-driver + '[' --node-monitor-interval=30 = --node-service ']' hpe-csi-driver + '[' --node-monitor-interval=30 = --node-init ']' hpe-csi-driver + disableNodeConformance= hpe-csi-driver + disableNodeConfiguration=true hpe-csi-driver + '[' '' = true ']' hpe-csi-driver + '[' true = true ']' hpe-csi-driver + echo 'copying hpe log collector diag script' hpe-csi-driver copying hpe log collector diag script hpe-csi-driver + cp -f /opt/hpe-storage/bin/hpe-logcollector.sh /usr/local/bin/hpe-logcollector.sh hpe-csi-driver + chmod +x /usr/local/bin/hpe-logcollector.sh hpe-csi-driver + '[' '!' -f /host/etc/multipath.conf ']' hpe-csi-driver + '[' true '!=' true ']' hpe-csi-driver + ln -s /host/etc/multipath.conf /etc/multipath.conf hpe-csi-driver + ln -s /host/etc/multipath /etc/multipath hpe-csi-driver + ln -s /host/etc/iscsi /etc/iscsi hpe-csi-driver + '[' -f /host/etc/redhat-release ']' hpe-csi-driver + '[' -f /host/etc/os-release ']' hpe-csi-driver + rm /etc/os-release csi-node-driver-registrar I0621 21:31:10.481070 1 main.go:173] CSI driver name: "csi.hpe.com" csi-node-driver-registrar I0621 21:31:10.481110 1 node_register.go:55] Starting Registration Server at: /registration/csi.hpe.com-reg.sock csi-node-driver-registrar I0621 21:31:10.481557 1 node_register.go:64] Registration Server started at: /registration/csi.hpe.com-reg.sock csi-node-driver-registrar I0621 21:31:10.481696 1 node_register.go:88] Skipping HTTP server because endpoint is set to: "" csi-node-driver-registrar I0621 21:31:11.759238 1 main.go:90] Received GetInfo call: &InfoRequest{} csi-node-driver-registrar I0621 21:31:11.777891 1 main.go:101] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = Failed to get initiators for host,} csi-node-driver-registrar E0621 21:31:11.778006 1 main.go:103] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = Failed to get initiators for host, restarting registration container. 
hpe-csi-node-init E0621 21:31:06.910696 1 reflector.go:140] hpe-csi-driver/pkg/flavor/kubernetes/flavor.go:145: Failed to watch *v1.VolumeSnapshot: failed to list *v1.VolumeSnapshot: volumesnapshots.snapshot.storage.k8s.io is forbidden: User "system:serviceaccount:hpe-storage:hpe-csi-node-sa" cannot list resource "volumesnapshots" in API group "snapshot.storage.k8s.io" at the cluster scope hpe-csi-node-init E0621 21:31:06.910770 1 reflector.go:140] hpe-csi-driver/pkg/flavor/kubernetes/flavor.go:127: Failed to watch *v1.PersistentVolumeClaim: unknown (get persistentvolumeclaims) hpe-csi-node-init time="2024-06-21T21:31:06Z" level=error msg="process with pid : 11 finished with error = exit status 127" file="cmd.go:63" hpe-csi-node-init time="2024-06-21T21:31:06Z" level=error msg="Error while getting the multipath devices on the node " file="utils.go:11" hpe-csi-driver + ln -s /host/etc/os-release /etc/os-release hpe-csi-driver + echo 'starting csi plugin...' hpe-csi-driver + exec /bin/csi-driver --endpoint=unix:///csi/csi.sock --node-service --flavor=kubernetes --node-monitor --node-monitor-interval=30 hpe-csi-driver starting csi plugin... hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Initialized logging." alsoLogToStderr=true logFileLocation=/var/log/hpe-csi-node.log logLevel=info hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="**********************************************" file="csi-driver.go:56" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="*************** HPE CSI DRIVER ***************" file="csi-driver.go:57" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="**********************************************" file="csi-driver.go:58" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg=">>>>> CMDLINE Exec, args: ]" file="csi-driver.go:60" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Node configuration is disabled, DISABLE_NODE_CONFIGURATION=true.Skipping the Multipath and ISCSI configurations" file="csi-driver.go:142" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="NODE MONITOR: &{flavor:0xc0001f4d10 intervalSec:30 lock:{state:0 sema:0} started:false stopChannel: done: nodeName:talos-nvj-4af}" file="nodemonitor.go:26" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling controller service capability: CREATE_DELETE_VOLUME" file="driver.go:250" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME" file="driver.go:250" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling controller service capability: LIST_VOLUMES" file="driver.go:250" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling controller service capability: CREATE_DELETE_SNAPSHOT" file="driver.go:250" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling controller service capability: LIST_SNAPSHOTS" file="driver.go:250" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling controller service capability: CLONE_VOLUME" file="driver.go:250" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling controller service capability: PUBLISH_READONLY" file="driver.go:250" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling controller service capability: EXPAND_VOLUME" file="driver.go:250" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling node service capability: STAGE_UNSTAGE_VOLUME" file="driver.go:267" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling node 
service capability: EXPAND_VOLUME" file="driver.go:267" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling node service capability: GET_VOLUME_STATS" file="driver.go:267" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling volume expansion type: ONLINE" file="driver.go:281" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling volume access mode: SINGLE_NODE_WRITER" file="driver.go:293" Stream closed EOF for hpe-storage/hpe-csi-node-pn769 (csi-node-driver-registrar) hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling volume access mode: SINGLE_NODE_READER_ONLY" file="driver.go:293" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling volume access mode: MULTI_NODE_READER_ONLY" file="driver.go:293" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling volume access mode: MULTI_NODE_SINGLE_WRITER" file="driver.go:293" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Enabling volume access mode: MULTI_NODE_MULTI_WRITER" file="driver.go:293" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="DB service disabled!!!" file="driver.go:145" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="About to start the CSI driver 'csi.hpe.com with KubeletRootDirectory /var/lib/kubelet/'" file="csi-driver.go:186" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="[1] reply : [/bin/csi-driver --endpoint=unix:///csi/csi.sock --node-service --flavor=kubernetes --node-monitor --node-monitor-interval=30]" file="csi-driver.go:189" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Listening for connections on address: &net.UnixAddr{Name:\"//csi/csi.sock\", Net:\"unix\"}" file="server.go:86" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="Scheduled ephemeral inline volumes scrubber task to run every 3600 seconds, PodsDirPath: [/var/lib/kubelet/pods]" file="driver.go:214" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg=">>>>> Scrubber task invoked at 2024-06-21 21:31:07.639939957 +0000 UTC m=+0.038113292" file="driver.go:746" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="No ephemeral inline volumes found" file="driver.go:815" hpe-csi-driver time="2024-06-21T21:31:07Z" level=info msg="<<<<< Scrubber task completed at 2024-06-21 21:31:07.644313576 +0000 UTC m=+0.042486921" file="driver.go:751" hpe-csi-driver time="2024-06-21T21:31:08Z" level=info msg="GRPC call: /csi.v1.Identity/GetPluginInfo" file="utils.go:69" hpe-csi-driver time="2024-06-21T21:31:08Z" level=info msg="GRPC request: {}" file="utils.go:70" hpe-csi-driver time="2024-06-21T21:31:08Z" level=info msg=">>>>> GetPluginInfo" file="identity_server.go:16" hpe-csi-driver time="2024-06-21T21:31:08Z" level=info msg="<<<<< GetPluginInfo" file="identity_server.go:19" hpe-csi-driver time="2024-06-21T21:31:08Z" level=info msg="GRPC response: {\"name\":\"csi.hpe.com\",\"vendor_version\":\"1.3\"}" file="utils.go:75" hpe-csi-driver time="2024-06-21T21:31:09Z" level=info msg="GRPC call: /csi.v1.Node/NodeGetInfo" file="utils.go:69" hpe-csi-driver time="2024-06-21T21:31:09Z" level=info msg="GRPC request: {}" file="utils.go:70" hpe-csi-driver time="2024-06-21T21:31:09Z" level=info msg="Writing uuid to file:/etc/hpe-storage/node.gob uuid:fb4b2815-d7b3-8e09-bf95-39eb01fb29ed" file="chapidriver_linux.go:52" hpe-csi-driver time="2024-06-21T21:31:09Z" level=error msg="process with pid : 20 finished with error = exit status 127" file="cmd.go:63" hpe-csi-driver time="2024-06-21T21:31:09Z" level=info msg="Host name 
reported as talos-nvj-4af" file="node_server.go:2087" hpe-csi-driver time="2024-06-21T21:31:09Z" level=warning msg="no fc adapters found on the host" file="fc.go:49" hpe-csi-driver time="2024-06-21T21:31:09Z" level=error msg="Failed to get initiators for host talos-nvj-4af. Error: iscsi and fc initiators not found" file="node_server.go:2091" hpe-csi-driver time="2024-06-21T21:31:09Z" level=error msg="GRPC error: rpc error: code = Internal desc = Failed to get initiators for host" file="utils.go:73" hpe-csi-driver time="2024-06-21T21:31:10Z" level=info msg="GRPC call: /csi.v1.Identity/GetPluginInfo" file="utils.go:69" hpe-csi-driver time="2024-06-21T21:31:10Z" level=info msg="GRPC request: {}" file="utils.go:70" hpe-csi-driver time="2024-06-21T21:31:10Z" level=info msg=">>>>> GetPluginInfo" file="identity_server.go:16" hpe-csi-driver time="2024-06-21T21:31:10Z" level=info msg="<<<<< GetPluginInfo" file="identity_server.go:19" hpe-csi-driver time="2024-06-21T21:31:10Z" level=info msg="GRPC response: {\"name\":\"csi.hpe.com\",\"vendor_version\":\"1.3\"}" file="utils.go:75" hpe-csi-driver time="2024-06-21T21:31:11Z" level=info msg="GRPC call: /csi.v1.Node/NodeGetInfo" file="utils.go:69" hpe-csi-driver time="2024-06-21T21:31:11Z" level=info msg="GRPC request: {}" file="utils.go:70" hpe-csi-driver time="2024-06-21T21:31:11Z" level=info msg="Host name reported as talos-nvj-4af" file="node_server.go:2087" hpe-csi-driver time="2024-06-21T21:31:11Z" level=warning msg="no fc adapters found on the host" file="fc.go:49" hpe-csi-driver time="2024-06-21T21:31:11Z" level=error msg="process with pid : 23 finished with error = exit status 127" file="cmd.go:63" hpe-csi-driver time="2024-06-21T21:31:11Z" level=error msg="Failed to get initiators for host talos-nvj-4af. Error: iscsi and fc initiators not found" file="node_server.go:2091" hpe-csi-driver time="2024-06-21T21:31:11Z" level=error msg="GRPC error: rpc error: code = Internal desc = Failed to get initiators for host" file="utils.go:73" Stream closed EOF for hpe-storage/hpe-csi-node-pn769 (hpe-csi-node-init) ```
evilhamsterman commented 2 weeks ago

Not sure how much help it is, but looking at your code it looks like perhaps the main issue is that you're looking for the /etc/iscsi/initiatorname.iscsi file, but that file doesn't exist in the normal place on their system. Their extension bind mounts /usr/local/etc/iscsi/iscsid.conf into the extension container at /etc/iscsi/iscsid.conf (https://github.com/siderolabs/extensions/blob/f0b6082466dc78a309d1e9a7d8525497d714d4d4/storage/iscsi-tools/iscsid.yaml#L52C5-L53C42), but it doesn't mount the rest of the iSCSI folder, so the initiator name is not accessible to you.

Looks to me like they need to mount the full /usr/local/etc/iscsi directory so that your driver can access that file; I assume that's how you get the initiator to register with the storage.

evilhamsterman commented 2 weeks ago

EUREKA! I found it: they do mount the /etc/iscsi directory into /system/iscsi on the host. I shelled into the hpe-csi-node pod's hpe-csi-driver container and changed the link from /etc/iscsi -> /host/etc/iscsi to point at /host/system/iscsi instead. When the registrar next restarted, the driver container was able to find the initiator name and everything is now running.
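For anyone else trying this, the manual poke was effectively the following (pod/container names from this cluster; the symlink is recreated by the container entrypoint, so this only lasts until the pod restarts):

kubectl exec -n hpe-storage hpe-csi-node-5t69x -c hpe-csi-driver -- sh -c 'rm -f /etc/iscsi && ln -s /host/system/iscsi /etc/iscsi'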

❯ k get pods
NAME                                  READY   STATUS    RESTARTS        AGE
hpe-csi-controller-8447c48d9f-rjd49   9/9     Running   0               22m
hpe-csi-node-5t69x                    2/2     Running   9 (5m45s ago)   22m
nimble-csp-74776998b6-fmcn2           1/1     Running   0               22m
primera3par-csp-58dd48cccb-lvvjb      1/1     Running   0               22m

Obviously that will break when that pod restarts. But I then created a StorageClass and a PVC and it worked right away:

❯ k get pvc
NAME           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS     VOLUMEATTRIBUTESCLASS   AGE
my-first-pvc   Bound    pvc-dc881628-ffe4-42c9-951e-e266502dd226   32Gi       RWO            csq-it-nimble1   <unset>                 2m5s

and I can see the volume on the array.
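For reference, the StorageClass/PVC pair was nothing special; roughly this (the backend Secret name/namespace are assumptions following the SCOD examples, the rest matches the output above):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csq-it-nimble1
provisioner: csi.hpe.com
parameters:
  # backend credentials Secret; name/namespace are assumptions per the SCOD docs
  csi.storage.k8s.io/provisioner-secret-name: hpe-backend
  csi.storage.k8s.io/provisioner-secret-namespace: hpe-storage
  csi.storage.k8s.io/controller-publish-secret-name: hpe-backend
  csi.storage.k8s.io/controller-publish-secret-namespace: hpe-storage
  csi.storage.k8s.io/controller-expand-secret-name: hpe-backend
  csi.storage.k8s.io/controller-expand-secret-namespace: hpe-storage
  csi.storage.k8s.io/node-stage-secret-name: hpe-backend
  csi.storage.k8s.io/node-stage-secret-namespace: hpe-storage
  csi.storage.k8s.io/node-publish-secret-name: hpe-backend
  csi.storage.k8s.io/node-publish-secret-namespace: hpe-storage
  csi.storage.k8s.io/fstype: xfs
reclaimPolicy: Delete
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-first-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 32Gi
  storageClassName: csq-it-nimble1
```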

The last step, mounting it, is the one remaining issue: it does not successfully mount the volume. That appears to be related to the multipath issue I mentioned. I'm signing off for the weekend; I'll look more on Monday.

datamattsson commented 2 weeks ago

I should've researched this, but is /usr/local/etc writable in Talos? (Or, what directory IS writable and/or persistent on Talos?) I'm thinking we could just add a Helm chart parameter for CSI driver users to relocate /etc to whatever directory works on the node.
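Something like this in values.yaml, perhaps (disableNodeConfiguration already exists; the relocation knob below is entirely hypothetical and does not exist in the chart today):

```yaml
# sketch of a hypothetical chart parameter
disableNodeConfiguration: true
nodeEtcHostPath: /var/lib/hpe-storage-etc   # hypothetical: where the node's /etc/hpe-storage hostPath would live
```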

As for commands the CSI driver needs to have available, look for clues here: https://github.com/hpe-storage/csi-driver/blob/master/Dockerfile

As for the multipath issue, the HPE CSI Driver requires multipath/multipathd on the host; there's no workaround, as we don't even consider non-multipath entries.

I'm out of pocket for the rest of the weekend as well, cheers!

evilhamsterman commented 1 week ago

I've exhausted the time I can work on this for now, but this is what I found messing around some more. Hopefully it can help you get on the correct path, but it certainly looks like it's going to require more work than just changing the mount location. It does give me a little better idea for my planning, though: I'll probably need to plan on a longer timeline for support.

It looks like /system is supposed to be the location for "persistent" data, but it appears they mean persistent for the extension container lifecycle: the data survives a restart of the extension, but not a reboot. The path /system/state, which contains the node config, is persistent, and /var, which is used as storage for container images, is persistent across reboots, but I believe that is not guaranteed.

However, because the extensions are not persistent across reboots, things like the initiator name are not consistent; a new one is generated on every boot. Because of this I don't think it's a good idea to try to persist your node ID on disk like we discussed earlier. Either it should be generated dynamically, or you should use the Kubernetes node ID and store extra persistent data in a ConfigMap or CRD. In my opinion this is more in line with the general idea of Kubernetes anyway, and with cattle-vs-pets workflows.

Overall I see two, maybe three, major problems. One will require changes from Talos; the others will require work on your driver.

  1. Their iscsi-tools extension doesn't include multipath support, as I mentioned above. I've commented on their ticket; hopefully we can get some attention from them.
  2. Because iscsi-tools runs as an OS-level container, it also has a very limited subset of tools. I was able to get iscsiadm to work by changing the chroot script to use nsenter instead, though maybe it would work without using env.

    #!/bin/bash
    
    iscsi_pid=$(pgrep -f "iscsid -f")
    
    nsenter --mount="/proc/$iscsi_pid/ns/mnt" --net="/proc/$iscsi_pid/ns/net" -- /usr/local/sbin/iscsiadm "${@:1}"
  3. You may have to use your own binaries for some operations and/or limit FS support to just XFS with Talos. The host system has binaries for XFS and vfat, but no mount or ext4/btrfs binaries. Here's a dump of all the binaries available on the host (note the iSCSI ones come from the iscsi-tools extension):
    
    / # find /host -name "*bin" -type d 2>/dev/null | grep -v var | grep -v container | xargs ls
    /host/bin:
    containerd               containerd-shim-runc-v2
    containerd-shim          runc

     /host/opt/cni/bin: bandwidth firewall ipvlan ptp tuning bridge flannel loopback sbr vlan dhcp host-device macvlan static vrf dummy host-local portmap tap

     /host/sbin: blkdeactivate lvm udevadm dashboard lvm_import_vdo udevd dmsetup lvmconfig vgcfgbackup dmstats lvmdevices vgcfgrestore dmstats.static lvmdiskscan vgchange fsadm lvmdump vgck fsck.xfs lvmsadc vgconvert init lvmsar vgcreate ip6tables lvreduce vgdisplay ip6tables-apply lvremove vgexport ip6tables-legacy lvrename vgextend ip6tables-legacy-restore lvresize vgimport ip6tables-legacy-save lvs vgimportclone ip6tables-restore lvscan vgimportdevices ip6tables-save mkfs.xfs vgmerge iptables modprobe vgmknodes iptables-apply poweroff vgreduce iptables-legacy pvchange vgremove iptables-legacy-restore pvck vgrename iptables-legacy-save pvcreate vgs iptables-restore pvdisplay vgscan iptables-save pvmove vgsplit lvchange pvremove wrapperd lvconvert pvresize xfs_repair lvcreate pvs xtables-legacy-multi lvdisplay pvscan lvextend shutdown

     /host/usr/bin: udevadm

     /host/usr/local/bin:

     /host/usr/local/sbin: brcm_iscsiuio iscsi_offload iscsiuio iscsi-gen-initiatorname iscsiadm tgtadm iscsi-iname iscsid tgtd iscsi_discovery iscsid-wrapper tgtimg iscsi_fw_login iscsistart

     /host/usr/sbin: cryptsetup mkfs.fat xfs_freeze xfs_ncheck dosfsck mkfs.msdos xfs_fsr xfs_quota dosfslabel mkfs.vfat xfs_growfs xfs_rtcp fatlabel veritysetup xfs_info xfs_scrub fsck.fat xfs_admin xfs_io xfs_scrub_all fsck.msdos xfs_bmap xfs_logprint xfs_spaceman fsck.vfat xfs_copy xfs_mdrestore integritysetup xfs_db xfs_metadump mkdosfs xfs_estimate xfs_mkfile

datamattsson commented 1 week ago

Thanks for the additional context. This definitely needs more work. I'm just puzzled how we can't even persist an IQN on the host, though. Do we need to grab the first-boot one, store it in our CRD, and regenerate the host IQN from that?

I guess FC wouldn't have as many problems, but we still would need multipath/multipathd regardless. Not having ext4 available will also create problems for our NFS server implementation for RWX claims, which doesn't play nicely with XFS in failure scenarios.

evilhamsterman commented 1 week ago

> Thanks for the additional context. This definitely needs more work. I'm just puzzled how we can't even persist an IQN on the host though? Do we need to grab the first boot one and store in our CRD and regenerate the host IQN from that?

It doesn't look like you can manage the IQN, their service generates one itself.

Just my thoughts, but I can think of two ways to deal with it:

  1. Don't care about it: have the controller only add an IQN to the array when needed and remove it when not needed. For example, you are running a DB on a node with a PV, then you drain the node; the DB gets rescheduled on a different node, the old node is no longer needed, and the controller removes it from the array. The controller would also keep track of known initiators and occasionally check whether they are still live in the Kubernetes cluster, removing them from the array if not, to catch cases where nodes disappear. This would be the cattle-vs-pets option.
  2. Use the new ExtensionServiceConfig to specify IQNs: https://www.talos.dev/v1.7/reference/configuration/extensions/extensionserviceconfig/. This would require that Talos add support for it, and administrators would have to generate IQNs for their systems, which could be error-prone (see the sketch below this list).
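For option 2, a rough sketch of what that could look like (ExtensionServiceConfig is real Talos machine config, but the service name below is an assumption, and whether the iscsi-tools extension would actually honor a mounted initiatorname.iscsi is exactly the support that would need to be added):

```yaml
# Talos ExtensionServiceConfig document (sketch) -- one per node, with an admin-generated IQN
apiVersion: v1alpha1
kind: ExtensionServiceConfig
name: iscsid   # assumed name of the iscsi-tools extension service
configFiles:
  - content: InitiatorName=iqn.2005-03.org.open-iscsi:worker-01
    mountPath: /etc/iscsi/initiatorname.iscsi
```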

> I guess FC wouldn't have as many problems but we still would need multipath/multipathd regardless. Not having ext4 available will also create problems for our NFS server implementation for RWX claims that doesn't play nicely with XFS in failure scenarios.

Looking around at other CSI iSCSI implementations, it looks like many of them use their own mkfs and mount binaries rather than relying on the host.