hetznercloud / csi-driver

Kubernetes Container Storage Interface driver for Hetzner Cloud Volumes
MIT License

DaemonSet CrashLoopBackOff in OpenShift #404

Open itmwiw opened 1 year ago

itmwiw commented 1 year ago

Hello, I have an OpenShift cluster and I am trying to use the hetznercloud csi-driver. However, all of the DaemonSet's pods are in the CrashLoopBackOff state. Here are the logs:

[pod/hcloud-csi-node-45xqq/hcloud-csi-driver] level=error ts=2023-04-11T14:33:12.085976239Z msg="failed to fetch server ID from metadata service" err="Get \"http://169.254.169.254/hetzner/v1/metadata/instance-id\": dial tcp 169.254.169.254:80: connect: connection refused"

I guess this is related to what is described in https://github.com/hetznercloud/csi-driver/issues/143. That issue was closed because version 1.6.0 tries the environment variable HCLOUD_SERVER_ID, or KUBE_NODE_NAME with a call to HCloudClient, before falling back to the MetadataClient. However, v2.2.0 no longer does that, so I guess the issue is back. Can you help me with this? Regards, Tarik

apricote commented 1 year ago

Hey, this was changed in #269, so we can remove access to the Hetzner Cloud API from the daemon set. We would prefer to keep the daemon set ("node" binary) as small as possible, so adding back access to the API is not what we want.

@samcday Do you have an idea how we can solve this for OpenShift where access to the metadata service is blocked?

apricote commented 1 year ago

Oh, forgot to mention: the Server ID and Location, which are the two fields retrieved from the Metadata Service, are used in the response to NodeGetInfo: https://github.com/hetznercloud/csi-driver/blob/cbb7750af17224e256fcb62da5358a9743080a9f/driver/node.go#L194-L205
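To make the role of the two fields concrete, here is a small Go sketch of how a server ID and a location could end up in a NodeGetInfo-style response. The struct types are local stand-ins for the CSI messages rather than the real generated types, and the topology key is an assumption, not copied from the linked code.

```go
package main

import "strconv"

// topology is a local stand-in for the CSI Topology message.
type topology struct {
	Segments map[string]string
}

// nodeGetInfoResponse is a local stand-in for the CSI NodeGetInfoResponse message.
type nodeGetInfoResponse struct {
	NodeID             string
	AccessibleTopology *topology
}

// locationKey is an assumed topology key; the real driver may use a different one.
const locationKey = "csi.hetzner.cloud/location"

// buildNodeInfo (hypothetical) turns the two metadata values into a response:
// the server ID becomes the node ID, the location becomes a topology segment.
func buildNodeInfo(serverID int64, location string) nodeGetInfoResponse {
	return nodeGetInfoResponse{
		NodeID: strconv.FormatInt(serverID, 10),
		AccessibleTopology: &topology{
			Segments: map[string]string{locationKey: location},
		},
	}
}
```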

samcday commented 1 year ago

Hm. Tricky one. My original hope was to use k8s Node metadata as source of truth for this, thus tying csi-driver to hccm. But of course that violates the CSI abstraction and won't work for other container orchestrators.

Ultimately, if we want to determine this information from a particular node without assuming any access to a control plane / orchestrator API of any kind, we can only fetch it from the metadata service or fall back to statically provided information.

... Or we just add back the HCLOUD_TOKEN requirement for the node binary, so that it can fetch this info from the API. That would be a bummer from a purist technical point of view, but maybe it's the only way we can keep the CSI driver running reliably (and reasonably ergonomically!) across multiple orchestrators.

samcday commented 1 year ago

One other somewhat hacky idea: we could do the metadata API lookup in a small initContainer that uses hostNetwork: true and then pass that information along to the main (not host-networking) process.
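A minimal Go sketch of what such a lookup helper might contain follows. The URL is the one from the error log at the top of this issue; the function names and the choice of handing the result over through a file are assumptions for illustration only. The location lookup is omitted because its endpoint does not appear in this thread.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strconv"
	"strings"
	"time"
)

// instanceIDURL is the endpoint shown in the error log of this issue.
const instanceIDURL = "http://169.254.169.254/hetzner/v1/metadata/instance-id"

// parseInstanceID converts the raw response body into a numeric server ID.
func parseInstanceID(body string) (int64, error) {
	id, err := strconv.ParseInt(strings.TrimSpace(body), 10, 64)
	if err != nil {
		return 0, fmt.Errorf("unexpected instance-id response %q: %w", body, err)
	}
	return id, nil
}

// fetchInstanceID queries the given metadata URL with a short timeout.
func fetchInstanceID(url string) (int64, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("metadata service returned %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, err
	}
	return parseInstanceID(string(body))
}

// writeServerID stores the ID in a file (hypothetically on a volume shared
// with the main container) so that a later process can read it.
func writeServerID(path string, id int64) error {
	return os.WriteFile(path, []byte(strconv.FormatInt(id, 10)+"\n"), 0o644)
}
```

A main function for the helper would call fetchInstanceID(instanceIDURL), pass the result to writeServerID, and exit non-zero on any error so that the pod does not start with missing data.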

apricote commented 1 year ago

> One other somewhat hacky idea: we could do the metadata API lookup in a small initContainer that uses hostNetwork: true and then pass that information along to the main (not host-networking) process.

Perhaps this is something that can be done only for Openshift through the Helm Chart?

samcday commented 1 year ago

> Perhaps this is something that can be done only for Openshift through the Helm Chart?

Yes, that sounds good :+1: Or even more generally: just a thing that you can opt into through values.yaml: helm install csi-driver --set initMetadataLookup=true or somesuch.


That said, it might just be better to always do it that way and keep the number of different deployment modes to a minimum. With such an approach, the node binary could drop all notion of the Hetzner Cloud API or the metadata service, and require that all necessary metadata/topology info is injected through env. Some of this env would come from the downward API, the rest from the proposed init container.
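Under that approach, startup in the node binary could reduce to something like the following Go sketch, which fails fast when a value is missing instead of calling any API. HCLOUD_SERVER_ID appears earlier in this thread; HCLOUD_LOCATION, nodeInfo and loadNodeInfo are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// nodeInfo holds the two values the node binary needs for NodeGetInfo.
type nodeInfo struct {
	ServerID int64
	Location string
}

// loadNodeInfo (hypothetical) reads both values from the environment and
// returns an error if either is missing or malformed. It never falls back
// to the metadata service or the Hetzner Cloud API.
func loadNodeInfo(getenv func(string) string) (nodeInfo, error) {
	rawID := getenv("HCLOUD_SERVER_ID")
	if rawID == "" {
		return nodeInfo{}, fmt.Errorf("HCLOUD_SERVER_ID is not set")
	}
	id, err := strconv.ParseInt(rawID, 10, 64)
	if err != nil {
		return nodeInfo{}, fmt.Errorf("invalid HCLOUD_SERVER_ID %q: %w", rawID, err)
	}
	// HCLOUD_LOCATION is an assumed variable name, not one the driver defines.
	location := getenv("HCLOUD_LOCATION")
	if location == "" {
		return nodeInfo{}, fmt.Errorf("HCLOUD_LOCATION is not set")
	}
	return nodeInfo{ServerID: id, Location: location}, nil
}
```

Taking the lookup function as a parameter (os.Getenv in production) keeps the sketch easy to test without touching the process environment.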

alrf commented 1 year ago

I have the same issue in Openshift.

alrf commented 1 year ago

I solved it in v2.3.2 by setting Topology=false here: https://github.com/hetznercloud/csi-driver/blob/dfe6183f4d0fddeefdff8069b1c09eeb38113b33/deploy/kubernetes/hcloud-csi.yml#L225 and adding hostNetwork: true to the DaemonSet on line 298: https://github.com/hetznercloud/csi-driver/blob/dfe6183f4d0fddeefdff8069b1c09eeb38113b33/deploy/kubernetes/hcloud-csi.yml#L298

github-actions[bot] commented 1 year ago

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.