scylladb / scylla-operator

The Kubernetes Operator for ScyllaDB
https://operator.docs.scylladb.com/
Apache License 2.0

node-setup-daemon fails on karpenter nodes #1839

Closed marcustut closed 5 days ago

marcustut commented 5 months ago

What happened?

I followed the guide to set up EKS, but since I already had an existing cluster with Karpenter, I didn't create a new cluster from the provided eks-cluster.yaml. After deploying the operator and the local-csi-driver, I deployed a ScyllaCluster manifest, but the cluster-node-setup pod kept failing.
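Roughly, the sequence was as follows (the manifest paths here are illustrative; I applied the ones from the operator docs):

```console
# Deploy the operator and the local CSI driver (paths are illustrative)
$ kubectl apply --server-side -f deploy/operator.yaml
$ kubectl apply --server-side -f deploy/local-csi-driver/
# Then the ScyllaCluster itself
$ kubectl apply --server-side -f scyllacluster.yaml
```

This is the error from the failing cluster-node-setup pod: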

```
++ mktemp -d
+ cd /tmp/tmp.eE7xsS9yC8
++ find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n'
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./snap
+ mount --rbind /host/snap ./snap
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./home
+ mount --rbind /host/home ./home
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./sys
+ mount --rbind /host/sys ./sys
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./boot
+ mount --rbind /host/boot ./boot
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./root
+ mount --rbind /host/root ./root
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./lost+found
+ mount --rbind /host/lost+found ./lost+found
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./opt
+ mount --rbind /host/opt ./opt
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./media
+ mount --rbind /host/media ./media
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./mnt
+ mount --rbind /host/mnt ./mnt
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./usr
+ mount --rbind /host/usr ./usr
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./etc
+ mount --rbind /host/etc ./etc
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./dev
+ mount --rbind /host/dev ./dev
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./host
+ mount --rbind /host/host ./host
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./proc
+ mount --rbind /host/proc ./proc
+ for d in $( find /host -mindepth 1 -maxdepth 1 -type d -not -path /host/var -not -path /host/run -not -path /host/tmp -printf '%f\n' )
+ mkdir -p ./srv
+ mount --rbind /host/srv ./srv
++ find /host -mindepth 1 -maxdepth 1 -type f -printf '%f\n'
+ find /host -mindepth 1 -maxdepth 1 -type l -exec cp -P '{}' ./ ';'
+ for dir in "run/udev" "run/mdadm" "run/dbus"
+ '[' -d /host/run/udev ']'
+ mkdir -p ./run/udev
+ mount --rbind /host/run/udev ./run/udev
+ for dir in "run/udev" "run/mdadm" "run/dbus"
+ '[' -d /host/run/mdadm ']'
+ for dir in "run/udev" "run/mdadm" "run/dbus"
+ '[' -d /host/run/dbus ']'
+ mkdir -p ./run/dbus
+ mount --rbind /host/run/dbus ./run/dbus
+ for dir in "run/crio" "run/containerd"
+ '[' -d /host/run/crio ']'
+ for dir in "run/crio" "run/containerd"
+ '[' -d /host/run/containerd ']'
+ mkdir -p ./run/containerd
+ mount --rbind /host/run/containerd ./run/containerd
+ '[' -f /host/run/dockershim.sock ']'
+ '[' -d /host/var/lib/kubelet ']'
+ mkdir -p ./var/lib/kubelet
+ mount --rbind /host/var/lib/kubelet ./var/lib/kubelet
+ mkdir -p ./scylla-operator
+ touch ./scylla-operator/scylla-operator
+ mount --bind /usr/bin/scylla-operator ./scylla-operator/scylla-operator
+ mkdir -p ./run/secrets/kubernetes.io/serviceaccount
+ for f in ca.crt token
+ touch ./run/secrets/kubernetes.io/serviceaccount/ca.crt
+ mount --bind /run/secrets/kubernetes.io/serviceaccount/ca.crt ./run/secrets/kubernetes.io/serviceaccount/ca.crt
+ for f in ca.crt token
+ touch ./run/secrets/kubernetes.io/serviceaccount/token
+ mount --bind /run/secrets/kubernetes.io/serviceaccount/token ./run/secrets/kubernetes.io/serviceaccount/token
+ '[' -L /host/var/run ']'
+ mkdir -p ./var
+ ln -s ../run ./var/run
+ exec chroot ./ /scylla-operator/scylla-operator node-setup-daemon --namespace=scylla-operator-node-tuning --pod-name=cluster-node-setup-jhjpj --node-name=ip-192-168-14-249.ap-south-1.compute.internal --node-config-name=cluster --node-config-uid=a7e98cb5-acc5-41a2-af1f-c44b48ca9f03 --scylla-image=docker.io/scylladb/scylla:5.4.0 --disable-optimizations=false --loglevel=4
2024/03/16 14:02:13 maxprocs: Leaving GOMAXPROCS=2: CPU quota undefined
I0316 14:02:13.164326       1 operator/nodesetupdaemon.go:172] node-setup-daemon version "v1.12.0-beta.0-26-gc181cf2"
I0316 14:02:13.164345       1 flag/flags.go:64] FLAG: --burst="75"
I0316 14:02:13.164350       1 flag/flags.go:64] FLAG: --cri-endpoint="[unix:///var/run/dockershim.sock,unix:///run/containerd/containerd.sock,unix:///run/crio/crio.sock]"
I0316 14:02:13.164359       1 flag/flags.go:64] FLAG: --disable-optimizations="false"
I0316 14:02:13.164363       1 flag/flags.go:64] FLAG: --feature-gates=""
I0316 14:02:13.164368       1 flag/flags.go:64] FLAG: --help="false"
I0316 14:02:13.164371       1 flag/flags.go:64] FLAG: --kubeconfig=""
I0316 14:02:13.164375       1 flag/flags.go:64] FLAG: --kubelet-pod-resources-endpoint="unix:///var/lib/kubelet/pod-resources/kubelet.sock"
I0316 14:02:13.164380       1 flag/flags.go:64] FLAG: --loglevel="4"
I0316 14:02:13.164387       1 flag/flags.go:64] FLAG: --namespace="scylla-operator-node-tuning"
I0316 14:02:13.164391       1 flag/flags.go:64] FLAG: --node-config-name="cluster"
I0316 14:02:13.164393       1 flag/flags.go:64] FLAG: --node-config-uid="a7e98cb5-acc5-41a2-af1f-c44b48ca9f03"
I0316 14:02:13.164397       1 flag/flags.go:64] FLAG: --node-name="ip-192-168-14-249.ap-south-1.compute.internal"
I0316 14:02:13.164415       1 flag/flags.go:64] FLAG: --pod-name="cluster-node-setup-jhjpj"
I0316 14:02:13.164418       1 flag/flags.go:64] FLAG: --qps="50"
I0316 14:02:13.164424       1 flag/flags.go:64] FLAG: --scylla-image="docker.io/scylladb/scylla:5.4.0"
I0316 14:02:13.164428       1 flag/flags.go:64] FLAG: --v="4"
I0316 14:02:13.164590       1 cri/client.go:62] "Connecting to CRI endpoint" Scheme="unix" Path="/run/crio/crio.sock"
I0316 14:02:13.164628       1 cri/client.go:62] "Connecting to CRI endpoint" Scheme="unix" Path="/var/run/dockershim.sock"
I0316 14:02:13.164738       1 cri/client.go:62] "Connecting to CRI endpoint" Scheme="unix" Path="/run/containerd/containerd.sock"
I0316 14:02:15.165396       1 cri/client.go:114] "Connected to CRI endpoint" Successful=["unix:///run/containerd/containerd.sock"] Other attempts="[unix:///var/run/dockershim.sock: context deadline exceeded, unix:///run/crio/crio.sock: context deadline exceeded]"
I0316 14:02:15.183283       1 operator/nodesetupdaemon.go:213] "Can't get Node" Node="ip-192-168-14-249.ap-south-1.compute.internal" Error="Get \"https://B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com/api/v1/nodes/ip-192-168-14-249.ap-south-1.compute.internal\": dial tcp: lookup B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com on [::1]:53: read udp [::1]:54291->[::1]:53: read: connection refused"
I0316 14:02:15.194851       1 operator/nodesetupdaemon.go:213] "Can't get Node" Node="ip-192-168-14-249.ap-south-1.compute.internal" Error="Get \"https://B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com/api/v1/nodes/ip-192-168-14-249.ap-south-1.compute.internal\": dial tcp: lookup B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com on [::1]:53: read udp [::1]:58179->[::1]:53: read: connection refused"
I0316 14:02:15.246331       1 operator/nodesetupdaemon.go:213] "Can't get Node" Node="ip-192-168-14-249.ap-south-1.compute.internal" Error="Get \"https://B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com/api/v1/nodes/ip-192-168-14-249.ap-south-1.compute.internal\": dial tcp: lookup B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com on [::1]:53: read udp [::1]:51943->[::1]:53: read: connection refused"
I0316 14:02:15.500059       1 operator/nodesetupdaemon.go:213] "Can't get Node" Node="ip-192-168-14-249.ap-south-1.compute.internal" Error="Get \"https://B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com/api/v1/nodes/ip-192-168-14-249.ap-south-1.compute.internal\": dial tcp: lookup B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com on [::1]:53: read udp [::1]:45116->[::1]:53: read: connection refused"
Error: can't get node "ip-192-168-14-249.ap-south-1.compute.internal": timed out waiting for the condition
```

So it fails when calling the Kubernetes API to get the node, but the error is not about authentication; it's a TCP/UDP connection refused while resolving the API endpoint via DNS.

What did you expect to happen?

I expected the cluster-node-setup pod to succeed, the XFS filesystem to be created, and the ScyllaCluster creation to continue.

How can we reproduce it (as minimally and precisely as possible)?

Use a Karpenter-managed cluster with a NodePool that provisions i4i instances, then deploy the operator.
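A minimal sketch of the kind of NodePool I mean (names are illustrative; it assumes Karpenter's v1beta1 API and an existing EC2NodeClass named default):

```console
$ cat <<EOF | kubectl apply -f -
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: scylla-pool             # illustrative name
spec:
  template:
    spec:
      nodeClassRef:
        name: default           # assumes an existing EC2NodeClass
      requirements:
        # restrict provisioning to the i4i instance family
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["i4i"]
EOF
```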

Scylla Operator version

v1.12

Kubernetes platform name and version

```console
$ kubectl version
Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.25.16-eks-77b1e4e
WARNING: version difference between client (1.29) and server (1.25) exceeds the supported minor version skew of +/-1
```

Kubernetes platform info:

Please attach the must-gather archive.

scylla-operator-must-gather-tc7c7mp7jnt6.zip

Anything else we need to know?

No response

tnozicka commented 5 months ago

> Please attach the must-gather archive.
>
> n/a

We can't invest time into investigating your issue if you don't invest time to provide the required information.

The must-gather archive is a **mandatory** part of every bug report.

https://github.com/scylladb/scylla-operator/blob/f356138/.github/ISSUE_TEMPLATE/bug-report.yaml?plain=1#L57

zimnx commented 5 months ago

> dial tcp: lookup B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com on [::1]:53: read udp [::1]:45116->[::1]:53: read: connection refused

Looks like a firewall is blocking DNS traffic from that node, or your DNS service is down.
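One way to check this from the node (a sketch; note that the lookup in the error goes to [::1]:53, i.e. a resolver on localhost, and the hostname is the API endpoint taken from the log):

```console
# See which resolver the host is configured to use
$ cat /etc/resolv.conf

# Try resolving the API endpoint with the configured resolver
$ dig B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com

# Compare against a known-good public resolver
$ dig @1.1.1.1 B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com
```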

marcustut commented 5 months ago

> Please attach the must-gather archive.
>
> n/a
>
> We can't invest time into investigating your issue if you don't invest time to provide the required information.
>
> The must-gather archive is a **mandatory** part of every bug report.
>
> https://github.com/scylladb/scylla-operator/blob/f356138/.github/ISSUE_TEMPLATE/bug-report.yaml?plain=1#L57

Sorry, I'll get back to this later with the must-gather archive.

As for @zimnx's suggestion, I manually SSH-ed into the node and made the API call with curl; it worked fine. I also launched a debug pod with the same image, docker.io/scylladb/scylla-operator:1.12, made the API call with curl from there, and it worked fine too.
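For reference, the checks looked roughly like this (the pod name and the /version path are illustrative, the endpoint is the one from the error log, and it assumes the image ships a shell and curl):

```console
# From the node itself, after SSH-ing in:
$ curl -k https://B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com/version

# From a throwaway pod pinned to the same node:
$ kubectl run api-debug --rm -it \
    --image=docker.io/scylladb/scylla-operator:1.12 \
    --overrides='{"apiVersion":"v1","spec":{"nodeName":"ip-192-168-14-249.ap-south-1.compute.internal"}}' \
    --command -- sh -c 'curl -k https://B521A69C295412345512389.yl4.ap-south-1.eks.amazonaws.com/version'
```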

marcustut commented 5 months ago

@tnozicka @zimnx I have updated the issue with the must-gather logs

tnozicka commented 5 months ago

> This is the log file I got from the must-gather program

You need to upload the folder it created, not the log from its creation:

`"Gathering artifacts" DestDir="scylla-operator-must-gather-tc7c7mp7jnt6"`

marcustut commented 5 months ago

> This is the log file I got from the must-gather program
>
> You need to upload the folder it created, not the log from its creation:
>
> `"Gathering artifacts" DestDir="scylla-operator-must-gather-tc7c7mp7jnt6"`

Sorry, I've updated the issue with the entire archive.

zimnx commented 5 months ago

Archive is empty

marcustut commented 5 months ago

> Archive is empty

Sorry, I've reuploaded it.

zr-mah commented 5 months ago

I'm facing the exact same issue too. @zimnx May I know if you got a chance to look at it?

scylla-operator-bot[bot] commented 2 months ago

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 30d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Offer to help out

/lifecycle stale

scylla-operator-bot[bot] commented 1 month ago

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 30d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out

/lifecycle rotten

scylla-operator-bot[bot] commented 5 days ago

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 30d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out

/close not-planned

scylla-operator-bot[bot] commented 5 days ago

@scylla-operator-bot[bot]: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/scylladb/scylla-operator/issues/1839#issuecomment-2335148510):

> The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.
>
> This bot triages un-triaged issues according to the following rules:
>
> - After 30d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out
>
> /close not-planned

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.