kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Hetzner arm nodes not joining cluster consistently #16491

Closed · MTRNord closed this issue 1 month ago

MTRNord commented 7 months ago

/kind bug

1. What kops version are you running? The command kops version will display this information. Client version: 1.29.0-beta.1 (git-v1.29.0-beta.1-154-g87a0483ca3)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.6

3. What cloud provider are you using? Hetzner

4. What commands did you run? What is the simplest way to reproduce this issue? kops create cluster --name=cluster-example.k8s.local --ssh-public-key=~/.ssh/id_ed25519.pub --cloud=hetzner --zones=hel1 --networking=cilium --network-cidr=10.10.0.0/16 --node-count=2 --control-plane-count=3 --control-plane-zones=hel1,fsn1 --node-size=cax21 --control-plane-size cax11

5. What happened after the commands executed? All nodes and resources are created; however, validation fails. One node only joined after 3 recreations. The other one doesn't join at all:

⬢ [fedora-toolbox:39] ❯ kops validate cluster --wait 10m
I0424 21:34:40.295844 1619707 featureflag.go:168] FeatureFlag "Scaleway"=true
Validating cluster cluster-example.k8s.local

INSTANCE GROUPS
NAME                  ROLE          MACHINETYPE  MIN  MAX  SUBNETS
control-plane-fsn1-1  ControlPlane  cax11        1    1    fsn1
control-plane-hel1-1  ControlPlane  cax11        1    1    hel1
control-plane-hel1-2  ControlPlane  cax11        1    1    hel1
nodes-hel1            Node          cax21        2    2    hel1

NODE STATUS
NAME                                   ROLE           READY
control-plane-fsn1-1-4c0c2fca48e4d3ea  control-plane  True
control-plane-hel1-1-4d7606e4b08b2273  control-plane  True
control-plane-hel1-2-67426331b523d69c  control-plane  True
nodes-hel1-32dbe2a7d622155d            node           True

VALIDATION ERRORS
KIND     NAME      MESSAGE
Machine  46484806  machine "46484806" has not yet joined cluster

Validation Failed

6. What did you expect to happen?

All nodes join

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2024-04-24T19:07:04Z"
  generation: 2
  name: cluster-example.k8s.local
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  certManager:
    enabled: true
  channel: stable
  cloudProvider: hetzner
  configBase: scw://kops-cluster-example/cluster-example.k8s.local
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: control-plane-hel1-1
      name: hel1-1
    - instanceGroup: control-plane-fsn1-1
      name: fsn1-1
    - instanceGroup: control-plane-hel1-2
      name: hel1-2
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: control-plane-hel1-1
      name: hel1-1
    - instanceGroup: control-plane-fsn1-1
      name: fsn1-1
    - instanceGroup: control-plane-hel1-2
      name: hel1-2
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeDNS:
    nodeLocalDNS:
      cpuRequest: 25m
      enabled: true
      memoryRequest: 5Mi
    provider: CoreDNS
  kubeProxy:
    enabled: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.28.6
  metricsServer:
    enabled: true
  networkCIDR: 10.10.0.0/16
  networking:
    cilium:
      enableNodePort: true
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  - ::/0
  sshKeyName: username@example.com
  subnets:
  - name: fsn1
    type: Public
    zone: fsn1
  - name: hel1
    type: Public
    zone: hel1
  topology:
    dns:
      type: None

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-04-24T19:07:05Z"
  labels:
    kops.k8s.io/cluster: cluster-example.k8s.local
  name: control-plane-fsn1-1
spec:
  image: ubuntu-22.04
  machineType: cax11
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - fsn1

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-04-24T19:07:05Z"
  labels:
    kops.k8s.io/cluster: cluster-example.k8s.local
  name: control-plane-hel1-1
spec:
  image: ubuntu-22.04
  machineType: cax11
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - hel1

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-04-24T19:07:05Z"
  labels:
    kops.k8s.io/cluster: cluster-example.k8s.local
  name: control-plane-hel1-2
spec:
  image: ubuntu-22.04
  machineType: cax11
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - hel1

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-04-24T19:07:06Z"
  labels:
    kops.k8s.io/cluster: cluster-example.k8s.local
  name: nodes-hel1
spec:
  image: ubuntu-22.04
  machineType: cax21
  maxSize: 2
  minSize: 2
  role: Node
  subnets:
  - hel1

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

Additionally, the SSH key does not seem to get applied. Trying to SSH in only yields a password prompt; the SSH key doesn't get accepted.

This was tried well beyond the 10m mark.

MTRNord commented 7 months ago

Seems like after the 4th full delete and recreate it works again. I wonder if this is related to #15806.

hakman commented 7 months ago

> Seems like after the 4th full delete and recreate it works again. I wonder if this is related to #15806.

Only one way to find out. Please check the kops-configuration logs on failed nodes. Also, I did not test the --zones=hel1,fsn part so not sure if it works with 2 regions.
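
For reference, a quick way to pull those logs on an affected node is the systemd journal (a sketch; the unit names assume a standard kOps nodeup install):

# on the node that failed to join
journalctl -u kops-configuration.service --no-pager
# and, once nodeup has finished, the kubelet side
journalctl -u kubelet.service --no-pager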

MTRNord commented 7 months ago

I will have a look when it fails again. It took me some time to realise that the user to connect with is not ubuntu but root on the Hetzner instances.
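
For example (a sketch; the address is a placeholder for the node's public IP):

ssh -i ~/.ssh/id_ed25519 root@<node-public-ip>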

> Also, I did not test the --zones=hel1,fsn part so not sure if it works with 2 regions.

For the nodes it fails with a hard error, but for the control plane it seems to work just fine. No errors or issues as far as I was able to tell so far. All servers spawn and Kubernetes says everything is happy. I have had no workload on the cluster yet, though, so it might have bugs I didn't see, but I doubt there are any.

MTRNord commented 7 months ago

This time it's a control-plane node. It seems to fail on this:

Apr 27 21:29:16 control-plane-fsn1-5c11fa08140f0e98 nodeup[1091]: I0427 21:29:16.539628    1091 files.go:136] Hash did not match for "/var/cache/nodeup/sha256:525e2b62ba92a1b6f3dc9612449a84aa61652e680f7ebf4eff579795fe464b57_cni-plugins-linux-arm64-v1_2_0_tgz": actual=sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 vs expected=sha256:525e2b62ba92a1b6f3dc9612449a84aa61652e680f7ebf4eff579795fe464b57
Apr 27 21:29:16 control-plane-fsn1-5c11fa08140f0e98 nodeup[1091]: I0427 21:29:16.539684    1091 http.go:82] Downloading "https://storage.googleapis.com/k8s-artifacts-cni/release/v1.2.0/cni-plugins-linux-arm64-v1.2.0.tgz"
Apr 27 21:29:16 control-plane-fsn1-5c11fa08140f0e98 nodeup[1091]: W0427 21:29:16.747301    1091 assetstore.go:251] error downloading url "https://storage.googleapis.com/k8s-artifacts-cni/release/v1.2.0/cni-plugins-linux-arm64-v1.2.0.tgz": error response from "https://storage.googleapis.com/k8s-artifacts-cni/release/v1.2.0/cni-plugins-linux-arm64-v1.2.0.tgz": HTTP 403
Apr 27 21:29:16 control-plane-fsn1-5c11fa08140f0e98 nodeup[1091]: W0427 21:29:16.747362    1091 main.go:133] got error running nodeup (will retry in 30s): error adding asset "525e2b62ba92a1b6f3dc9612449a84aa61652e680f7ebf4eff579795fe464b57@https://storage.googleapis.com/k8s-artifacts-cni/release/v1.2.0/cni-plugins-linux-arm64-v1.2.0.tgz": error response from "https://storage.googleapis.com/k8s-artifacts-cni/release/v1.2.0/cni-plugins-linux-arm64-v1.2.0.tgz": HTTP 403

The server responds with <?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message><Details>We're sorry, but this service is not available in your location</Details></Error>, which means this is the same bug as #15806: Hetzner has IPs which MaxMind sadly recognises as being in Iran despite not being there. (I dealt with this before with Docker and it was a huge hassle to get them to update the IP entry. It took multiple explanations to get that fixed.)
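
The geo-block is easy to confirm from the affected node itself (a sketch; the URL is the same one from the nodeup log above):

# run on the node; a blocked IP gets an HTTP 403 instead of 200
curl -sI https://storage.googleapis.com/k8s-artifacts-cni/release/v1.2.0/cni-plugins-linux-arm64-v1.2.0.tgz | head -n 1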

TL;DR: As a workaround, deleting the affected server and running an update to re-initialise the rest might be the easiest option here.
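
In kops terms that would be roughly the following (a sketch; the instance ID is the one from the validation output above and the cluster name is the redacted example one):

kops delete instance 46484806 --yes
kops update cluster --name cluster-example.k8s.local --yes
kops validate cluster --wait 10m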

hakman commented 7 months ago

> TL;DR: As a workaround, deleting the affected server and running an update to re-initialise the rest might be the easiest option here.

That is pretty much the path of least resistance. You may also want to take a look at another issue for some suggestions: https://github.com/kubernetes/kops/issues/16466#issuecomment-2063553896.
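
One general option for sidestepping geo-blocked downloads is to point the cluster at an alternative file repository for assets and mirror the files somewhere reachable from Hetzner (a sketch; the mirror URL is a placeholder, and the change only takes effect after kops update cluster):

spec:
  assets:
    fileRepository: https://files.example.com/kops-assets/

kOps can then copy the required files to the mirror with kops get assets --copy.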

rehashedsalt commented 6 months ago

Be sure to get in touch with Hetzner via a support ticket if you get bitten by a blocked IP. The best odds we have of those IPs no longer being blackholed by Google are if Hetzner reaches out to them to see what the deal is.

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/kops/issues/16491#issuecomment-2397875899):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.