vitobotta / hetzner-k3s

The easiest and fastest way to create and manage Kubernetes clusters in Hetzner Cloud using the lightweight distribution k3s by Rancher.
MIT License

`hetzner-k3s` unable to ssh to masters, but `openssh` client can connect no problem #443

Open lloesche opened 1 month ago

lloesche commented 1 month ago

Basically what the title says: I upgraded hetzner-k3s to 2.0.8 and tried to create a new cluster. It creates the master nodes but then can't SSH to them. When I SSH to them manually, it works just fine.

[Instance fixsaas-master3] Instance status: off
[Instance fixsaas-master3] Powering on instance (attempt 1)
[Instance fixsaas-master2] Waiting for instance to be powered on...
[Instance fixsaas-master3] Waiting for instance to be powered on...
[Instance fixsaas-master1] Instance status: off
[Instance fixsaas-master1] Powering on instance (attempt 1)
[Instance fixsaas-master1] Waiting for instance to be powered on...
[Instance fixsaas-master3] Instance status: running
[Instance fixsaas-master2] Instance status: running
[Instance fixsaas-master1] Instance status: running
[Instance fixsaas-master3] Waiting for successful ssh connectivity with instance fixsaas-master3...
[Instance fixsaas-master2] Waiting for successful ssh connectivity with instance fixsaas-master2...
[Instance fixsaas-master1] Waiting for successful ssh connectivity with instance fixsaas-master1...
[Instance fixsaas-master2] Instance fixsaas-master2 already exists, skipping create
[Instance fixsaas-master3] Instance fixsaas-master3 already exists, skipping create
[Instance fixsaas-master2] Instance status: running
[Instance fixsaas-master1] Instance fixsaas-master1 already exists, skipping create
[Instance fixsaas-master3] Instance status: running
[Instance fixsaas-master1] Instance status: running
[Instance fixsaas-master2] Waiting for successful ssh connectivity with instance fixsaas-master2...
[Instance fixsaas-master3] Waiting for successful ssh connectivity with instance fixsaas-master3...
[Instance fixsaas-master1] Waiting for successful ssh connectivity with instance fixsaas-master1...
[Instance fixsaas-master2] Instance fixsaas-master2 already exists, skipping create
[Instance fixsaas-master2] Instance status: running
[Instance fixsaas-master1] Instance fixsaas-master1 already exists, skipping create
[Instance fixsaas-master3] Instance fixsaas-master3 already exists, skipping create
[Instance fixsaas-master1] Instance status: running
[Instance fixsaas-master3] Instance status: running
[Instance fixsaas-master2] Waiting for successful ssh connectivity with instance fixsaas-master2...
[Instance fixsaas-master1] Waiting for successful ssh connectivity with instance fixsaas-master1...
[Instance fixsaas-master3] Waiting for successful ssh connectivity with instance fixsaas-master3...
Error creating instance: timeout after 00:01:00
Instance creation for fixsaas-master2 failed. Try rerunning the create command.
Error creating instance: timeout after 00:01:00
Instance creation for fixsaas-master1 failed. Try rerunning the create command.
Error creating instance: timeout after 00:01:00
Instance creation for fixsaas-master3 failed. Try rerunning the create command.
^C

It's never able to SSH to the master nodes, but if I try manually I can connect to all of them with no problem:

minimi:conf lukas$ ssh -i id_ecdsa root@188.245.122.2
The authenticity of host '188.245.122.2 (188.245.122.2)' can't be established.
ED25519 key fingerprint is SHA256:OH8Fmrk1IHcqbuFaMntaL+fvgm3cUPD0rz5SL91ltnk.
This key is not known by any other names.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '188.245.122.2' (ED25519) to the list of known hosts.
X11 forwarding request failed on channel 0
Welcome to Ubuntu 24.04.1 LTS (GNU/Linux 6.8.0-41-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/pro

 System information as of Mon Sep  9 12:43:06 PM UTC 2024

  System load:  1.19              Processes:             143
  Usage of /:   2.4% of 74.79GB   Users logged in:       0
  Memory usage: 20%               IPv4 address for eth0: 188.245.122.2
  Swap usage:   0%                IPv6 address for eth0: 2a01:4f8:1c17:5c0e::1

Expanded Security Maintenance for Applications is not enabled.

0 updates can be applied immediately.

Enable ESM Apps to receive additional future security updates.
See https://ubuntu.com/esm or run: sudo pro status

root@fixsaas-master1:~#
logout
Connection to 188.245.122.2 closed.

My config:

minimi:conf lukas$ cat hetzner-k3s.yaml
hetzner_token: xxx
cluster_name: fixsaas
kubeconfig_path: /Users/lukas/fixstrap/bcdr/conf/hetzner-k3s.kubeconfig
k3s_version: v1.29.6+k3s2
public_ssh_key_path: /Users/lukas/fixstrap/bcdr/conf/id_ecdsa.pub
private_ssh_key_path: /Users/lukas/fixstrap/bcdr/conf/id_ecdsa
use_ssh_agent: false
ssh_allowed_networks:
- 93.240.xxx/32
- 90.187.xxx/32
api_allowed_networks:
- 93.240.xxx/32
- 90.187.xxx/32
private_network_subnet: 10.0.0.0/16
schedule_workloads_on_masters: false
cluster_cidr: 10.244.0.0/16
service_cidr: 10.43.0.0/16
cluster_dns: 10.43.0.10
additional_packages:
- ntp
- lynis
- clamav
- clamav-daemon
- chkrootkit
- unattended-upgrades
- update-notifier-common
- nfs-common
- cryptsetup
- libpam-tmpdir
- apt-listchanges
- apt-show-versions
post_create_commands:
- sysctl -w 'vm.max_map_count=1024000'
- echo 'vm.max_map_count=1024000' | tee /etc/sysctl.d/60-arangodb.conf
- echo 'unattended-upgrades unattended-upgrades/enable_auto_updates boolean true'
  | debconf-set-selections
- DEBIAN_FRONTEND=noninteractive dpkg-reconfigure --priority=low unattended-upgrades
- systemctl enable clamav-freshclam
- systemctl start clamav-freshclam
- systemctl enable clamav-daemon
- systemctl start clamav-daemon
- systemctl enable unattended-upgrades
- systemctl start unattended-upgrades
- systemctl enable ntp
- systemctl start ntp
- apt-get update
- apt-get upgrade -y
- apt-get autoremove -y
- apt-get autoclean -y
cloud_controller_manager_manifest_url: https://github.com/hetznercloud/hcloud-cloud-controller-manager/releases/download/v1.20.0/ccm-networks.yaml
csi_driver_manifest_url: https://raw.githubusercontent.com/hetznercloud/csi-driver/v2.8.0/deploy/kubernetes/hcloud-csi.yml
system_upgrade_controller_manifest_url: https://raw.githubusercontent.com/rancher/system-upgrade-controller/v0.13.4/manifests/system-upgrade-controller.yaml
masters_pool:
  instance_type: ccx13
  instance_count: 3
  location: fsn1
worker_node_pools:
- name: workers
  instance_type: ccx33
  instance_count: 5
  location: fsn1
  labels:
  - key: node-role.fixcloud.io
    value: worker
- name: db
  instance_type: ccx33
  instance_count: 2
  location: fsn1
  labels:
  - key: node-role.fixcloud.io
    value: database
  taints:
  - key: node-role.fixcloud.io/dedicated
    value: database:NoSchedule
- name: jobs
  instance_type: ccx23
  instance_count: 2
  location: fsn1
  labels:
  - key: node-role.fixcloud.io
    value: jobs
  taints:
  - key: node-role.fixcloud.io/dedicated
    value: jobs:NoSchedule

Any ideas what might be going wrong?

lloesche commented 1 month ago

FYI, I also tried `use_ssh_agent: true` and added the key to my agent. Same result: it works fine from the console, but hetzner-k3s is never able to connect.

lloesche commented 1 month ago

I'm just looking at the 2.x release notes and seeing that my config doesn't match the expected format at all. Yet the tool didn't complain about any of it, which seems odd.

I didn't plan on upgrading to 2.x, but it seems 1.1.5 stopped working on network creation (it throws a 400 JSON format error).

Ironically, if I first let it create the network and servers with 2.0.8, then abort and downgrade to 1.1.5, everything works fine: it SSHes to the nodes and deploys k3s.

FWIW, this is the log after unsuccessfully running 2.0.8 and then downgrading to 1.1.5:

=== Creating infrastructure resources ===
Network already exists, skipping.
Creating firewall...done.
SSH key already exists, skipping.
Placement group fixsaas-masters already exists, skipping.
Creating placement group fixsaas-workers-1...done.
Creating placement group fixsaas-db-1...done.
Creating placement group fixsaas-jobs-1...done.
Creating server fixsaas-ccx13-master2...
Creating server fixsaas-ccx13-master3...
Creating server fixsaas-ccx33-pool-workers-worker3...
Creating server fixsaas-ccx33-pool-workers-worker2...
Creating server fixsaas-ccx33-pool-workers-worker4...
Creating server fixsaas-ccx13-master1...
Creating server fixsaas-ccx33-pool-workers-worker1...
Creating server fixsaas-ccx33-pool-db-worker1...
Creating server fixsaas-ccx33-pool-db-worker2...
Creating server fixsaas-ccx33-pool-workers-worker5...
...server fixsaas-ccx13-master3 created.
...server fixsaas-ccx33-pool-workers-worker1 created.
...server fixsaas-ccx33-pool-db-worker1 created.
...server fixsaas-ccx13-master2 created.
...server fixsaas-ccx33-pool-db-worker2 created.
...server fixsaas-ccx33-pool-workers-worker3 created.
...server fixsaas-ccx13-master1 created.
...server fixsaas-ccx33-pool-workers-worker5 created.
...server fixsaas-ccx33-pool-workers-worker2 created.
...server fixsaas-ccx33-pool-workers-worker4 created.
Server fixsaas-ccx13-master1 already exists, skipping.
Waiting for successful ssh connectivity with server fixsaas-ccx13-master1...
Server fixsaas-ccx13-master2 already exists, skipping.
Waiting for successful ssh connectivity with server fixsaas-ccx13-master2...
Server fixsaas-ccx13-master3 already exists, skipping.
Waiting for successful ssh connectivity with server fixsaas-ccx13-master3...
Server fixsaas-ccx33-pool-workers-worker1 already exists, skipping.
Waiting for successful ssh connectivity with server fixsaas-ccx33-pool-workers-worker1...
Server fixsaas-ccx33-pool-workers-worker2 already exists, skipping.
Waiting for successful ssh connectivity with server fixsaas-ccx33-pool-workers-worker2...
Server fixsaas-ccx33-pool-workers-worker3 already exists, skipping.
Waiting for successful ssh connectivity with server fixsaas-ccx33-pool-workers-worker3...
Server fixsaas-ccx33-pool-workers-worker4 already exists, skipping.
Waiting for successful ssh connectivity with server fixsaas-ccx33-pool-workers-worker4...
Server fixsaas-ccx33-pool-workers-worker5 already exists, skipping.
Waiting for successful ssh connectivity with server fixsaas-ccx33-pool-workers-worker5...
Server fixsaas-ccx33-pool-db-worker1 already exists, skipping.
Waiting for successful ssh connectivity with server fixsaas-ccx33-pool-db-worker1...
Server fixsaas-ccx33-pool-db-worker2 already exists, skipping.
Waiting for successful ssh connectivity with server fixsaas-ccx33-pool-db-worker2...
...server fixsaas-ccx13-master3 is now up.
...server fixsaas-ccx13-master2 is now up.
...server fixsaas-ccx13-master1 is now up.
...server fixsaas-ccx33-pool-workers-worker1 is now up.
...server fixsaas-ccx33-pool-workers-worker2 is now up.
...server fixsaas-ccx33-pool-db-worker1 is now up.
...server fixsaas-ccx33-pool-workers-worker3 is now up.
...server fixsaas-ccx33-pool-workers-worker5 is now up.
...server fixsaas-ccx33-pool-db-worker2 is now up.
...server fixsaas-ccx33-pool-workers-worker4 is now up.
Creating server fixsaas-ccx23-pool-jobs-worker1...
Creating server fixsaas-ccx23-pool-jobs-worker2...
...server fixsaas-ccx23-pool-jobs-worker1 created.
...server fixsaas-ccx23-pool-jobs-worker2 created.
Server fixsaas-ccx23-pool-jobs-worker1 already exists, skipping.
Waiting for successful ssh connectivity with server fixsaas-ccx23-pool-jobs-worker1...
Server fixsaas-ccx23-pool-jobs-worker2 already exists, skipping.
Waiting for successful ssh connectivity with server fixsaas-ccx23-pool-jobs-worker2...
...server fixsaas-ccx23-pool-jobs-worker1 is now up.
...server fixsaas-ccx23-pool-jobs-worker2 is now up.
Creating load balancer for API server...done.

=== Setting up Kubernetes ===
Deploying k3s to first master fixsaas-ccx13-master1...
[INFO]  Using v1.29.6+k3s2 as release
[INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.29.6+k3s2/sha256sum-amd64.txt
[INFO]  Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.29.6+k3s2/k3s
[INFO]  Verifying binary download
[INFO]  Installing k3s to /usr/local/bin/k3s
[INFO]  Skipping installation of SELinux RPM
[INFO]  Creating /usr/local/bin/kubectl symlink to k3s
[INFO]  Creating /usr/local/bin/crictl symlink to k3s
[INFO]  Creating /usr/local/bin/ctr symlink to k3s
[INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[INFO]  systemd: Enabling k3s unit
[INFO]  systemd: Starting k3s
Waiting for the control plane to be ready...

So I'd now assume that between 1.1.5 and 2.0.8 there's a regression in the way SSH works.

vitobotta commented 1 month ago

I'm just looking at the 2.x release notes and seeing that my config doesn't match the expected format at all. Yet the tool didn't complain about any of it, which seems odd.

The tool expects a YAML file, and most settings have default values, so if it doesn't find a setting in your config it just falls back to the default; it only complains if a required setting that you must specify is missing.
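
As a rough sketch of the new layout (key names as in the current 2.x config format, values reused from your old 1.x file, so treat it as a starting point rather than a complete config), the SSH-related settings that used to be top-level now live under a `networking` block, something like:

networking:
  ssh:
    port: 22
    use_agent: false                # was: use_ssh_agent
    public_key_path: /Users/lukas/fixstrap/bcdr/conf/id_ecdsa.pub    # was: public_ssh_key_path
    private_key_path: /Users/lukas/fixstrap/bcdr/conf/id_ecdsa       # was: private_ssh_key_path
  allowed_networks:
    ssh:                            # was: ssh_allowed_networks
    - 93.240.xxx/32
    - 90.187.xxx/32
    api:                            # was: api_allowed_networks
    - 93.240.xxx/32
    - 90.187.xxx/32
  private_network:
    enabled: true
    subnet: 10.0.0.0/16             # was: private_network_subnet

Other settings have been reorganized as well, so it's worth going through the 2.x docs rather than reusing the 1.x file as-is.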

I didn't plan on upgrading to 2.x, but it seems 1.1.5 stopped working on network creation (it throws a 400 JSON format error).

There have been some changes on the Hetzner side that broke parts of the API client functionality, but I didn't want to maintain the 1.x branch anymore since I can only work on this project in my free time.

Ironically, if I first let it create the network and servers with 2.0.8, then abort and downgrade to 1.1.5, everything works fine: it SSHes to the nodes and deploys k3s. So I'd now assume that between 1.1.5 and 2.0.8 there's a regression in the way SSH works.

Not that I'm aware of, and you're the first person to report this issue since the 2.x release. Can you share your current, updated configuration?

rajko135 commented 1 week ago

I am having the same issue

I modified the ssh.cr file to print the error, and it gives me the following:

Error: ERR -18: Username/PublicKey combination invalid

And this is my current configuration:

cluster_name: s2-k3s-mail-cluster
kubeconfig_path: ./kubeconfig
k3s_version: v1.30.3+k3s1

networking:
  ssh:
    port: 22
    use_agent: false
    public_key_path: "~/.ssh/id_rsa.pub"
    private_key_path: "~/.ssh/id_rsa"
  allowed_networks:
    ssh:
      - 0.0.0.0/0
    api:
      - 0.0.0.0/0
  public_networks:
    ipv4: true
    ipv6: true
  private_network:
    enabled : true
    subnet: 10.0.0.0/16
    existing_network_name: ""
  cni:
    enabled: true
    encryption: false
    mode: flannel

manifest:
  system_upgrade_controller_deployment_manifest_url: "https://github.com/rancher/system-upgrade-controller/releases/download/v0.13.4/system-upgrade-controller.yaml"
  system_upgrade_controller_crd_manifest_url: "https://github.com/rancher/system-upgrade-controller/releases/download/v0.13.4/crd.yaml"

datastore:
  mode: etcd

schedule_workloads_on_masters: false

image: debian-12

#### Cluster server groups ####
masters_pool:
  instance_type: cx22
  instance_count: 3
  location: nbg1

worker_node_pools:
- name: small-mail-pool
  instance_type: cx32
  instance_count: 3
  location: nbg1
  labels:
    - key: "node-type"
      value: "small-mail"

additional_packages:
  - open-iscsi
post_create_commands:
  - apt update
  - apt upgrade -y
  - apt autoremove -y

vitobotta commented 3 days ago

I am having the same issue

I modified the ssh.cr file to print the error, and it gives me the following:

Error: ERR -18: Username/PublicKey combination invalid

I haven't come across this one before, and it looks super weird since the user is always root by default and it should work if the SSH keys are correct. Can you SSH to the nodes manually with the same keys?

rajko135 commented 3 days ago

I can

vitobotta commented 3 days ago

Which OS are you on?

rajko135 commented 2 days ago

I have fixed my issue. In my specific situation I was building it locally and didn't have the proper version of libssh2.

My previous version was:

libssh2-1:amd64                         1.10.0-3  

I installed the latest version and now it works

vitobotta commented 2 days ago

I have fixed my issue. In my specific situation I was building it locally and didn't have the proper version of libssh2.

My previous version was:

libssh2-1:amd64                         1.10.0-3  

I installed the latest version and now it works

Glad you figured it out