vitobotta / hetzner-k3s

The easiest and fastest way to create and manage Kubernetes clusters in Hetzner Cloud using the lightweight distribution k3s by Rancher.

2.0.2 - Stuck on masters node creation. #415

Open Alilat-imad opened 3 months ago

Alilat-imad commented 3 months ago

Scenario to reproduce :

hetzner_token: <TOKEN>
cluster_name: k3s-cluster
k3s_version: v1.30.3+k3s1
kubeconfig_path: "~/.kube/config"

networking:
  ssh:
    port: 22
    use_agent: false # set to true if your key has a passphrase
    public_key_path: "~/.ssh/id_ed25519.pub"
    private_key_path: "~/.ssh/id_ed25519"
  allowed_networks:
    ssh:
      - 0.0.0.0/0
    api: # this will firewall port 6443 on the nodes; it will NOT firewall the API load balancer
      - 0.0.0.0/0
  public_network:
    ipv4: false
    ipv6: true
  private_network:
    enabled: true
    subnet: 10.0.0.0/16
    existing_network_name: "ALILAT_NETWORK"
  cni:
    enabled: true
    encryption: false
    mode: flannel

schedule_workloads_on_masters: false
embedded_registry_mirror:
  enabled: true

datastore:
  #mode: etcd # etcd (default) or external
  external_datastore_endpoint: postgresql://<USER>:<PASSWORD>@<PRIVATE_IP_POSTGRES_SERVER>:5432/k3s_cluster?sslmode=prefer

image: debian-12
autoscaling_image: debian-12

masters_pool:
 instance_type: cx22
 instance_count: 3
 location: nbg1

worker_node_pools:
- name: worker-cx32-autoscaled
  instance_type: cx32
  instance_count: 2
  location: nbg1
  autoscaling:
    enabled: true
    min_instances: 0
    max_instances: 4

post_create_commands:
 - apt update
 - apt upgrade -y
 - apt autoremove -y

After executing hetzner-k3s create --config hetzner-k3s-config.yml | tee create.log

The output:

hetzner-k3s create --config hetzner-k3s-config.yml | tee create.log
[Configuration] Validating configuration...
[Configuration] ...configuration seems valid.
[SSH key] SSH key already exists, skipping create
[Instance k3s-cluster-master2] Creating instance k3s-cluster-master2 (attempt 1)...
[Instance k3s-cluster-master3] Creating instance k3s-cluster-master3 (attempt 1)...
[Instance k3s-cluster-master1] Creating instance k3s-cluster-master1 (attempt 1)...
[Instance k3s-cluster-master3] Instance status: off
[Instance k3s-cluster-master3] Powering on instance (attempt 1)
[Instance k3s-cluster-master3] Waiting for instance to be powered on...
[Instance k3s-cluster-master2] Instance status: off
[Instance k3s-cluster-master2] Powering on instance (attempt 1)
[Instance k3s-cluster-master1] Instance status: off
[Instance k3s-cluster-master1] Powering on instance (attempt 1)
[Instance k3s-cluster-master2] Waiting for instance to be powered on...
[Instance k3s-cluster-master1] Waiting for instance to be powered on...
[Instance k3s-cluster-master3] Instance status: running
[Instance k3s-cluster-master2] Instance status: running
[Instance k3s-cluster-master1] Instance status: running
[Instance k3s-cluster-master3] Waiting for successful ssh connectivity with instance k3s-cluster-master3...
[Instance k3s-cluster-master2] Waiting for successful ssh connectivity with instance k3s-cluster-master2...
[Instance k3s-cluster-master1] Waiting for successful ssh connectivity with instance k3s-cluster-master1...
[Instance k3s-cluster-master3] Instance k3s-cluster-master3 already exists, skipping create
[Instance k3s-cluster-master1] Instance k3s-cluster-master1 already exists, skipping create
[Instance k3s-cluster-master2] Instance k3s-cluster-master2 already exists, skipping create
[Instance k3s-cluster-master3] Instance status: running
[Instance k3s-cluster-master2] Instance status: running
[Instance k3s-cluster-master1] Instance status: running
[Instance k3s-cluster-master3] Waiting for successful ssh connectivity with instance k3s-cluster-master3...
[Instance k3s-cluster-master2] Waiting for successful ssh connectivity with instance k3s-cluster-master2...
[Instance k3s-cluster-master1] Waiting for successful ssh connectivity with instance k3s-cluster-master1...
[Instance k3s-cluster-master3] Instance k3s-cluster-master3 already exists, skipping create
[Instance k3s-cluster-master1] Instance k3s-cluster-master1 already exists, skipping create
[Instance k3s-cluster-master3] Instance status: running
[Instance k3s-cluster-master1] Instance status: running
[Instance k3s-cluster-master2] Instance k3s-cluster-master2 already exists, skipping create
[Instance k3s-cluster-master2] Instance status: running
[Instance k3s-cluster-master3] Waiting for successful ssh connectivity with instance k3s-cluster-master3...
[Instance k3s-cluster-master1] Waiting for successful ssh connectivity with instance k3s-cluster-master1...
[Instance k3s-cluster-master2] Waiting for successful ssh connectivity with instance k3s-cluster-master2...
Error creating instance: timeout after 00:01:00
Instance creation for k3s-cluster-master3 failed. Try rerunning the create command.
Error creating instance: timeout after 00:01:00
Instance creation for k3s-cluster-master1 failed. Try rerunning the create command.
Error creating instance: timeout after 00:01:00
Instance creation for k3s-cluster-master2 failed. Try rerunning the create command.

And then nothing happens: no more attempts, no more output.

vitobotta commented 3 months ago

Please edit and remove your token asap if that's a valid token :) I'll read your message and reply shortly

vitobotta commented 3 months ago

You have disabled the public IPv4 interface. Are you executing these commands from a server in the same private network as the cluster? If not, that's what you need to do, or the computer from which you run the commands will not be able to reach the nodes. Also, when you disable the public interface you need some additional setup so the nodes can reach the Internet and download packages etc.: you need to set up a NAT gateway. This kind of setup is not yet described in the docs since it's less common, but for the time being you can refer to the post create commands in https://github.com/vitobotta/hetzner-k3s/discussions/385#discussioncomment-10168998 as an example. But you also need to configure the NAT gateway itself.
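
As a rough illustration only (a sketch based on that discussion, not tested here; the gateway address 10.0.0.1 and the Hetzner resolvers 185.12.64.1/185.12.64.2 are assumptions to adapt, and the NAT gateway server itself still needs IP forwarding and masquerading configured separately), the post create commands amount to something like:

post_create_commands:
  # route outbound traffic through the NAT gateway in the private network (assumed gateway address)
  - ip route add default via 10.0.0.1
  # use resolvers reachable without a public interface (assumed addresses)
  - echo "nameserver 185.12.64.1" > /etc/resolv.conf
  - echo "nameserver 185.12.64.2" >> /etc/resolv.conf
  - apt update
  - apt upgrade -y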

Alilat-imad commented 3 months ago

I previously created a server on Hetzner to install PostgreSQL. Since IPv4 came at an additional cost, I opted for IPv6 only and use it for SSH access.

So do I need to have IPv4 enabled even for the worker nodes?

Anyway, once I made the changes below, everything worked well.

Alilat-imad commented 3 months ago

Now I have another issue where I need your precious assistance.

networking:
  public_network:
    ipv4: false
    ipv6: true

I'm trying to use my Postgres instance as the k3s datastore instead of the default etcd. To do that, I created a private network containing my Postgres server, then created the k3s cluster using that same network. But it seems that the property external_datastore_endpoint is being ignored.

At first I thought I had gotten the URL wrong:

datastore:
  external_datastore_endpoint: postgres://user:password@PRIVATE_IP:5432/k3s_cluster_db

Then I fixed the URL to this:

datastore:
  external_datastore_endpoint: postgresql://user:password@POSTGRES_PRIVATE_IP:5432/k3s_cluster_db

But it didn't work (etcd is the one being used).

The next obvious verification I tried was to SSH into the master node and ping POSTGRES_PRIVATE_IP; that worked fine.

The last check I did was:

sudo apt-get install postgresql-client
psql postgresql://user:password@PRIVATE_IP:5432/k3s_cluster_db

That also worked.

vitobotta commented 3 months ago

I previously created a server on Hetzner to install PostgreSQL. Since IPv4 came at an additional cost, I opted for IPv6 only and use it for SSH access.

So do I need to have IPv4 enabled even for the worker nodes?

Anyway, once I made the changes below, everything worked well.

You can disable the public IPs if you prefer, but in order to access the cluster with hetzner-k3s you need to run it from a server in the same private network. I haven't tested with IPv6 only, to be honest.

vitobotta commented 3 months ago

Now I have another issue where I need your precious assistance.

networking:
  public_network:
    ipv4: false
    ipv6: true

I'm trying to use my Postgres instance as the k3s datastore instead of the default etcd. To do that, I created a private network containing my Postgres server, then created the k3s cluster using that same network. But it seems that the property external_datastore_endpoint is being ignored.

At first I thought I had gotten the URL wrong:

datastore:
  external_datastore_endpoint: postgres://user:password@PRIVATE_IP:5432/k3s_cluster_db

Then I fixed the URL to this:

datastore:
  external_datastore_endpoint: postgresql://user:password@POSTGRES_PRIVATE_IP:5432/k3s_cluster_db

But it didn't work (etcd is the one being used).

The next obvious verification I tried was to SSH into the master node and ping POSTGRES_PRIVATE_IP; that worked fine.

The last check I did was:

sudo apt-get install postgresql-client
psql postgresql://user:password@PRIVATE_IP:5432/k3s_cluster_db

That also worked.

You forgot to set the mode of the datastore to external:

datastore:
  mode: external
  external_datastore_endpoint: postgres://....

Alilat-imad commented 3 months ago

I was missing the mode property set to external.

My new config looks like this:

datastore:
  mode: external # etcd (default) or external
  external_datastore_endpoint: postgresql://user:password@PRIVATE_IP:5432/k3s_cluster_db

And I am getting the error below:

[Control plane] Generating the kubeconfig file to /Users/USER/.kube/config... error: no context exists with the name: "k3s-cluster-master1"

But once I switch the mode to etcd, everything works well.

vitobotta commented 3 months ago

By "Switch" do you mean that you changed datastore type on an existing cluster? If yes, that's not supported. The datastore choice is permanent for the life of the cluster.

Alilat-imad commented 3 months ago

Indeed, I changed the datastore type before the creation of the cluster. I appreciate your guidance on this matter.

I've done multiple creations/deletions and the result is the same; the configuration below doesn't work:

datastore:
  mode: external 
  external_datastore_endpoint: postgresql://user:password@PRIVATE_IP:5432/k3s_cluster_db

Here is the error I got:

[Instance k3s-cluster-master1] [INFO]  Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.30.3+k3s1/k3s
[Instance k3s-cluster-master1] [INFO]  Verifying binary download
[Instance k3s-cluster-master1] [INFO]  Installing k3s to /usr/local/bin/k3s
[Instance k3s-cluster-master1] [INFO]  Skipping installation of SELinux RPM
[Instance k3s-cluster-master1] [INFO]  Creating /usr/local/bin/kubectl symlink to k3s
[Instance k3s-cluster-master1] [INFO]  Creating /usr/local/bin/crictl symlink to k3s
[Instance k3s-cluster-master1] [INFO]  Creating /usr/local/bin/ctr symlink to k3s
[Instance k3s-cluster-master1] [INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[Instance k3s-cluster-master1] [INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[Instance k3s-cluster-master1] [INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[Instance k3s-cluster-master1] [INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[Instance k3s-cluster-master1] [INFO]  systemd: Enabling k3s unit
[Instance k3s-cluster-master1] [INFO]  systemd: Starting k3s
[Instance k3s-cluster-master1] Waiting for the control plane to be ready...
[Control plane] Generating the kubeconfig file to /Users/username/.kube/config...
error: no context exists with the name: "k3s-cluster-master1"
[Control plane] ...kubeconfig file generated as /Users/username/.kube/config.
Unhandled exception in spawn: timeout after 00:00:30 (Tasker::Timeout)
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.2/bin/hetzner-k3s in 'raise<Tasker::Timeout>:NoReturn'
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.2/bin/hetzner-k3s in 'Tasker@Tasker::Methods::timeout<Time::Span, &Proc(Nil)>:Nil'
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.2/bin/hetzner-k3s in '~procProc(Nil)@src/cluster/create.cr:75'
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.2/bin/hetzner-k3s in 'Fiber#run:(IO::FileDescriptor | Nil)'

Then I thought maybe there was an issue with kubeconfig_path: "~/.kube/config", so I switched it to kubeconfig_path: "./kubeconfig", but I got the same error:

error: no context exists with the name: "k3s-cluster-master1"
[Control plane] ...kubeconfig file generated as /Users/username/Projects/side-projects/infra/kubeconfig.
Unhandled exception in spawn: timeout after 00:00:30 (Tasker::Timeout)
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.2/bin/hetzner-k3s in 'raise<Tasker::Timeout>:NoReturn'
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.2/bin/hetzner-k3s in 'Tasker@Tasker::Methods::timeout<Time::Span, &Proc(Nil)>:Nil'
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.2/bin/hetzner-k3s in '~procProc(Nil)@src/cluster/create.cr:75'
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.2/bin/hetzner-k3s in 'Fiber#run:(IO::FileDescriptor | Nil)'

Alilat-imad commented 3 months ago

I previously created a server on Hetzner to install PostgreSQL. Since IPv4 came at an additional cost, I opted for IPv6 only and use it for SSH access. So do I need to have IPv4 enabled even for the worker nodes? Anyway, once I made the changes below, everything worked well.

You can disable the public IPs if you prefer, but in order to access the cluster with hetzner-k3s you need to run it from a server in the same private network. I haven't tested with IPv6 only, to be honest.

I've tried it, but it gets stuck during creation:

[Configuration] Validating configuration...
[Configuration] ...configuration seems valid.
[SSH key] SSH key already exists, skipping create
[Placement groups] Creating placement group k3s-cluster-masters...
[Placement groups] ...placement group k3s-cluster-masters created
[Instance k3s-cluster-master1] Creating instance k3s-cluster-master1 (attempt 1)...
[Instance k3s-cluster-master1] Instance status: off
[Instance k3s-cluster-master1] Powering on instance (attempt 1)
[Instance k3s-cluster-master1] Waiting for instance to be powered on...
[Instance k3s-cluster-master1] Instance status: running
[Instance k3s-cluster-master1] Waiting for successful ssh connectivity with instance k3s-cluster-master1...
[Instance k3s-cluster-master1] Instance k3s-cluster-master1 already exists, skipping create
[Instance k3s-cluster-master1] Instance status: running
[Instance k3s-cluster-master1] Waiting for successful ssh connectivity with instance k3s-cluster-master1...
[Instance k3s-cluster-master1] Instance k3s-cluster-master1 already exists, skipping create
[Instance k3s-cluster-master1] Instance status: running
[Instance k3s-cluster-master1] Waiting for successful ssh connectivity with instance k3s-cluster-master1...
Error creating instance: timeout after 00:01:00
Instance creation for k3s-cluster-master1 failed. Try rerunning the create command.

vitobotta commented 3 months ago

Indeed, I changed the datastore type before the creation of the cluster. I appreciate your guidance on this matter.

I've done multiple creations/deletions and the result is the same; the configuration below doesn't work:

datastore:
  mode: external 
  external_datastore_endpoint: postgresql://user:password@PRIVATE_IP:5432/k3s_cluster_db

Here is the error I got:

[Instance k3s-cluster-master1] [INFO]  Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.30.3+k3s1/k3s
[Instance k3s-cluster-master1] [INFO]  Verifying binary download
[Instance k3s-cluster-master1] [INFO]  Installing k3s to /usr/local/bin/k3s
[Instance k3s-cluster-master1] [INFO]  Skipping installation of SELinux RPM
[Instance k3s-cluster-master1] [INFO]  Creating /usr/local/bin/kubectl symlink to k3s
[Instance k3s-cluster-master1] [INFO]  Creating /usr/local/bin/crictl symlink to k3s
[Instance k3s-cluster-master1] [INFO]  Creating /usr/local/bin/ctr symlink to k3s
[Instance k3s-cluster-master1] [INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[Instance k3s-cluster-master1] [INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[Instance k3s-cluster-master1] [INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[Instance k3s-cluster-master1] [INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[Instance k3s-cluster-master1] [INFO]  systemd: Enabling k3s unit
[Instance k3s-cluster-master1] [INFO]  systemd: Starting k3s
[Instance k3s-cluster-master1] Waiting for the control plane to be ready...
[Control plane] Generating the kubeconfig file to /Users/username/.kube/config...
error: no context exists with the name: "k3s-cluster-master1"
[Control plane] ...kubeconfig file generated as /Users/username/.kube/config.
Unhandled exception in spawn: timeout after 00:00:30 (Tasker::Timeout)
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.2/bin/hetzner-k3s in 'raise<Tasker::Timeout>:NoReturn'
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.2/bin/hetzner-k3s in 'Tasker@Tasker::Methods::timeout<Time::Span, &Proc(Nil)>:Nil'
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.2/bin/hetzner-k3s in '~procProc(Nil)@src/cluster/create.cr:75'
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.2/bin/hetzner-k3s in 'Fiber#run:(IO::FileDescriptor | Nil)'

Then I thought maybe there was an issue with kubeconfig_path: "~/.kube/config", so I switched it to kubeconfig_path: "./kubeconfig", but I got the same error:

error: no context exists with the name: "k3s-cluster-master1"
[Control plane] ...kubeconfig file generated as /Users/username/Projects/side-projects/infra/kubeconfig.
Unhandled exception in spawn: timeout after 00:00:30 (Tasker::Timeout)
 from /opt/homebrew/Cellar/hetzner_k3s/2.0.2/bin/hetzner-k3s in 'raise<Tasker::Timeout>:NoReturn'
 from /opt/homebrew/Cellar/hetzner_k3s/2.0.2/bin/hetzner-k3s in 'Tasker@Tasker::Methods::timeout<Time::Span, &Proc(Nil)>:Nil'
 from /opt/homebrew/Cellar/hetzner_k3s/2.0.2/bin/hetzner-k3s in '~procProc(Nil)@src/cluster/create.cr:75'
 from /opt/homebrew/Cellar/hetzner_k3s/2.0.2/bin/hetzner-k3s in 'Fiber#run:(IO::FileDescriptor | Nil)'

Do you see any data in the pg database?
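
For example, something like the checks below (same connection string as before) should tell; if I remember correctly, k3s talks to an external SQL datastore through kine, which creates a table named kine:

psql postgresql://user:password@PRIVATE_IP:5432/k3s_cluster_db -c '\dt'
psql postgresql://user:password@PRIVATE_IP:5432/k3s_cluster_db -c 'SELECT count(*) FROM kine;'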

Alilat-imad commented 3 months ago

I've just checked pgAdmin; the k3s_cluster db has no tables in it.

vitobotta commented 3 months ago

And you can connect with psql to postgresql://user:password@PRIVATE_IP:5432/k3s_cluster_db?

Alilat-imad commented 3 months ago

Yes, I confirm it does work.

vitobotta commented 3 months ago

I was planning to set up a test cluster with a Postgres db tonight, but I just finished working and it's 1 AM, so I'll have to postpone this to tomorrow night.

vitobotta commented 3 months ago

Hi, I could try now to figure out the problem you're having, but it would save me some time if you could describe step by step what you have done: how you set up the existing network, Internet access from the servers, and the Postgres server. Any detail you can provide may help with the investigation, because I don't see any problems with a regular setup using a Postgres database, so it's perhaps something I am missing from your setup.

quorak commented 2 months ago

I had the same issue with version v1.26.9+k3s1.

Thank you for this thread; I could solve it by adding this to the config:

embedded_registry_mirror:
  enabled: false

The next failure, on master updates, was:

[System Upgrade Controller] deployment.apps/system-upgrade-controller configured
The ClusterRoleBinding "system-upgrade" is invalid: roleRef: Invalid value: rbac.RoleRef{APIGroup:"rbac.authorization.k8s.io", Kind:"ClusterRole", Name:"system-upgrade-controller"}: cannot change roleRef
[System Upgrade Controller] : The ClusterRoleBinding "system-upgrade" is invalid: roleRef: Invalid value: rbac.RoleRef{APIGroup:"rbac.authorization.k8s.io", Kind:"ClusterRole", Name:"system-upgrade-controller"}: cannot change roleRef

But I could fix it with: kubectl delete clusterrolebinding system-upgrade
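
In full, the workaround was roughly this (the config file name is just an example):

# remove the binding with the old roleRef so it can be recreated cleanly
kubectl delete clusterrolebinding system-upgrade
# re-run the command so hetzner-k3s reapplies the System Upgrade Controller manifests
hetzner-k3s create --config hetzner-k3s-config.yml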

Thank you so much for the great work. Consider linking the upgrade notes in the README; it took me a while to find them and I nearly destroyed my cluster :D

quorak commented 2 months ago

Unfortunately, new servers now seem unable to connect to the network, e.g.:

(combined from similar events): Could not create route b5d2cae7-4afb-486a-b9a3-d35f12bd2a1a 10.244.4.0/24 for node cl11-pool-cpx41-worker1 after 287.960811ms: hcloud/CreateRoute: hcops/AllServersCache.ByName: cl11-pool-cpx41-worker1 hcops/AllServersCache.getCache: not found

vitobotta commented 2 months ago

I had the same issue with version v1.26.9+k3s1.

Thank you for this thread; I could solve it by adding this to the config:

embedded_registry_mirror:
  enabled: false

The next failure, on master updates, was:

[System Upgrade Controller] deployment.apps/system-upgrade-controller configured
The ClusterRoleBinding "system-upgrade" is invalid: roleRef: Invalid value: rbac.RoleRef{APIGroup:"rbac.authorization.k8s.io", Kind:"ClusterRole", Name:"system-upgrade-controller"}: cannot change roleRef
[System Upgrade Controller] : The ClusterRoleBinding "system-upgrade" is invalid: roleRef: Invalid value: rbac.RoleRef{APIGroup:"rbac.authorization.k8s.io", Kind:"ClusterRole", Name:"system-upgrade-controller"}: cannot change roleRef

But I could fix it with: kubectl delete clusterrolebinding system-upgrade

Thank you so much for the great work. Consider linking the upgrade notes in the README; it took me a while to find them and I nearly destroyed my cluster :D

I'll add an "Upgrading page" when I have a bit of time (or you could make a PR? :)) But the upgrade instructions are defined in the 2.0.0 release notes and linked to in the following minor releases so who is upgrading should see them easily.

vitobotta commented 2 months ago

Unfortunately, new servers now seem unable to connect to the network, e.g.:

(combined from similar events): Could not create route b5d2cae7-4afb-486a-b9a3-d35f12bd2a1a 10.244.4.0/24 for node cl11-pool-cpx41-worker1 after 287.960811ms: hcloud/CreateRoute: hcops/AllServersCache.ByName: cl11-pool-cpx41-worker1 hcops/AllServersCache.getCache: not found

Please open a separate issue with the details including your config file.

quorak commented 2 months ago

Thank you, I got it fixed: the hostname and the name of the server in the Hetzner console did not match. But this was likely my own error of setting include_instance_type_in_instance_name too late.
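
For anyone else hitting this, the option I mean is the include_instance_type_in_instance_name flag in the hetzner-k3s config file; if I remember correctly it sits at the top level, something like:

include_instance_type_in_instance_name: true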

vitobotta commented 3 weeks ago

Thank you, I got it fixed: the hostname and the name of the server in the Hetzner console did not match. But this was likely my own error of setting include_instance_type_in_instance_name too late.

I guess we can close this issue then? :)