Closed: bulnv closed this issue 1 year ago.
The machine is sometimes reachable from the local Hetzner network, so I was able to get some logs. k3s-agent is not installed. Here is what I found in the cloud-init-output log; I'm only posting the errors so far because the full log is huge.
2023-03-06 15:32:41,671 - util.py[WARNING]: failed stage init-local
failed run of stage init-local
------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python3.10/site-packages/cloudinit/util.py", line 1608, in chownbyname
uid = pwd.getpwnam(user).pw_uid
KeyError: "getpwnam(): name not found: 'systemd-network'"
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.10/site-packages/cloudinit/cmd/main.py", line 767, in status_wrapper
ret = functor(name, args)
File "/usr/lib/python3.10/site-packages/cloudinit/cmd/main.py", line 433, in main_init
init.apply_network_config(bring_up=bring_up_interfaces)
File "/usr/lib/python3.10/site-packages/cloudinit/stages.py", line 939, in apply_network_config
return self.distro.apply_network_config(
File "/usr/lib/python3.10/site-packages/cloudinit/distros/__init__.py", line 278, in apply_network_config
self._write_network_state(network_state, renderer)
File "/usr/lib/python3.10/site-packages/cloudinit/distros/__init__.py", line 167, in _write_network_state
renderer.render_network_state(network_state)
File "/usr/lib/python3.10/site-packages/cloudinit/net/networkd.py", line 306, in render_network_state
self.create_network_file(k, v, network_dir)
File "/usr/lib/python3.10/site-packages/cloudinit/net/networkd.py", line 290, in create_network_file
util.chownbyname(net_fn, net_fn_owner, net_fn_owner)
File "/usr/lib/python3.10/site-packages/cloudinit/util.py", line 1612, in chownbyname
raise OSError("Unknown user or group: %s" % (e)) from e
OSError: Unknown user or group: "getpwnam(): name not found: 'systemd-network'"
sed: can't read /etc/sysconfig/network/config: No such file or directory
sed: can't read /etc/sysconfig/network/dhcp: No such file or directory
sed: can't read /etc/sysconfig/network/config: No such file or directory
+ curl -sfL https://get.k3s.io
+ INSTALL_K3S_SKIP_START=true
+ INSTALL_K3S_SKIP_SELINUX_RPM=true
+ INSTALL_K3S_CHANNEL=v1.25
+ INSTALL_K3S_EXEC=agent
+ sh -
+ /sbin/semodule -v -i /usr/share/selinux/packages/k3s.pp
Attempting to install module '/usr/share/selinux/packages/k3s.pp':
Ok: return value of 0.
Committing changes:
Ok: transaction number 7.
Failed to start k3s-agent.service: Unit k3s-agent.service not found.
2023-03-15 21:38:04,372 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2023-03-15 21:38:04,373 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3.10/site-packages/cloudinit/config/cc_scripts_user.py'>) failed
Cloud-init v. 23.1-1.1 finished at Wed, 15 Mar 2023 21:38:04 +0000. Datasource DataSourceHetzner. Up 50.61 seconds
@bulnv This is caused by an old bug. Here's what to try (basically recreate the snapshot used by the autoscaler after upgrading):
1. Before anything, delete all failed autoscaled nodes with the hcloud CLI or the Hetzner UI.
2. Upgrade to the latest release by removing the version attribute from your kube.tf (if you have it) and running terraform init -upgrade.
3. Make sure your control plane is HA, with at least 3 nodes; change and apply your config if necessary to make it so.
4. Use terraform state list to find the name of your first control plane (or look in your terraform.tfstate file). It usually starts with "0-0-control-plane-*"; you also need to find its node name, the name used in k3s and hcloud.
5. Drain that node with kubectl drain <first-control-plane-node-name>.
6. Destroy that node with terraform destroy -target module.kube-hetzner.module.control_planes["0-0-control-plane-*"].hcloud_server.server.
7. Now do the same for the snapshot: terraform destroy -target module.kube-hetzner.hcloud_snapshot.autoscaler_image[0].
8. Run terraform plan; you should see that it wants to create the first control plane node and the snapshot again, which is what we want.
9. Terraform apply.
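For reference, here is the same sequence condensed into shell commands (a sketch: the node names and the kubectl drain flags are illustrative additions; take the real names from terraform state list, kubectl get nodes, and hcloud server list):

```sh
# Remove any failed autoscaled nodes first (name is an example)
hcloud server delete <failed-autoscaled-node>

# Upgrade the module after removing the pinned version from kube.tf
terraform init -upgrade

# Find the first control plane's Terraform address and its k3s/hcloud node name
terraform state list | grep control_planes
kubectl get nodes

# Drain it, then destroy only that server and the autoscaler snapshot
# (quotes keep the shell from expanding the brackets)
kubectl drain <first-control-plane-node-name> --ignore-daemonsets --delete-emptydir-data
terraform destroy -target 'module.kube-hetzner.module.control_planes["0-0-control-plane-*"].hcloud_server.server'
terraform destroy -target 'module.kube-hetzner.hcloud_snapshot.autoscaler_image[0]'

# The plan should now want to recreate the first control plane and the snapshot
terraform plan
terraform apply
```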
@mysticaltech thank you so much for a detailed answer (as usual). Everything seems clear. Will try it out.
@mysticaltech just tried what you suggested and have some updates. I've successfully scaled the masters up to 3 nodes. But when I try to destroy the first node, the plan wants to destroy 29 resources, including the null resources for all the nodes, the load balancer, and all my Helm releases. So I guess this terraform action could just blow up the cluster. So here is what I did next:
Looking forward to hearing any ideas from you.
Only one thing: I am still on the relatively fresh version "1.9.8". Does this matter?
@bulnv If you scaled successfully, that's all that matters! Just recreating the snapshot was my first advice in a similar issue.
No no =)), I mean the problem persists even with a different snapshot, freshly taken from a different master!
So running that terraform destroy -target module.kube-hetzner.hcloud_snapshot.autoscaler_image[0] and recreating it via terraform did not work correctly?
Also, 1.9.8 is way too old, that's the reason! Please update to 1.10.3. (1.10.4 has a small issue I will fix later today).
I haven't tried that (see above) because it would destroy a huge part of the cluster, including the load balancer and the Helm releases. I can't take that chance.
As for updating to 1.10.3: OK, will try it out.
Impossible, that will just delete the autoscaler image! Please try again, or post the plan to prove me wrong haha.
But upgrade to 1.10.3 please!
Change the version and then terraform init -upgrade.
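To double-check before destroying anything, Terraform can also show a destroy-scoped plan for just that address; a minimal sketch (the -destroy planning mode and -target flag are standard Terraform, the address is the one from this thread):

```sh
# Show, without applying anything, exactly what a targeted destroy of the snapshot would remove
terraform plan -destroy -target 'module.kube-hetzner.hcloud_snapshot.autoscaler_image[0]'
```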
Sure thing, on my way
@mysticaltech sorry for the confusion. I've managed to leave my Helm releases untouched; here is the actual plan. Does it look safe, and is it going to remove only one master node?
```
Terraform will perform the following actions:
# local_file.kubeconfig will be destroyed
- resource "local_file" "kubeconfig" {
- content = (sensitive value) -> null
- directory_permission = "0777" -> null
- file_permission = "0777" -> null
- filename = "/home/nbuashev/.kube/hetzner" -> null
- id = "aa5700bc5d5073740709b19c5974be5645d206c1" -> null
}
# module.kube-hetzner.hcloud_snapshot.autoscaler_image[0] will be destroyed
- resource "hcloud_snapshot" "autoscaler_image" {
- description = "Initial snapshot used for autoscaler" -> null
- id = "103210203" -> null
- labels = {
- "autoscaler" = "true"
- "cluster" = "botbiz-production-k8s"
- "engine" = "k3s"
- "provisioner" = "terraform"
} -> null
- server_id = 29666577 -> null
}
# module.kube-hetzner.local_file.kustomization_backup[0] will be destroyed
- resource "local_file" "kustomization_backup" {
- content = <<-EOT
"apiVersion": "kustomize.config.k8s.io/v1beta1"
"kind": "Kustomization"
"patchesStrategicMerge":
- |
apiVersion: apps/v1
kind: Deployment
metadata:
name: system-upgrade-controller
namespace: system-upgrade
spec:
template:
spec:
containers:
- name: system-upgrade-controller
volumeMounts:
- name: ca-certificates
mountPath: /var/lib/ca-certificates
volumes:
- name: ca-certificates
hostPath:
path: /var/lib/ca-certificates
type: Directory
- "kured.yaml"
- "ccm.yaml"
- "calico.yaml"
"resources":
- "https://github.com/hetznercloud/hcloud-cloud-controller-manager/releases/download/v1.14.1/ccm-networks.yaml"
- "https://github.com/weaveworks/kured/releases/download/1.12.2/kured-1.12.2-dockerhub.yaml"
- "https://raw.githubusercontent.com/rancher/system-upgrade-controller/master/manifests/system-upgrade-controller.yaml"
- "hcloud-csi.yml"
- "traefik_ingress.yaml"
- "https://raw.githubusercontent.com/projectcalico/calico/v3.25.0/manifests/calico.yaml"
- "cert_manager.yaml"
EOT -> null
- content_base64sha256 = "IfadrAeHqYq5kmPHEYPl2ja5LLhrkM+BNPqUZm3MYuE=" -> null
- content_base64sha512 = "DHwgzdaNht95Zb2kiBKCoS0kqyKBsg/kId3CmXmL1LWhysrjFMe7ReuhCUsokJq5IbSFFcfVe787uiQ2OsvM5Q==" -> null
- content_md5 = "f1b53f4455e272a9104a45e823ded9fe" -> null
- content_sha1 = "bc9acc1c97b26d8f668f5134913c57e63d3eec48" -> null
- content_sha256 = "21f69dac0787a98ab99263c71183e5da36b92cb86b90cf8134fa94666dcc62e1" -> null
- content_sha512 = "0c7c20cdd68d86df7965bda4881282a12d24ab2281b20fe421ddc299798bd4b5a1cacae314c7bb45eba1094b28909ab921b48515c7d57bbf3bba24363acbcce5" -> null
- directory_permission = "0777" -> null
- file_permission = "600" -> null
- filename = "botbiz-production-k8s_kustomization_backup.yaml" -> null
- id = "bc9acc1c97b26d8f668f5134913c57e63d3eec48" -> null
}
# module.kube-hetzner.local_sensitive_file.kubeconfig[0] will be destroyed
- resource "local_sensitive_file" "kubeconfig" {
- content = (sensitive value)
- directory_permission = "0700" -> null
- file_permission = "600" -> null
- filename = "botbiz-production-k8s_kubeconfig.yaml" -> null
- id = "aa5700bc5d5073740709b19c5974be5645d206c1" -> null
}
# module.kube-hetzner.null_resource.agents["0-0-agent"] will be destroyed
- resource "null_resource" "agents" {
- id = "3012830062625593000" -> null
- triggers = {
- "agent_id" = "29666578"
} -> null
}
# module.kube-hetzner.null_resource.agents["0-1-agent"] will be destroyed
- resource "null_resource" "agents" {
- id = "5793387825081006408" -> null
- triggers = {
- "agent_id" = "29666576"
} -> null
}
# module.kube-hetzner.null_resource.agents["0-2-agent"] will be destroyed
- resource "null_resource" "agents" {
- id = "2083069504276235938" -> null
- triggers = {
- "agent_id" = "29915387"
} -> null
}
# module.kube-hetzner.null_resource.agents["0-3-agent"] will be destroyed
- resource "null_resource" "agents" {
- id = "4994720741323820131" -> null
- triggers = {
- "agent_id" = "29915384"
} -> null
}
# module.kube-hetzner.null_resource.configure_autoscaler[0] will be destroyed
- resource "null_resource" "configure_autoscaler" {
- id = "5520672443837788890" -> null
- triggers = {
- "template" = >>>EOT
EOT
} -> null
}
# module.kube-hetzner.null_resource.control_planes["0-0-control-plane"] will be destroyed
- resource "null_resource" "control_planes" {
- id = "7282537881054730684" -> null
- triggers = {
- "control_plane_id" = "29666577"
} -> null
}
# module.kube-hetzner.null_resource.control_planes["0-1-control-plane"] will be destroyed
- resource "null_resource" "control_planes" {
- id = "7092987203276066952" -> null
- triggers = {
- "control_plane_id" = "30069994"
} -> null
}
# module.kube-hetzner.null_resource.control_planes["0-2-control-plane"] will be destroyed
- resource "null_resource" "control_planes" {
- id = "2153595407490067548" -> null
- triggers = {
- "control_plane_id" = "30069993"
} -> null
}
# module.kube-hetzner.null_resource.first_control_plane will be destroyed
- resource "null_resource" "first_control_plane" {
- id = "4534639624935671890" -> null
}
# module.kube-hetzner.null_resource.kustomization will be destroyed
- resource "null_resource" "kustomization" {
- id = "2531306948694365826" -> null
- triggers = {
- "helm_values_yaml" = (sensitive value)
- "options" = ""
- "versions" = <<-EOT
v1.25.0
N/A
N/A
N/A
N/A
EOT
} -> null
}
# module.kube-hetzner.module.control_planes["0-0-control-plane"].hcloud_server.server will be destroyed
- resource "hcloud_server" "server" {
- allow_deprecated_images = false -> null
- backups = false -> null
- datacenter = "nbg1-dc3" -> null
- delete_protection = false -> null
- firewall_ids = [
- 764142,
] -> null
- id = "29666577" -> null
- ignore_remote_firewall_ids = false -> null
- image = "ubuntu-20.04" -> null
- ipv4_address = "5.75.160.152" -> null
- ipv6_address = "2a01:4f8:c2c:fe9c::1" -> null
- ipv6_network = "2a01:4f8:c2c:fe9c::/64" -> null
- keep_disk = false -> null
- labels = {
- "cluster" = "botbiz-production-k8s"
- "engine" = "k3s"
- "provisioner" = "terraform"
- "role" = "control_plane_node"
} -> null
- location = "nbg1" -> null
- name = "botbiz-production-k8s-control-plane-jlr" -> null
- placement_group_id = 134623 -> null
- rebuild_protection = false -> null
- rescue = "linux64" -> null
- server_type = "cpx31" -> null
- ssh_keys = [
- "10384165",
] -> null
- status = "running" -> null
- user_data = "/tzgSRzy9MJHfeGib23VfLR+aaU=" -> null
}
# module.kube-hetzner.module.control_planes["0-0-control-plane"].hcloud_server_network.server will be destroyed
- resource "hcloud_server_network" "server" {
- alias_ips = [] -> null
- id = "29666577-2605336" -> null
- ip = "10.255.0.101" -> null
- mac_address = "86:00:00:3b:4f:5d" -> null
- server_id = 29666577 -> null
- subnet_id = "2605336-10.255.0.0/16" -> null
}
# module.kube-hetzner.module.control_planes["0-0-control-plane"].null_resource.registries will be destroyed
- resource "null_resource" "registries" {
- id = "2799499625627435179" -> null
- triggers = {
- "registries" = " "
} -> null
}
# module.kube-hetzner.module.control_planes["0-0-control-plane"].random_string.identity_file will be destroyed
- resource "random_string" "identity_file" {
- id = "dc1ohfonfm1ezyf9b0a3" -> null
- length = 20 -> null
- lower = true -> null
- min_lower = 0 -> null
- min_numeric = 0 -> null
- min_special = 0 -> null
- min_upper = 0 -> null
- number = true -> null
- numeric = true -> null
- result = "dc1ohfonfm1ezyf9b0a3" -> null
- special = false -> null
- upper = false -> null
}
# module.kube-hetzner.module.control_planes["0-0-control-plane"].random_string.server will be destroyed
- resource "random_string" "server" {
- id = "jlr" -> null
- keepers = {
- "name" = "botbiz-production-k8s-control-plane"
} -> null
- length = 3 -> null
- lower = true -> null
- min_lower = 0 -> null
- min_numeric = 0 -> null
- min_special = 0 -> null
- min_upper = 0 -> null
- number = false -> null
- numeric = false -> null
- result = "jlr" -> null
- special = false -> null
- upper = false -> null
}
Plan: 0 to add, 0 to change, 19 to destroy.
```
@bulnv The only thing that matters now is the autoscaler_image. The Helm releases are yours, not from the module, so make sure they don't depend on the autoscaler_image; there is an implicit or explicit dependency somewhere.
Please just destroy module.kube-hetzner.hcloud_snapshot.autoscaler_image[0] and forget about the first control plane. Then find the implicit or explicit dependencies that make Terraform want to destroy your Helm releases. Fix that, recreate the snapshot through Terraform, and it will be fixed for good!
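A minimal sketch of that snapshot-only recreation (the resource address is the one from the plan above; terraform graph is just one way to hunt for the unexpected dependency):

```sh
# Destroy only the autoscaler snapshot, then let apply recreate it from the upgraded module
terraform destroy -target 'module.kube-hetzner.hcloud_snapshot.autoscaler_image[0]'
terraform apply

# Optional: look for whatever links the Helm releases to the snapshot
terraform graph | grep -i autoscaler_image
```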
@bulnv The reason why the autoscaler image is important to recreate is that the new version of the project holds the right cloud-init for it. So normally the first control plane node doesn't matter, as the new cloud-init reconfigures everything the right way so that k3s starts correctly on the autoscaled node!
(If you give me the full plan for the destroy of module.kube-hetzner.hcloud_snapshot.autoscaler_image[0], I can try passing it to GPT-4, and we can ask it why the helm releases are to be destroyed.)
If you have access to it yourself, the secret is to pass it in multiple chunks.
Hah, that's a good one actually! Luckily I was able to destroy it myself, and apply it. Before that I had tried a hack of changing HCLOUD_IMAGE. What I've got so far: the node boots up, k3s fails, but at least it is installed. Here is the output of the k3s unit log:
systemctl[3305]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
k3s[3308]: time="2023-03-16T17:26:29Z" level=fatal msg="--token is required"
systemd[1]: k3s-agent.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: k3s-agent.service: Failed with result 'exit-code'.
systemd[1]: Failed to start Lightweight Kubernetes.
@bulnv Good progress. Now we need to figure out what's happening.
- cd /etc/rancher/k3s/ and cat the config file; see if the token is present and everything looks OK.
- Also have a look at /var/pre_install/ and see the content of the files there.
- Run ip address show and see if eth1 is present.
- Post the output of journalctl -u k3s-agent.
- Also check the content of /var/log/cloud-init/cloud-init-user.log (if I remember the path correctly).
Somewhere in the above the problem should show up; please share the output if needed.
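Collected into one pass over the node (a sketch; the paths are the ones mentioned above, except that the standard cloud-init logs usually live at /var/log/cloud-init.log and /var/log/cloud-init-output.log):

```sh
# k3s config written by cloud-init: is the token there?
cat /etc/rancher/k3s/config.yaml

# Files staged by the module before the k3s install
ls -l /var/pre_install/ && cat /var/pre_install/*

# Is the private interface (eth1) up?
ip address show

# Why did the agent fail?
journalctl -u k3s-agent --no-pager

# What cloud-init itself logged
cat /var/log/cloud-init-output.log
```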
Heh! Working on it right now! What I've figured out so far:
- /etc/rancher/k3s/ is missing
- /etc/rancher/k3s/config.yaml is missing as well
- the decoded HCLOUD_CLOUD_INIT contains these write_files entries, so maybe this part is somehow involved:
  content: ${base64encode(k3s_config)}
  encoding: base64
  path: /etc/rancher/k3s/config.yaml
  content: ${base64encode(k3s_registries)}
  encoding: base64
  path: /etc/rancher/k3s/registries.yaml
❯ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 96:00:02:02:02:03 brd ff:ff:ff:ff:ff:ff
altname enp1s0
inet 128.140.3.5/32 scope global dynamic noprefixroute eth0
valid_lft 85604sec preferred_lft 85604sec
inet6 fe80::5493:ec5b:9b3e:4337/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UP group default qlen 1000
link/ether 86:00:00:3d:27:c0 brd ff:ff:ff:ff:ff:ff
altname enp7s0
inet 10.0.0.2/32 scope global dynamic noprefixroute eth1
valid_lft 85604sec preferred_lft 85604sec
inet6 fe80::c3a:f0de:792d:1f79/64 scope link noprefixroute
valid_lft forever preferred_lft forever
- journalctl -eu k3s-agent: I gave this in the previous message; nothing besides that:
sh[1365]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
systemctl[1366]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
k3s[1369]: time="2023-03-16T20:42:19Z" level=fatal msg="--token is required"
systemd[1]: k3s-agent.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: k3s-agent.service: Failed with result 'exit-code'.
- cat /var/log/cloud-init-output.log
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/cloudinit/util.py", line 1608, in chownbyname
    uid = pwd.getpwnam(user).pw_uid
KeyError: "getpwnam(): name not found: 'systemd-network'"
The above exception was the direct cause of the following exception:
[...]
Cloud-init v. 23.1-1.1 running 'init' at Mon, 06 Mar 2023 15:32:42 +0000. Up 21.54 seconds.
ci-info: | eth0 | True | 5.75.160.152 | 255.255.255.255 | global | 96:00:01:f9:ab:df |
ci-info: | lo   | True | 127.0.0.1    | 255.0.0.0       | host   | .                 |
(route tables omitted: default route via 172.31.1.1 on eth0)
(SSH host key generation for root@botbiz-production-k8s-control-plane-jlr omitted)

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/cloudinit/util.py", line 1608, in chownbyname
    uid = pwd.getpwnam(user).pw_uid
KeyError: "getpwnam(): name not found: 'systemd-network'"
The above exception was the direct cause of the following exception:
[...]
Cloud-init v. 23.1-1.1 running 'init' at Thu, 16 Mar 2023 20:40:58 +0000. Up 10.53 seconds.
ci-info: | enp7s0 | True | 10.0.0.2     | 255.255.255.255 | global | 86:00:00:3d:27:c0 |
ci-info: | eth0   | True | 128.140.3.5  | 255.255.255.255 | global | 96:00:02:02:02:03 |
(route tables omitted: default routes via 10.0.0.1 on enp7s0 and via 172.31.1.1 on eth0)
(SSH host key generation for root@botbiz-production-k8s-autoscaled-cpx31-21d252b9b56c96a9 omitted)
Cloud-init v. 23.1-1.1 running 'modules:config' at Thu, 16 Mar 2023 20:40:59 +0000. Up 11.62 seconds.
Cloud-init v. 23.1-1.1 running 'modules:final' at Thu, 16 Mar 2023 20:40:59 +0000. Up 12.01 seconds.
++ id -u
zypper remove -y k3s-selinux
Reading installed packages...
Resolving package dependencies...
The following package is going to be REMOVED: k3s-selinux
1 package to remove. After the operation, 95.0 KiB will be freed.
Continue? [y/n/v/...? shows all options] (y): y
(1/1) Removing k3s-selinux-0.0~bd1f1455dirty-0.sle.noarch [...done]
2023-03-16 20:41:10 Application returned with exit status 0.
2023-03-16 20:41:11 Transaction completed.
2023-03-16 20:41:11 tukit 4.1.3 started
2023-03-16 20:41:11 Options: --discard close 4
/var/lib/kubelet/pods not reset as customized by admin to unconfined_u:object_r:container_file_t:s0
/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots not reset as customized by admin to unconfined_u:object_r:container_file_t:s0
2023-03-16 20:41:11 New default snapshot is #4 (/.snapshots/4/snapshot).
2023-03-16 20:41:11 Transaction completed.
Please reboot your machine to activate the changes and avoid data loss.
New default snapshot is #4 (/.snapshots/4/snapshot).
transactional-update finished
My guess is that ${base64encode(k3s_config)} is not working inside the template file.
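One way to check is to look at the user-data cloud-init actually received on the autoscaled node and decode the write_files content by hand (a sketch; /var/lib/cloud/instance/user-data.txt is the standard cloud-init copy of the received user-data, and the content placeholder is illustrative):

```sh
# Inspect the user-data cloud-init received on this node
cat /var/lib/cloud/instance/user-data.txt

# If the write_files entry is present, decode its content field and check the resulting k3s config
echo '<content-field-from-write_files>' | base64 -d
```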
It could be, but it's super weird because it works in all my tests and for others too! 🤯
Please debug some more, at this point I am dry! If I can think of something I will let you know.
Yeah, I changed the template and redeployed; the base64 looks valid, and that is strange. Looks like I'm missing something.
@bulnv Try rebooting the node manually, see if it comes online!
It should reboot after cloud-init; maybe in your case it's not rebooting, so use the reboot command at the end of the user config instead of the power_state statement.
I am pretty convinced that if you reboot, it will join the cluster. I'm guessing power_state did not work on that old version of MicroOS, so we can use the reboot statement instead.
Also run journalctl -u install-k3s-agent please.
cpx31-43ea3b9393f93f33 Ready <none> 51s v1.25.7+k3s1
Good to hear @bulnv, a PR is most welcome; I am curious about this, it could be an edge case!
@mysticaltech please see my changes here: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/pull/660. Yes, I agree this could be an edge case. But as far as I understood from the code, the flow is the following:
IMO the configs are copied a bit too early, and the k3s uninstall script that runs afterwards removes them (as I checked in the uninstall script code, see line 21: https://github.com/ruifigueiredo/k3s/blob/master/uninstall.sh#L21). So I made a change to preserve the configs from deletion during k3s-uninstall. Let me know WDYT?
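The gist of the problem, and one way to preserve the configs, sketched in shell (an illustration only, not the exact change from PR #660; the uninstall script path and the backup location are assumptions):

```sh
# The k3s uninstall script wipes /etc/rancher/k3s (see uninstall.sh line 21),
# so back up the config files cloud-init already wrote before it runs,
# and restore them afterwards so the fresh install can pick them up.
cp -a /etc/rancher/k3s /tmp/k3s-config-backup
/usr/local/bin/k3s-agent-uninstall.sh || true
mkdir -p /etc/rancher
cp -a /tmp/k3s-config-backup /etc/rancher/k3s
```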
@bulnv Yep, exactly, that's what's happening. But in my testing it worked, so the timing must change depending on the machine, hence the bug went unnoticed until it showed up for you. Good catch!!
@bulnv Your fix was just released in 1.10.6 🙏🚀
Description
Hey! I am trying to spin up the autoscaler on my cluster. The autoscaler pod is working fine. Servers are spawned when needed and even show up as green in the Hetzner console, but they aren't able to join the cluster or let me in via SSH. Please see the screenshot below for what I was able to get from the console. What are my options? Can I log in with a password from the console?
Kube.tf file
Screenshots
Platform
linux