During our research for https://github.com/stackabletech/t2/issues/368, we experimented with two long-running K3s clusters.
Unfortunately, neither of them ran for long: both crashed fairly soon.
Symptoms:
Nearly every test ran into timeouts
The orchestrator as well as some of the agent nodes were not reachable, not even via SSH
The Hetzner graphs showed that CPU usage was hitting its (physical) limit
Post-mortem analysis was difficult because the journals were gone after the reboot (see https://github.com/stackabletech/infrastructure/issues/59)
In this task, we should:
Make the systemd journal persistent so that it survives a reboot (Storage=persistent, see https://www.freedesktop.org/software/systemd/man/journald.conf.html)
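A minimal sketch of the journald setting referenced above, written as a drop-in file. The file name `persistent.conf` is a hypothetical choice and not something already decided in this task; the same setting could equally go into `/etc/systemd/journald.conf` directly.

```ini
# /etc/systemd/journald.conf.d/persistent.conf
# (hypothetical drop-in file name, chosen for illustration only)

[Journal]
# Store journal data on disk under /var/log/journal (created if needed)
# instead of only in the volatile /run/log/journal.
Storage=persistent
```

After placing the file, restarting the journal service with `systemctl restart systemd-journald` should apply the setting. Once a node has rebooted, `journalctl --list-boots` can be used to check that earlier boots are still listed, and `journalctl -b -1` to read the journal of the previous boot.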