Closed maaft closed 6 months ago
For folks running into the same issue, please SSH into your nodes (see README, debug section) and run the following commands:
cloud-init single --frequency always --name write_files
cloud-init single --frequency always --name runcmd
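If you have to do this on several nodes, the two commands above could be wrapped in a small helper. This is only a sketch: the function name `rerun_modules` and the `CLOUD_INIT` override variable are made up for illustration and are not part of the project.

```shell
#!/bin/sh
# Hypothetical helper: re-run the given cloud-init modules one by one.
# CLOUD_INIT is an assumed override so the binary path can be changed.
CLOUD_INIT="${CLOUD_INIT:-cloud-init}"

rerun_modules() {
    for module in "$@"; do
        echo "re-running cloud-init module: $module"
        # Same invocation as the manual commands above, one module at a time.
        "$CLOUD_INIT" single --frequency always --name "$module" || {
            echo "module $module failed" >&2
            return 1
        }
    done
}
```

After sourcing it on a node, `rerun_modules write_files runcmd` should be equivalent to running the two commands by hand, stopping at the first module that fails.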
@maaft @andi0b I'm currently thinking about implementing such a forced re-run on each reboot, but I do not know whether that config would itself survive such a faulty upgrade. Probably not, but it's worth a try.
[Unit]
Description=Re-run cloud-init
After=network.target
[Service]
Type=oneshot
ExecStart=/usr/bin/cloud-init clean --logs
ExecStart=/usr/bin/cloud-init init
ExecStart=/usr/bin/cloud-init modules -m final
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
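For anyone who wants to try the unit by hand before it lands in the module, a rough sketch of installing it could look like the following. The unit file name `rerun-cloud-init.service`, the `UNIT_DIR` and `SYSTEMCTL` variables, and the function name are assumptions for illustration. Whether the file survives a faulty upgrade is exactly the open question.

```shell
#!/bin/sh
# Assumed names, not taken from the project:
UNIT_NAME="${UNIT_NAME:-rerun-cloud-init.service}"
UNIT_DIR="${UNIT_DIR:-/etc/systemd/system}"
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

install_rerun_unit() {
    # Write the unit exactly as proposed above.
    cat > "$UNIT_DIR/$UNIT_NAME" <<'EOF'
[Unit]
Description=Re-run cloud-init
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/bin/cloud-init clean --logs
ExecStart=/usr/bin/cloud-init init
ExecStart=/usr/bin/cloud-init modules -m final
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
    # Make systemd pick up the new file and start it on every boot.
    "$SYSTEMCTL" daemon-reload && "$SYSTEMCTL" enable "$UNIT_NAME"
}
```

Run `install_rerun_unit` as root on the node. With `Type=oneshot`, systemd runs the three `ExecStart=` lines in order and stops at the first one that fails.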
Now, those faulty upgrades are rare, but if they do happen we must find a solution.
Please let me know what you think!
Anyways, we are also planning on exploring to change the underlying nodes to be Talos ones, so that would solve all of this for good.
> Anyways, we are also planning on exploring to change the underlying nodes to be Talos ones, so that would solve all of this for good.
funny - before I found your repo I already used Talos. Then decided against it because it did not have much automation. Looking forward to it!
> Anyways, we are also planning on exploring to change the underlying nodes to be Talos
I'm not following the latest Linux trends, and I had never used transactional-update before I had to investigate this issue. Honestly, it was quite an underwhelming experience. The documentation is somewhere between poor and non-existent, and there is not a lot of activity in the corresponding GitHub projects. I think what is happening here is caused by a broken design in transactional-update.
@andi0b Probably not, it's a temporary bug. MicroOS is the open-source version of SLE Micro, which SUSE uses for enterprise, so it's under active development. The only thing is that, like Fedora with regard to Red Hat, we get to be the guinea pigs 😅
@mysticaltech I know it's not abandoned, but I have the suspicion the enterprise version has some additional management on top of it, to circumvent this issue. The docs of transactional-update suggest a reboot asap after installing updates, and this is exactly what I didn't do (only rebooted after 40 days).
It's also not realistic that every user will reboot immediately; there are always some workloads that will prevent a node reboot, and not everybody is going to notice that right away.
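Since a long gap between update and reboot is the suspect here, something like the following could at least flag nodes that have been up for a long time. It is only a sketch: the 7-day `MAX_DAYS` threshold and all function names are assumptions, and it checks plain uptime rather than anything transactional-update reports.

```shell
#!/bin/sh
# Assumed threshold; pick whatever fits your reboot policy.
MAX_DAYS="${MAX_DAYS:-7}"

# Convert whole seconds into whole days (rounded down).
seconds_to_days() {
    echo $(( $1 / 86400 ))
}

# Read uptime in whole seconds from a /proc/uptime-style file.
uptime_seconds() {
    cut -d' ' -f1 "${1:-/proc/uptime}" | cut -d. -f1
}

# Print a warning and return 1 when the node has been up longer than MAX_DAYS.
check_uptime() {
    days="$(seconds_to_days "$(uptime_seconds "$1")")"
    if [ "$days" -gt "$MAX_DAYS" ]; then
        echo "WARNING: up for $days days, consider rebooting"
        return 1
    fi
    echo "OK: up for $days days"
}
```

Calling `check_uptime` without arguments reads `/proc/uptime` on the node itself; the optional file argument is only there to make it easy to try out.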
Talos looks promising, but like a big change, with a lot of new issues...
@andi0b Yep, interesting details you share here. The lapse in reboot time could explain the issue.
As for Talos, we will let all of you try it first in a beta branch and gather feedback. But that will take some time to implement.
In the meantime, please let me know if that issue shows up again. I'm hoping it was just one bad apple update along the way.
And if you find a way to actively prevent it, please let me know. You seem close to deciphering the whole thing lol.
Interesting topic! What would be all the benefits and also drawbacks of replacing MicroOS with Talos?
Closing this as it was fixed in the latest versions of transactional-update.
Description
This means that I can't do:
What I can do:
I also talked to Hetzner support; they don't see any issues on their side and asked me to execute
dhclient eth0
since my servers may not be renewing their DHCP address correctly. But that didn't help, unfortunately. Any idea?
I already checked for SELinux errors: on one of the broken nodes no errors were displayed, and on the other a huuuuuuge list (which I already posted in the pinned SELinux issue) was generated (after issuing
dhclient eth0
).
Kube.tf file
Screenshots
No response
Platform
Linux, Arm64