rancher / system-upgrade-controller

In your Kubernetes, upgrading your nodes
Apache License 2.0

Upgrading Fedora CoreOS hosts #87

Open masterzen opened 4 years ago

masterzen commented 4 years ago

This is more of an exploratory ticket than an actual feature request; I'd like to gather insight about using SUC in this context.

There's currently a lack of options for upgrading a fleet of Fedora CoreOS based Kubernetes clusters, whereas we had coreos/container-linux-update-operator for CoreOS.

FCOS uses the coreos/zincati tool to check with a Cincinnati server for new upgrades and to download/apply them. The process can be controlled by another protocol called FleetLock, which can ensure that only one host reboots at a time. The problem is that this solution is not Kubernetes aware, unlike SUC. That's the reason why, when we migrated to FCOS, we based our auto-upgrade system on SUC.
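For readers unfamiliar with FleetLock, here is a minimal sketch of the request a client sends to take or release the reboot lock. The endpoint paths, the `fleet-lock-protocol` header and the `client_params` body follow my reading of the FleetLock protocol; the server URL and node id are made up, and the function only builds the request without sending it.

```python
import json
import urllib.request


def build_fleetlock_request(base_url: str, node_id: str,
                            group: str = "default",
                            unlock: bool = False) -> urllib.request.Request:
    """Build (but do not send) a FleetLock request.

    /v1/pre-reboot asks for the reboot lock, /v1/steady-state releases it.
    """
    endpoint = "/v1/steady-state" if unlock else "/v1/pre-reboot"
    body = json.dumps(
        {"client_params": {"id": node_id, "group": group}}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + endpoint,
        data=body,
        method="POST",
        headers={
            "fleet-lock-protocol": "true",
            "Content-Type": "application/json",
        },
    )


# Hypothetical server URL and node id, for illustration only.
req = build_fleetlock_request("http://fleetlock.example:8080", "node-a")
print(req.get_method(), req.full_url)
```

Nothing in this exchange involves cordoning or draining a node, which is the "not Kubernetes aware" gap described above.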

I've used it successfully to upgrade FCOS hosts with an upgrade script that calls rpm-ostree (the underlying upgrade command). Unfortunately, this doesn't work completely (see coreos/fedora-coreos-tracker#536 for a discussion of the issue). The main problem is that the last step of my naive upgrade script executes a reboot, which kills the container. The job is consequently marked as failed, and when the node comes back up, the container is scheduled again (hopefully without doing anything). This increases the time it takes to roll out an upgrade. I have yet to find a solution, because there's a race between making sure the machine reboots (to apply the update) and signalling to SUC that the update was performed correctly.
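To make the ordering problem concrete, here is a rough sketch of such a script, with one untested idea for narrowing the race: schedule the reboot a minute out with `shutdown -r +N` instead of rebooting immediately, so the job has a chance to exit successfully first. It assumes the host root is mounted at `/host` inside the job container; whether a scheduled reboot actually fires from inside the container depends on how the pod is set up (privileges, host namespaces), which this sketch does not verify. It is not the script described above.

```python
import subprocess

HOST_ROOT = "/host"  # assumption: host root filesystem mounted here in the job pod


def upgrade_commands(reboot_delay_minutes: int = 1) -> list[list[str]]:
    """Return the host commands in the order the job would run them."""
    chroot = ["chroot", HOST_ROOT]
    return [
        # Stage the new deployment; it only takes effect after a reboot.
        chroot + ["rpm-ostree", "upgrade"],
        # Schedule the reboot instead of rebooting right away, so the
        # container can exit 0 before the node goes down.
        chroot + ["shutdown", "-r", f"+{reboot_delay_minutes}"],
    ]


def run_upgrade(dry_run: bool = True) -> None:
    """Run the commands, or only print them when dry_run is True."""
    for cmd in upgrade_commands():
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


run_upgrade(dry_run=True)
```

Even if this works, the job would report success before the node has actually booted into the new version, so it trades one side of the race for the other.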

The other issue is that I have to manually maintain the version number to upgrade to in the Plan. For instance, if there's a new FCOS version, I manually update the spec.version field of the Plan to trigger the upgrade.
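For context, a Plan of this kind might look roughly like the following, built here as a Python dict and printed as JSON (which `kubectl apply -f` also accepts). The field names are from the SUC Plan resource as I understand it; the plan name, image, command and version string are placeholders, not the ones actually in use.

```python
import json


def build_fcos_plan(version: str) -> dict:
    """Build a SUC Plan manifest pinned to a single FCOS version."""
    return {
        "apiVersion": "upgrade.cattle.io/v1",
        "kind": "Plan",
        "metadata": {"name": "fcos-upgrade", "namespace": "system-upgrade"},
        "spec": {
            "concurrency": 1,        # upgrade one node at a time
            "version": version,      # the field that has to be bumped by hand
            "serviceAccountName": "system-upgrade",
            "cordon": True,
            "nodeSelector": {
                "matchExpressions": [
                    {"key": "kubernetes.io/os", "operator": "In", "values": ["linux"]}
                ]
            },
            "upgrade": {
                # Hypothetical image and script, for illustration only.
                "image": "registry.example/fcos-upgrade",
                "command": ["/usr/local/bin/upgrade.sh"],
            },
        },
    }


# Placeholder version string in the FCOS numbering style.
print(json.dumps(build_fcos_plan("38.20230609.3.0"), indent=2))
```

Every new FCOS release means regenerating and re-applying this manifest with a different `version`.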

My initial plan was to develop a service that would implement the SUC channel system on one side and the Cincinnati protocol on the other, so that the Plan would be triggered whenever the Cincinnati server reports a new version.
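The core of such a bridge could be as small as the sketch below. It assumes two things that should be checked against the real implementations: that a Cincinnati graph is a JSON object with a `nodes` list whose entries carry a `version`, and that SUC resolves a channel by reading the last path segment of the redirect `Location` returned by the channel URL. The redirect base URL and the sample versions are made up.

```python
def version_key(version: str) -> tuple[int, ...]:
    """Turn a dotted numeric version into a tuple that sorts numerically."""
    return tuple(int(part) for part in version.split("."))


def latest_version(graph: dict) -> str:
    """Pick the highest version among the nodes of a Cincinnati graph.

    This ignores the graph's edges, so it does not honour update paths
    or staged rollouts; it is the simplest possible policy.
    """
    return max((node["version"] for node in graph["nodes"]), key=version_key)


def channel_location(graph: dict,
                     base: str = "https://example.invalid/fcos/releases/") -> str:
    """Build the redirect target a channel endpoint would answer with."""
    return base + latest_version(graph)


# Made-up sample graph in the Cincinnati shape (edges are index pairs).
sample_graph = {
    "nodes": [
        {"version": "37.20230303.3.0", "payload": "", "metadata": {}},
        {"version": "38.20230609.3.0", "payload": "", "metadata": {}},
        {"version": "38.20230625.3.0", "payload": "", "metadata": {}},
    ],
    "edges": [[0, 1], [1, 2]],
}
print(channel_location(sample_graph))
```

A real service would fetch the graph from the Cincinnati server periodically and answer the channel URL with a redirect to this location.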

In retrospect, I'm wondering if this shouldn't be part of SUC itself, instead of living in another service. Would you accept a PR to implement a configurable channel system, where one of the implementations would be the Cincinnati protocol?

In short, besides my questions above, I'm wondering how we can better connect the FCOS upgrade tools (Zincati, etc.) to SUC to build a powerful k8s-based FCOS upgrade system.

Thanks,

/cc @lucab

bitfisher commented 1 year ago

@masterzen I'm looking for a similar solution.

Initially I thought fleetlock was the way to go, but unfortunately it lacks support for maintenance windows.

zincati also doesn't support combining the fleet_lock and periodic strategies. There is an open feature request for this in https://github.com/coreos/zincati/issues/1014.

Another possible solution that came to my mind is using systemd timers and zincati with fleet_lock. One timer (maintenance window start) would start zincati and another timer (maintenance window end) would stop it. But this approach seems more than ugly.

Did you manage to have a proper solution using system-upgrade-controller?

Would you mind sharing your plan and actual update script?

brandond commented 1 year ago

sounds like some overlap with https://github.com/rancher/system-upgrade-controller/issues/63 ?

bitfisher commented 1 year ago

@brandond yes, you are right! thanks for the pointer to kured ;) this wasn't on my radar yet.

craigcabrey commented 6 months ago

for what it's worth, I maintain a fork of fleetlock that adds a simple maintenance window: https://github.com/craigcabrey/fleetlock