Updating SSH keys on existing Machines

LittleFox94 commented 1 year ago

Hi,

it would be useful, if machine-controller could update the SSH keys deployed to the machines instead of rotating them with new keys.

Right now, the SSH keys are given via cloud-init (yea, deprecated, but still in there)/ignition, with the source of that being changed from templates in machine-controller to templates in operating-system-manager. Due to the nature of when those are applied, updating the SSH keys via those mechanisms is impractical to impossible.

KKP has a component called usersshkeysagent, syncing SSH keys onto machines of user clusters. This can already update the keys, maybe it would be practical to move that component into machine-controller and have KKP sync the SSH keys into the MachineDeployment resources instead of directly to the machines?

I think instead of using a DaemonSet to have an agent on each node (what usersshkeysagent is doing now), machine-controller could instead spawn a Job for each node when SSH keys have to be updated.

Do you think this is a worthwhile addition and change to machine-controller and KKP?
Do you see any problems with the proposed approachs?
Yes, we are thinking of implementing that ourselves :)

embik commented 1 year ago

Hi @LittleFox94, sorry for the late reply on this. This is just my personal view and does not necessarily represent Kubermatic's, but I think this approach might not be desired for machine-controller. In general, the idea behind Cluster API has always been "immutable infrastructure", so rotating machines frequently (e.g. to replace keys or to use a newer OS image) and normalising such frequent rotation is one of the core ideas behind machine-controller. The gist is, I believe the current approach to be "working as intended".

Since usersshkeysagent in KKP is pretty heavily tied to the platform, my personal recommendation would be to write a similar controller/agent running as a DaemonSet that doesn't need KKP. Implementing such an agent is pretty straightforward. The problem with a Job-based approach is that you might miss changes to the SSH authorized_keys file from outside, and you might want to watch that continuously to ensure compliance with the list of approved SSH keys in your own controller/agent implementation. But such an implementation would best live separate from machine-controller so it can be applied on any type of Kubernetes cluster.

kubermatic-bot commented 9 months ago

Issues go stale after 90d of inactivity. After a furter 30 days, they will turn rotten. Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

LittleFox94 commented 9 months ago

Hi @LittleFox94, sorry for the late reply on this.

Hi, thanks for your detailed reply and no worries for being late - I'm apparently just as late, time is flying by..

This is just my personal view and does not necessarily represent Kubermatic's, but I think this approach might not be desired for machine-controller. In general, the idea behind Cluster API has always been "immutable infrastructure", so rotating machines frequently (e.g. to replace keys or to use a newer OS image) and normalising such frequent rotation is one of the core ideas behind machine-controller. The gist is, I believe the current approach to be "working as intended".

I am familiar with that general idea and approach and, for the most part, I agree with it and think it's a good thing. However, I think this should not be an argument to not allow any changes to already-deployed infrastructure at all. I also like choice.

For some setups, regularly rotating the machines is cumbersome, non-ideal performance-wise or maybe just feasible to recover from a really broken state, maybe because manual work is required or customers yell at you when their apiserver is terminating a watch due to being moved to another node (heavy sigh). machine-controller already helps a bit for such cases with allowing to disable automated updates for Flatcar (immutable infrastructure would imho also mean "no flatcar updates" but instead rotating machines with the current template, so there is already a precedent in machine-controller for updating infrastructure instead of rotating).

Changing SSH keys is, imho, lot less of a change to already-deployed infrastructure than updating the OS of a machine and for some setups it might be a useful feature (it would be for us - we use openebs-lvm on machine-controller-managed machines and while we only store data on nodes that is replicated across multiple nodes, have PodDisruptionBudgets to ensure nodes can only be drained (and thus rotated-away/deleted) when the data is moved and have a controller to facilitate rotating nodes automatically, it's slow and unnecessary "scary" for just rotating some SSH keys). I agree however, that updating SSH keys on deployed machines maybe should not be a default behavior and definitely not an always-on behavior.

Does that change your view a bit? Do you have a better place where other opinions on this could be collected, if anyone else is interested in this at all?

Since usersshkeysagent in KKP is pretty heavily tied to the platform, my personal recommendation would be to write a similar controller/agent running as a DaemonSet that doesn't need KKP. Implementing such an agent is pretty straightforward. [..] But such an implementation would best live separate from machine-controller so it can be applied on any type of Kubernetes cluster.

Sure, having this as an extra DaemonSet independent from KKP would work. If we were to split the existing usersshkeysagent out of KKP into something independent and changed KKP to use that instead, would that be a change you'd (you personally and you Kubermatic) like?

The problem with a Job-based approach is that you might miss changes to the SSH authorized_keys file from outside, and you might want to watch that continuously to ensure compliance with the list of approved SSH keys in your own controller/agent implementation.

Fair point, I agree.

(I'll proof-read my reply in a bit, my computer decided to do funny things with my mouse input (not sure if Chromium, Sway, Mouse, USB Dongle for it, ...) and I first want to ensure not loosing my text here^^' huh, funny: https://github.com/swaywm/sway/issues/4261#issuecomment-1842485027)

embik commented 9 months ago

Does that change your view a bit? Do you have a better place where other opinions on this could be collected, if anyone else is interested in this at all?

I totally agree that "runtime configuration" (like SSH keys) could and maybe should be considered exempt from immutable infrastructure. But I think that also strengthens the point of SSH key management happening in a separate component, since machine-controller also doesn't mange Flatcar updates (instead, you need to deploy flatcar's update operator).

Sure, having this as an extra DaemonSet independent from KKP would work. If we were to split the existing usersshkeysagent out of KKP into something independent and changed KKP to use that instead, would that be a change you'd (you personally and you Kubermatic) like?

We for sure wouldn't mind a "fork" of the functionality based on the KKP code as long as its license is honoured, but at the moment our focus is elsewhere since the functionality is "good enough" for KKP, plus migration logic would be necessary etc. We can maybe revisit your solution at a later point (assuming it's going to be publicly available) and see if we can bring it under the Kubermatic umbrella if you are interested, but at the moment that would be unlikely to happen soon.

huh, funny: https://github.com/swaywm/sway/issues/4261#issuecomment-1842485027

Great choice of WM! Loved using sway back in the days 😄

kubermatic-bot commented 8 months ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

kubermatic-bot commented 7 months ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

kubermatic-bot commented 7 months ago

@kubermatic-bot: Closing this issue.

In response to [this](https://github.com/kubermatic/machine-controller/issues/1697#issuecomment-1925996538): >Rotten issues close after 30d of inactivity. >Reopen the issue with `/reopen`. >Mark the issue as fresh with `/remove-lifecycle rotten`. > >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

kubermatic / machine-controller

Updating SSH keys on existing Machines #1697