openshift / machine-config-operator

Apache License 2.0

move rhel7 worker node updates entirely to openshift-ansible #1592

Open cgwalters opened 4 years ago

cgwalters commented 4 years ago

The role split between openshift-ansible and the MCO is really confusing today (we need a FAQ for this).

Anyway, here's an idea: what if we stopped running the MCD on non-CoreOS systems, and instead openshift-ansible always had to be run to apply changes from MachineConfig? It'd be openshift-ansible that would somehow trigger drain/reboot, etc.

The downside of this is e.g. if a user wants to change the kubelet config, they can `oc edit` it, but then need to bounce out of Kube down to Ansible+SSH to apply it.
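For comparison, on RHCOS that kubelet change stays entirely in-cluster: the user edits or applies a `KubeletConfig` CR and the MCO rolls it out. A sketch of such a CR (the name, label, and `maxPods` value here are illustrative, not from this thread):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-max-pods        # illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    maxPods: 500               # illustrative value
```

Under the proposal above, the same `oc edit` on a RHEL7 cluster would not take effect until an openshift-ansible run picked it up.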

But the massive benefit is that it's super clear how things work: the MCO can optimize for CoreOS, and openshift-ansible optimizes fully for traditional nodes.

And with PRs like https://github.com/openshift/machine-config-operator/pull/1586/ for example, we could rely on /usr/etc for the original files, and do things like make better use of OSTree.
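As a sketch of what /usr/etc buys us: on an OSTree-based host the pristine default configs live in /usr/etc, so local drift in /etc can be computed directly rather than the daemon keeping its own copies of original files (commands assume an OSTree-based system such as RHCOS):

```sh
# List files under /etc that differ from the pristine copies in /usr/etc
ostree admin config-diff

# Inspect one drifted file against its shipped original
diff -u /usr/etc/ssh/sshd_config /etc/ssh/sshd_config
```

On traditional RHEL7 there is no /usr/etc, which is part of why the two paths keep diverging.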

jlebon commented 4 years ago

A random idea I discussed with @runcom, which is less radical but more cumbersome: detect the case where the cluster has only RHCOS nodes and activate an "OSTree-optimized" mode where we can use things like /usr/etc. But yeah, it's not fun to maintain multiple paths that do the same thing.

sdodson commented 4 years ago

> Anyway, here's an idea: what if we stopped running the MCD on non-CoreOS systems, and instead openshift-ansible always had to be run to apply changes from MachineConfig? It'd be openshift-ansible that would somehow trigger drain/reboot, etc.

Do you specifically mean running the MCD as a daemon, or would even the once-from mode be eliminated outside of CoreOS?

/cc @crawford

cgwalters commented 4 years ago

> Do you specifically mean running the MCD as a daemon, or would even the once-from mode be eliminated outside of CoreOS?

~Just the MCD as daemon.~

Well, that only kind of helps. To really make this work, we may need to do something more like having code that translates the MachineConfigs into Ansible, with that code living in openshift-ansible.

So it would probably really be both, but we could also move e.g. the MCD once-from code into a Go library or binary that openshift-ansible would use.
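For reference, the once-from path openshift-ansible uses today is roughly an invocation like the following; the binary path, file path, and exact flags here are illustrative and may differ by release:

```sh
# Run the MCD once against a rendered MachineConfig, then exit,
# leaving no daemon running on the host; --skip-reboot lets the
# caller (Ansible) own the drain/reboot step.
/usr/bin/machine-config-daemon start \
    --once-from=/tmp/rendered-machineconfig.yaml \
    --skip-reboot
```

Packaging that as a Go library or standalone binary would let openshift-ansible keep calling it without the MCD daemon ever running on RHEL7 nodes.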

cgwalters commented 4 years ago

One thing I could imagine is that openshift-ansible owns a "translator" from MachineConfig into Ansible; e.g. if kernelArguments are set in a MachineConfig, that could become a playbook that calls out to grubby.
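A minimal sketch of what such a translator might look like, assuming a simplified MachineConfig shape (the function name, task names, and the decision to emit Ansible task dicts are all hypothetical, not anything that exists in openshift-ansible today):

```python
def machineconfig_to_ansible_tasks(mc):
    """Translate a (simplified) MachineConfig dict into Ansible task dicts.

    Handles two illustrative cases: kernelArguments become a grubby
    invocation, and Ignition-style storage files become copy tasks.
    """
    tasks = []
    spec = mc.get("spec", {})

    kargs = spec.get("kernelArguments", [])
    if kargs:
        tasks.append({
            "name": "Apply MachineConfig kernel arguments",
            "ansible.builtin.command":
                'grubby --update-kernel=ALL --args="{}"'.format(" ".join(kargs)),
        })

    files = spec.get("config", {}).get("storage", {}).get("files", [])
    for f in files:
        tasks.append({
            "name": "Write {}".format(f["path"]),
            "ansible.builtin.copy": {
                "dest": f["path"],
                "content": f.get("contents", {}).get("source", ""),
            },
        })
    return tasks
```

A real translator would also have to handle systemd units, SSH keys, and file decoding (data URLs), which is where the "cumbersome" part comes in.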

But OTOH, maybe the core problem for openshift-ansible to solve is updating host config files that relate to OpenShift (kubelet configs, etc.) and leave basically everything else to external tooling.

cgwalters commented 4 years ago

To elaborate on this: today openshift-ansible applies MachineConfig updates, and in doing this we'd be making openshift-ansible work the same way the MCO does, with no distinction between "configuration change" and "upgrade". And not making that distinction is the correct thing to do, because it ensures there's only one code path that needs to work.

cgwalters commented 4 years ago

One other thing about this is that if we stop running the MCD on rhel7, then we can move the MCD container image to RHEL8, which would make it significantly saner to also extract the binary to the host and run it there.

michaelgugino commented 4 years ago

The suggestions here seem to be going in the complete opposite direction. We should be making efforts to manage more within the cluster, not pushing more into openshift-ansible.

cgwalters commented 4 years ago

> The suggestions here seem to be going in the complete opposite direction. We should be making efforts to manage more within the cluster, not pushing more into openshift-ansible.

But people using BYO RHEL7 are presumably using it precisely because they have a nontrivial investment in existing external management systems, non-containerized system software agents, etc. That directly conflicts with moving things into the cluster and containerizing: the more you do that, the more the question becomes "why aren't you using RHCOS?"

michaelgugino commented 4 years ago

> But people using BYO RHEL7 are presumably using it precisely because

I don't agree with your conclusions. I think the primary reason to use RHEL7 is hardware/software/compliance that is only certified for, or only works on, RHEL7. That's why we only support RHEL7 workers: to enable workloads that wouldn't otherwise be possible.

IMO, there's no reason we can't support RHEL with MCO/MCD the same way we support RHCOS.

openshift-bot commented 3 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting `/remove-lifecycle stale`. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting `/lifecycle frozen`.

If this issue is safe to close now please do so with `/close`.

/lifecycle stale

openshift-bot commented 3 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting `/remove-lifecycle rotten`. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting `/lifecycle frozen`.

If this issue is safe to close now please do so with `/close`.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 3 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting `/reopen`. Mark the issue as fresh by commenting `/remove-lifecycle rotten`. Exclude this issue from closing again by commenting `/lifecycle frozen`.

/close

openshift-ci-robot commented 3 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/machine-config-operator/issues/1592#issuecomment-783793588):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`.
> Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
> Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.