openshift / installer

Install an OpenShift 4.x cluster
https://try.openshift.com
Apache License 2.0

change bootstrap to pivot #2542

Open cgwalters opened 4 years ago

cgwalters commented 4 years ago

See this thread, specifically this comment.

TL;DR - today the installer launches a bootimage, which is usually the pinned RHCOS version (for IPI installs), but can be different - see this issue which is about making it easier for people to find the correct bootimage.

See also https://github.com/openshift/installer/pull/2532

Filing this issue to track changing the installer to do a pivot on the bootstrap node.

I think the architecture would look like a new bootstrap-pivot.service between release-image.service and bootkube.service - we'd run the MCD on the bootstrap host, telling it to pivot to the target machine-os-content.
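The ordering described above could be sketched roughly as follows. This is only an illustration of where such a unit would sit between `release-image.service` and `bootkube.service`: the `ExecStart` path and the function name `render_pivot_unit` are hypothetical placeholders, not anything that exists in the installer, and the actual MCD invocation is deliberately left out.

```python
# Hypothetical sketch: render a systemd unit that orders a pivot step
# after release-image.service and before bootkube.service.
# The ExecStart path is a placeholder, not a real installer script.


def render_pivot_unit(exec_start: str = "/usr/local/bin/bootstrap-pivot.sh") -> str:
    """Return the text of an assumed bootstrap-pivot.service unit."""
    lines = [
        "[Unit]",
        "Description=Pivot the bootstrap host to the target machine-os-content",
        # Run only once the release image has been resolved...
        "Wants=release-image.service",
        "After=release-image.service",
        # ...and finish before bootkube starts.
        "Before=bootkube.service",
        "",
        "[Service]",
        "Type=oneshot",
        "RemainAfterExit=yes",
        f"ExecStart={exec_start}",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_pivot_unit(), end="")
```

Since a pivot implies a reboot, a real unit would also need some guard so that it does not try to pivot again on the next boot; that part is not shown here.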

In fact...if we did this, we could streamline down the bootimages - e.g. no reason to ship kubelet/cri-o in the bootimages. (Though, we'd need new terminology like "bootRHCOS" to distinguish from "normal" RHCOS in machine-os-content or so?) Anyways, not required for this change.

wking commented 4 years ago

Though, we'd need new terminology like "bootRHCOS"...

Red Hat Pivot OS? All this OS does is pivot to an OSTree pulled from a container image. It is not certified for any other purpose.

vrutkovs commented 4 years ago

if we did this, we could streamline down the bootimages

That would be very helpful for OKD-on-FCOS; however, it would require a few changes to the RHCOS oscontainer build process.

LorbusChris commented 4 years ago

@vrutkovs @cgwalters this is implemented by Vadim's commit here, right? https://github.com/openshift/installer/pull/2548/commits/c7da4745829deae4cfeb87cfb885a3e6259f36a3

Should we try to get this into master soon? (i.e. before tackling spec3 for OCP)

abhinavdahiya commented 4 years ago

Pulling the machine-os-content on the bootstrap host and pivoting adds an 800MB download to the critical path of bootstrapping, and also causes a reboot of the bootstrap host.

The bootstrap host doesn't need as close a tie to the OpenShift binaries as a control-plane host does. We only use kubelet to run static pods and podman to run some pods.

So I don't get why there is a requirement for a pivot on the bootstrap host?

vrutkovs commented 4 years ago

The bootstrap node would use kubelet/crio/machine-config-daemon from the original AMI. That means fixes to these components would not be applied during the bootstrap phase, which might be critical for some deployments.

wking commented 4 years ago

bootstrap node would use kubelet/crio/machine-config-daemon...

Is there an MCD baked into RHCOS? I'd be surprised if we ran one on the bootstrap machine. We certainly extract machine-config components from the target release image and run them, but the only bootimage exposure in that is podman/kubelet/crio (and I have no cost/benefit opinion of pivoting for those ;).

cgwalters commented 4 years ago

So I don't get why there is a requirement for a pivot on the bootstrap host?

First, this helps OKD which will use FCOS, which won't include a kubelet by default (at least, not right now).

Second, it does help avoid "bootimage drift" issues with the installer as noted also in the initial comment.

Further, the initial comment links to https://github.com/openshift/enhancements/pull/78#discussion_r337137313 - so perhaps to disintermediate we can summon @smarterclayton

Is there an MCD baked into RHCOS?

Yes.

openshift-bot commented 4 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 4 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

cgwalters commented 4 years ago

/remove-lifecycle stale

openshift-bot commented 4 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot commented 4 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/installer/issues/2542#issuecomment-616896830):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`.
> Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
> Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

LorbusChris commented 4 years ago

/reopen

openshift-ci-robot commented 4 years ago

@LorbusChris: Reopened this issue.

In response to [this](https://github.com/openshift/installer/issues/2542#issuecomment-617113157):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

LorbusChris commented 4 years ago

/remove-lifecycle rotten

ashcrow commented 4 years ago

I'd love to see this happen. It would get us closer to boot images being "basic boot and pivot to expected content" in both the installer (bootstrap node) and the hosts themselves (which is the case today).

cgwalters commented 4 years ago

With some nontrivial but also not extremely difficult work, we could change the OS update stack to support an "update and restart all of userspace, but not the kernel" semantic which would shave some time off this.

cgwalters commented 4 years ago

Pulling the machine-os-content on the bootstrap host and pivoting adds an 800MB download to the critical path of bootstrapping

One thing also - it can't be that hard to teach the bootstrap host how to serve the images it pulled to the control plane - so if we did that it would reduce the 3 separate pulls of m-o-c from the upstream registry to one. (And similar for other images)
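As a back-of-the-envelope illustration of that saving, here is a small sketch. It assumes the roughly 800MB machine-os-content size quoted earlier in this thread and the "3 separate pulls" figure from the comment above; the function name `upstream_pull_mb` is made up for this example, and real image sizes and pull counts will vary.

```python
# Rough arithmetic only: how much is fetched from the upstream registry
# if every host pulls m-o-c itself versus the bootstrap host serving it.

M_O_C_MB = 800        # approximate size quoted earlier in the thread
SEPARATE_PULLS = 3    # "3 separate pulls" from the comment above


def upstream_pull_mb(image_mb: int, pulls: int, served_by_bootstrap: bool) -> int:
    """Total MB fetched from the upstream registry for one image."""
    # If the bootstrap host re-serves the image, upstream is hit only once.
    upstream_pulls = 1 if served_by_bootstrap else pulls
    return image_mb * upstream_pulls


before = upstream_pull_mb(M_O_C_MB, SEPARATE_PULLS, served_by_bootstrap=False)
after = upstream_pull_mb(M_O_C_MB, SEPARATE_PULLS, served_by_bootstrap=True)

print(f"upstream traffic today:      {before} MB")
print(f"with bootstrap serving it:   {after} MB")
print(f"saved from upstream:         {before - after} MB")
```

Under those assumptions the upstream registry would serve about 800MB instead of about 2400MB for m-o-c; the other pulls would move to the bootstrap host rather than disappear.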

openshift-bot commented 3 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

LorbusChris commented 3 years ago

/remove-lifecycle stale

travier commented 3 years ago

/lifecycle frozen