kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

v1alpha4: write sentinel file as part of bootstrap #3716

Closed CecileRobertMichon closed 3 years ago

CecileRobertMichon commented 4 years ago

original discussion: https://kubernetes.slack.com/archives/C8TSNPY4T/p1599744069010700 /cc @jdef

See also:
- https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/603
- proposal: https://docs.google.com/document/d/1U0GxvO6ltgIINMjpQz96UD4bExlN2h21wyyn-3ENezc/edit
- https://docs.google.com/document/d/1FVRxo9toKSUmvKIUFFzPFhnFrfdR9s7S6Bl4shovNlg/edit#heading=h.3mwmvwsf4jyi
- Sept 16 2020 office hours

Proposal: change the bootstrap provider contract to include writing a sentinel file at a specific (or possibly user-configurable?) location. For example, for CABPK, this would look like adding an ~~`touch $filepath`~~ `echo success > $filepath` (to be compatible with Windows) after the `kubeadm init` or `kubeadm join` command in the script that is written as Bootstrap Data.

This would not completely solve the problem of bootstrap failure detection but provides a clear signal to the infra provider that bootstrap is complete and the infrastructure provider can then take it from there and use infra specific mechanisms to read that signal.

/kind feature
/kind proposal
/milestone v0.4.0
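A minimal sketch of what the proposed CABPK change could look like. The sentinel path and the `run_bootstrap` stand-in are illustrative assumptions, not part of the issue; the actual location would be defined by the contract:

```shell
#!/bin/sh
# Sketch only: the sentinel path below is a placeholder, not the
# location the contract would eventually standardize.
SENTINEL="/tmp/bootstrap-success.complete"

run_bootstrap() {
  # Stand-in for the real `kubeadm init` / `kubeadm join` invocation.
  true
}

# Write the sentinel only if bootstrap exits 0, and use
# `echo success > file` rather than `touch` so the same contract
# can work for Windows nodes.
if run_bootstrap; then
  echo success > "$SENTINEL"
fi
```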

jdef commented 4 years ago

> For example, for CABPK, this would look like adding a `touch $filepath` after the `kubeadm init` or join command in the script that is written as Bootstrap Data.

Should this be clarified? e.g. only touch the sentinel file once the kubeadm init or kubeadm join command has completed successfully.

User-configurability isn't very important for my use case. That said, CAPBK already makes assumptions about writing to /tmp, right? It would be nice if the contract here didn't make any assumptions about filesystem existence other than what is already assumed.

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

fabriziopandini commented 3 years ago

/remove-lifecycle stale

CecileRobertMichon commented 3 years ago

/assign

fabriziopandini commented 3 years ago

Background question: does this problem fall under the node agent's responsibility? cc @randomvariable

vincepri commented 3 years ago

It should probably be a contract respected by all bootstrap providers.

CecileRobertMichon commented 3 years ago

Yes, it's up to each bootstrap provider to determine what counts as "successful" bootstrapping. For example, for CABPK it would be if `kubeadm init` or `kubeadm join` exits 0. It's up to each bootstrap provider to get the file on the VM according to its own implementation.

The presence of the file itself is what should be part of the contract.
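On the consuming side of that contract, an infrastructure provider could detect completion by polling for the file. A hedged sketch (the path, timeout, and the simulated write are illustrative assumptions; a real infra provider would read this signal through an infra-specific mechanism, as noted above):

```shell
#!/bin/sh
# Illustrative only: path and timeout are assumptions for this sketch.
SENTINEL="/tmp/bootstrap-success.complete"

# Poll until the sentinel file appears; return 1 if it never does.
wait_for_sentinel() {
  path=$1
  max_tries=$2
  tries=0
  while [ ! -f "$path" ]; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}

# Simulate the bootstrap script having already written the sentinel:
echo success > "$SENTINEL"

if wait_for_sentinel "$SENTINEL" 30; then
  echo "bootstrap reported complete"
fi
```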

fabriziopandini commented 3 years ago

Ok thanks! I'm looking forward to the node agent design doc to get all the pieces together and understand if/how this will impact bootstrap providers as well

randomvariable commented 3 years ago

I know we don't have a label for it, but just for tracking

/area node-agent

k8s-ci-robot commented 3 years ago

@randomvariable: The label(s) area/node-agent cannot be applied, because the repository doesn't have them

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/3716#issuecomment-762814879):

> I know we don't have a label for it, but just for tracking
>
> /area node-agent

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.