kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Remove the /cluster directory #78995

Open timothysc opened 5 years ago

timothysc commented 5 years ago

For years we have publicly stated that the /cluster directory is deprecated and not maintained. However, every cycle it is updated and there are bugs found and fixed by sig-cluster-lifecycle.

I'd like to enumerate what needs to get done in order for us to wholesale remove the /cluster directory.

/assign @dims @spiffxp @justinsb @timothysc
/cc @liggitt @neolit123

dims commented 5 years ago

/area code-organization

timothysc commented 5 years ago

/cc @andrewsykim

neolit123 commented 5 years ago

Or we could potentially break it down and move certain sub-folders out of tree. Currently it contains a collection of items:

liggitt commented 5 years ago

looking at the references to it...

/sig testing for e2e bringup

/sig scalability for kubemark bringup

sftim commented 5 years ago

Is this also relevant to SIG Docs?

timothysc commented 5 years ago

xref https://github.com/kubernetes/kubernetes/pull/78543, which is an example of the continued technical debt that we see and have to pay for in different ways across SCL.

jaypipes commented 5 years ago

Is this also relevant to SIG Docs?

Yes, I think this is a good example of where /cluster scripts are referenced from the docs: https://github.com/kubernetes/website/pull/14929

andrewsykim commented 5 years ago

For v1.16: investigate whether Cluster API is a potential replacement that meets the same level of coverage as /cluster, and enumerate what is missing from Cluster API.

/assign @alejandrox1

dims commented 5 years ago

Step 1, as we discussed, is to look at all the CI jobs that use kube-up in the cluster directory, inventory the knobs/settings/configurations they set up or use, and cross-check whether cluster-api/kubeadm allows us to do the same. Step 2 is to mirror the GCE e2e CI job (pull-kubernetes-e2e-gce, to be specific) using cluster-api for AWS.

Both steps can be done in parallel and will need help/effort/coordination with the wg-k8s-infra and sig-testing folks.
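As an illustration of what the step-1 inventory could look like, here is a minimal sketch that scans prow job configs for kube-up knobs. The `config/jobs` path and the `KUBE_` prefix heuristic are assumptions made for illustration, not part of the actual plan:

```go
// Sketch: inventory kube-up-related environment variables referenced by
// prow job configs. The path and the KUBE_ prefix heuristic are assumptions.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"regexp"
	"sort"
	"strings"
)

func main() {
	root := "config/jobs" // assumed local checkout of kubernetes/test-infra
	envRe := regexp.MustCompile(`\bKUBE_[A-Z0-9_]+`)
	seen := map[string]int{} // env var name -> occurrence count

	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".yaml") {
			return err
		}
		data, readErr := os.ReadFile(path)
		if readErr != nil {
			return readErr
		}
		for _, name := range envRe.FindAllString(string(data), -1) {
			seen[name]++
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "walk failed:", err)
		os.Exit(1)
	}

	// Print a sorted inventory to cross-check against cluster-api/kubeadm.
	names := make([]string, 0, len(seen))
	for n := range seen {
		names = append(names, n)
	}
	sort.Strings(names)
	for _, n := range names {
		fmt.Printf("%-40s %d occurrence(s)\n", n, seen[n])
	}
}
```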

alejandrox1 commented 5 years ago

I'll start with step 1 right away. So step 2 is free if anyone wants to work on this as well 😃

dims commented 5 years ago

@alejandrox1 sounds good. please start a google doc or something that we can use to compile notes on the various flags/options

alejandrox1 commented 5 years ago

On the google doc: https://docs.google.com/document/d/1p3c_sOALbEzg2VH2OPz3w9yKwlwaU4jATyS0lqxFzt4/edit?usp=sharing

Still got a lot to do for step 1 but will ping when this is ready /cc @mariantalla

mariantalla commented 5 years ago

Could I start work on step 2 (i.e. investigating/starting the refactor of pull-kubernetes-e2e-gce to use Cluster API)?

I'll assign myself, but please feel free to unassign me if someone else is already working on it!

/assign

justaugustus commented 5 years ago

Go for it, @mariantalla! :)

dims commented 5 years ago

yes please @mariantalla !

neolit123 commented 5 years ago

i have created a ticket for tracking the replacement of kube-up test jobs with Cluster API jobs: https://github.com/kubernetes/kubernetes/issues/82532

cc @mariantalla

dims commented 5 years ago

Update: CAPG now has a presubmit and a periodic job that start up a cluster and run the e2e/conformance tests:

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-gcp#e2e%20tests
https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-gcp#pr-e2e

pjh commented 5 years ago

I'd like to enumerate what needs to get done in order for us to wholesale remove the /cluster directory.

Clusters with Windows nodes on GCE depend on the cluster/gce/ kube-up code. This dependency was noted in https://github.com/kubernetes/kubernetes/issues/82532#issue-491700946, but I don't see Windows mentioned in this issue yet or in the doc @alejandrox1 mentioned above.

That other issue says:

this is one of the trickier items actually, since CAPI does not support Windows.

neolit123 commented 5 years ago

@pjh

since CAPI does not support Windows.

Not yet; this support is planned for the future.

markjacksonfishing commented 5 years ago

@spiffxp @timothysc @justinsb @dims @mariantalla Bug triage for 1.17 here with a gentle reminder that code freeze for this release is on November 18. Is this issue still intended for 1.17?

alejandrox1 commented 5 years ago

/milestone v1.18

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

detiber commented 4 years ago

/lifecycle frozen

jtslear commented 4 years ago

Hello, Bug Triage team here for the 1.18 release. It appears that the linked issue https://github.com/kubernetes/kubernetes/issues/82532 has been bumped to milestone 1.19. I'm proceeding to do the same on this issue. Should anyone disagree with this, please feel free to re-assign.

/milestone v1.19

spiffxp commented 4 years ago

/milestone clear

I don't think we're committing to this for v1.19.

neolit123 commented 4 years ago

PRs to move the conformance image: https://github.com/kubernetes/kubernetes/pull/93937 https://github.com/kubernetes/test-infra/pull/18799

cc @dims @BenTheElder

BenTheElder commented 4 years ago

commented already 🙃

neolit123 commented 4 years ago

/sig cloud-provider
/area provider/gcp

cheftako commented 3 years ago

/assign
/triage accepted

cc @jpbetz

jpbetz commented 3 years ago

The etcd images are not kubernetes specific. The main thing they do is automatically upgrade etcd one minor at a time, per etcd administration guidelines, when the cluster administrator upgrades etcd to a new version. E.g. if the cluster administrator upgrades to etcd 3.4 and the cluster is currently on 3.1, it upgrades first to 3.2 and then to 3.3 before upgrading to 3.4.
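To make the stepwise behavior described above concrete, here is a minimal sketch (not the actual image's code) of computing the intermediate minor versions such an upgrade walks through:

```go
// Sketch of the "one minor at a time" upgrade path described above; the
// function and its use are illustrative, not the real image's code.
package main

import "fmt"

// upgradePath returns the 3.x minor versions to step through when moving
// from the current minor to the target minor (exclusive of current,
// inclusive of target).
func upgradePath(currentMinor, targetMinor int) []string {
	var path []string
	for m := currentMinor + 1; m <= targetMinor; m++ {
		path = append(path, fmt.Sprintf("3.%d", m))
	}
	return path
}

func main() {
	// Cluster is on etcd 3.1, administrator requests 3.4:
	// prints [3.2 3.3 3.4] -- the image would run each version in turn.
	fmt.Println(upgradePath(1, 4))
}
```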

So I'm tempted to ask if the etcd community would be willing to own this. If the etcd community was okay with this, the main issue to solve is that the images are published to the k8s.io container repo.

cc @gyuho, @ptabor, @wenjiaswe

gyuho commented 3 years ago

The etcd images are not kubernetes specific

Yeah, I think we can implement some mechanism to update docs and container images in the registry for Kubernetes as part of the etcd release process.

@ptabor @wenjiaswe Any thoughts?

justinsb commented 3 years ago

We could also do this in the etcdadm project, as that is a kubernetes-sigs project and is thus set up to push to k8s repos / follow k8s governance etc.

ptabor commented 3 years ago

The north star, IMHO, is that we should get rid of the process of 'updating' etcd by running consecutive minor versions of etcd. Instead, etcd should have a dedicated tool, e.g. etcdstoragectl migrate (I would call it etcdadm, but the name is taken ;) ), that knows the DB changes between different versions and explicitly 'fixes' the database, rather than depending on running full 'historical' etcd servers for an undefined duration.

There are multiple benefits of this:

Maybe it's naive, but within the scope of etcd-v3 minor versions, during work on the etcd storage-format documentation I haven't spotted any significant differences in the format. It's more like 'set a default if a given field is missing', on the order of five such rules.
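As an illustration of the idea (the tool and every rule below are hypothetical; etcdstoragectl does not exist), such a migration tool could keep a registry of per-transition fix-up rules and apply them directly to the data, instead of running each intermediate etcd server:

```go
// Illustrative sketch only: a registry of per-minor-version fix-up rules of
// the kind described above. Names and rule contents are invented.
package main

import "fmt"

// fixup rewrites the representation of one record in place.
type fixup func(record map[string]string)

// fixups maps "from -> to" minor-version transitions to the rules that make
// data written by the older version consistent with the newer one.
var fixups = map[string][]fixup{
	"3.1->3.2": {
		func(r map[string]string) {
			// Hypothetical rule: default a field the older version omitted.
			if _, ok := r["lease"]; !ok {
				r["lease"] = "0"
			}
		},
	},
	"3.2->3.3": {}, // no format changes for this transition (illustrative)
}

// migrate applies every transition between two minors to a single record.
func migrate(record map[string]string, fromMinor, toMinor int) {
	for m := fromMinor; m < toMinor; m++ {
		key := fmt.Sprintf("3.%d->3.%d", m, m+1)
		for _, f := range fixups[key] {
			f(record)
		}
	}
}

func main() {
	rec := map[string]string{"key": "/registry/pods/default/foo"}
	migrate(rec, 1, 3)
	fmt.Println(rec) // map[key:/registry/pods/default/foo lease:0]
}
```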

jpbetz commented 3 years ago

The northstar IMHO is that we should get rid of process of 'updating' etcd by running consecutive minor versions of etcd

I'm a big fan of this approach. If we can get support for this on the etcd side, then the solution w.r.t. the /cluster directory might just be to drop these images and use upstream etcd.

Curious what others think.

gyuho commented 3 years ago

w.r.t. the /cluster directory might just be to drop these images and use upstream etcd

Yeah, I always wonder why we even need two separate etcd registries. I am open to dropping container support from our release process entirely and letting downstream projects build their own, or merging onto the Kubernetes-community-managed registry.

aojea commented 3 years ago

I think what has to be done is to have an alternative that is as reliable as /cluster or better. I'm not saying that /cluster is great, but I haven't seen anything better so far ... and this is very easy to measure with testgrid.
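As a rough illustration of measuring this with testgrid, the sketch below fetches a dashboard's summary and prints each tab's overall status; the summary endpoint and the overall_status field reflect my understanding of testgrid's JSON and should be verified before relying on them:

```go
// Rough sketch: fetch a testgrid dashboard summary and print per-tab status.
// The endpoint and field names are assumptions to be verified.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

type tabSummary struct {
	OverallStatus string `json:"overall_status"` // e.g. PASSING / FLAKY / FAILING
}

func main() {
	dashboard := "sig-release-master-blocking" // example dashboard
	resp, err := http.Get("https://testgrid.k8s.io/" + dashboard + "/summary")
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var tabs map[string]tabSummary
	if err := json.NewDecoder(resp.Body).Decode(&tabs); err != nil {
		fmt.Fprintln(os.Stderr, "decode failed:", err)
		os.Exit(1)
	}
	for name, t := range tabs {
		fmt.Printf("%-50s %s\n", name, t.OverallStatus)
	}
}
```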

k8s-triage-robot commented 1 year ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

dims commented 1 year ago

Sketching up requirements: https://hackmd.io/pw1kt61lRM-wZh5MU1G_QQ

Here's a snapshot as of Feb 27, just to be safe; please see all the comments on the hackmd.

# Requirements for cluster/kube-up.sh replacement

We have a lot of choices in the community, like Kops/Kubespray/CAPA etc., but all of them take too long, do not support all test scenarios, or make it hard to inject freshly built code. Hence the search for a new replacement. As you can see, this is a [long-standing issue](https://github.com/kubernetes/kubernetes/issues/78995) and one that is not open to "easy" fixes.

- Must support 80% of jobs today (revisit all the environment flags we use to control different aspects of the cluster to verify)
- All nodes must run on a VM to replicate how it is being done today (we already have `kind` to cover things running inside a container)
- Must be able to deploy the cluster built directly from either a PR or the tip of a branch (to cover both presubmit and periodic jobs)
- Must use `kubeadm` to bootstrap both the control plane node and the worker nodes
- `kubeadm` needs systemd for running `kubelet`, so the images deployed should use systemd
- Must have a mechanical way to translate existing jobs to this new harness
- Should have a minimum of moving parts to ensure we are not chasing flakes and digging into things we don't need to
- Should have a clean path (UX) to debug things like we have today (logs from VM/cloudinit/systemd/kubelet/containers should tell the whole story)
- Should work on GCP and AWS, with Azure closely behind (so should be pluggable for other clouds)
- Should not be a hack; we need this to sustain us for at least 5-8 years.
- Should work with things we already have like prow+boskos
- Adding a new service that is always on (like prow/boskos) should be well thought out as it is not trivial to debug yet another thing. Any such solution must be rock solid.
- Should be able to test kubeadm, external cloud providers (CPI), and storage drivers (CSI).
- Should support switching between containerd and CRI-O, as well as cgroup v1/v2

We would need this solution to run in parallel for a full release cycle and offer equal or better results. This will need active owners who can take care of it for the long haul; worst case, we will drop it like a hot potato, however elegant or technologically superior the solution is.

Forward-looking ideas that can be explored:
- Making whatever changes are needed to those tools to boot faster / allow injection of code.

BenTheElder commented 1 year ago

  • Must use kubeadm to bootstrap both the control plane node and the worker nodes

I'm not sure this should be a hard requirement. As much appreciation as I have for kubeadm, it does a relatively small portion of cluster bootstrapping, and we often need to test things it does not do / reconfigure things it does not directly support configuring anyhow. I'd suggest using kubeadm if we staffed writing something new from scratch, but I don't think we should preclude pre-existing options based on it.

We have a lot of choices in the community, like Kops/Kubespray/CAPA etc., but all of them take too long, do not support all test scenarios, or make it hard to inject freshly built code. Hence the search for a new replacement. As you can see, this is a long-standing issue and one that is not open to "easy" fixes.

I'm also not sure the statement about kOps in particular is accurate; we actually used to run kOps on Kubernetes PRs as recently as 2019 and it worked fine. We lost this due to the AWS bill going unpaid somewhere between Amazon and the CNCF, leading to the account being terminated and no ability to run the job. https://github.com/kubernetes/kubernetes/issues/73444#issuecomment-458397963

We didn't spin up kops-GCE instead because we already had cluster-up and there was a strong push for Cluster-API, but kops-GCE works and passes tests. It's also relatively mature. https://testgrid.k8s.io/kops-gce#kops-gce-latest

kubespray does not support kubernetes @ HEAD and CAPA is not passing tests / adds overhead like the local cluster, AIUI.

kops is already supported by kubetest + kubetest2, boskos, etc. and is passing tests on GCE + AWS. In theory it supports other providers but I don't think we have any CI there yet.

  • Must have a mechanical way to translate existing jobs to this new harness

This seems impractical. Most of the tricky part here is jobs setting kube-up environment variables that reconfigure the cluster. A human will have to dive into the scripts, see what each env var ultimately does, and then remap it to any other tool used. Unless that tool is also written in bash, we won't have drop-in identical behavior; a lot of it is expanding variables inside bash-generated configuration files.

We could target the most important jobs and work from there.
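A sketch of the kind of hand-curated mapping this implies is below; the env var names and target settings are illustrative examples only, not a real or complete mapping:

```go
// Sketch of a hand-curated translation table from kube-up env vars to a
// hypothetical replacement tool's settings. Names below are illustrative.
package main

import (
	"fmt"
	"os"
)

// translation records where one kube-up knob ends up in the new harness,
// plus caveats discovered by reading the bash scripts.
type translation struct {
	target string // hypothetical setting in the replacement tool
	note   string
}

var knownKnobs = map[string]translation{
	"NUM_NODES":          {target: "nodeCount"},
	"KUBE_FEATURE_GATES": {target: "featureGates", note: "expanded into several generated config files by kube-up"},
	"KUBE_GCE_ZONE":      {target: "zone"},
}

func main() {
	// Translate whatever kube-up variables a job sets in its environment.
	for _, name := range []string{"NUM_NODES", "KUBE_FEATURE_GATES", "SOME_OTHER_KNOB"} {
		val, set := os.LookupEnv(name)
		if !set {
			continue
		}
		t, ok := knownKnobs[name]
		if !ok {
			fmt.Printf("%s=%s: no known equivalent, needs a human to read the scripts\n", name, val)
			continue
		}
		fmt.Printf("%s=%s -> %s (%s)\n", name, val, t.target, t.note)
	}
}
```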

aojea commented 1 year ago

@upodroid brought up a related topic today; he is working on adding ARM64 support to the CI and modifying the cluster scripts to do that: https://github.com/kubernetes/kubernetes/pull/120144

I can see kops is already running arm64 (@justinsb): https://testgrid.k8s.io/google-aws#kops-aws-arm64-ci. I would rather these new CI jobs use a new tool instead of continuing to build on the cluster folder, if possible, or we'll never break this loop.

upodroid commented 1 year ago

I'll create the arm64 equivalent of https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd on GCE using kops and then have a conversation about migrating the amd64 one to kops.

neolit123 commented 1 year ago

I'll create the arm64 equivalent of https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd on GCE using kops and then have a conversation about migrating the amd64 one to kops.

For kops-related questions we have #kops-dev on k8s slack.

dims commented 10 months ago

@upodroid how close are we to doing this? 1.31 perhaps?