kubernetes / kubeadm

Aggregator for issues filed against kubeadm

add dry run e2e tests #2653

Open neolit123 opened 2 years ago

neolit123 commented 2 years ago

kubeadm is currently missing integration / e2e tests for --dry-run. this means if we happen to break our dry-run support for a particular command (e.g. init), we will not know about it until users report it to us.

xref https://github.com/kubernetes/kubeadm/issues/2649

kubeadm has integration tests here: https://github.com/kubernetes/kubernetes/tree/master/cmd/kubeadm/test/cmd. these tests execute a precompiled kubeadm binary to perform some checks and look for exit status 0.

we can use the same method for the init, join and reset tests with --dry-run, because dry-run is reentrant.

but we cannot use this method for upgrade * commands, because the --dry-run for upgrade expects an existing cluster.
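for illustration, a minimal sketch of such an exit-status check, in the spirit of the existing cmd/kubeadm/test/cmd tests (the binary path and the sub-commands covered here are placeholders, not the actual test code):

```go
// dryrun_test.go - a sketch of running a precompiled kubeadm binary with
// --dry-run and asserting exit status 0, similar in spirit to the existing
// cmd/kubeadm/test/cmd integration tests. Path and sub-commands are
// illustrative placeholders.
package dryrun

import (
	"os/exec"
	"testing"
)

// placeholder path to a precompiled kubeadm binary under test
const kubeadmPath = "/usr/bin/kubeadm"

func TestDryRunExitsZero(t *testing.T) {
	cases := [][]string{
		{"init", "--dry-run"},
		{"reset", "--dry-run", "--force"},
	}
	for _, args := range cases {
		out, err := exec.Command(kubeadmPath, args...).CombinedOutput()
		if err != nil {
			// a non-zero exit status (or failing to start the binary) fails the test
			t.Fatalf("kubeadm %v: %v\noutput:\n%s", args, err, out)
		}
	}
}
```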

to have everything in the same place we can add dry-run as part of kinder e2e test workflow: https://github.com/kubernetes/kubeadm/tree/main/kinder/ci/workflows

the workflow can look like the following:

tasks:

SataQiu commented 2 years ago

@neolit123 It seems that we cannot run kubeadm join --dry-run without an actual Kubernetes control-plane. :sweat_smile:

The join phase will try to fetch the cluster-info ConfigMap even in dry-run mode.

I0412 10:01:39.066015     149 join.go:530] [preflight] Discovering cluster-info
I0412 10:01:39.067243     149 token.go:80] [discovery] Created cluster-info discovery client, requesting info from "127.0.0.1:6443"
I0412 10:01:39.101380     149 round_trippers.go:553] GET https://127.0.0.1:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s  in 10 milliseconds
I0412 10:01:39.104282     149 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://127.0.0.1:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 127.0.0.1:6443: connect: connection refused
I0412 10:01:45.026633     149 round_trippers.go:553] GET https://127.0.0.1:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s  in 5 milliseconds
I0412 10:01:45.027445     149 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://127.0.0.1:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 127.0.0.1:6443: connect: connection refused

Therefore, we need at least one worker node to complete the dry-run tests.

neolit123 commented 2 years ago

Hm, i think we should fix that and use a fake CM or the dry-run client... in dry-run mode other API calls work like that. But if using the dry-run client it means we will probably have to skip validation of the CM as well 🤔

If you prefer we can merge the initial test job without the join test, but it seems we have to fix it in k/k eventually. EDIT: Or maybe ... join does need a control plane and it will fail later even if we use fake cluster info?
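for illustration, here is a rough sketch of the "fake CM" idea using client-go's fake clientset (this is not kubeadm's actual dry-run client, and the ConfigMap contents are placeholders):

```go
// a sketch of serving the cluster-info ConfigMap from a fake client instead
// of dialing a real API server. kubeadm's real dry-run client is a different
// implementation; names and contents here are placeholders.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func main() {
	// pre-seed the fake clientset with a cluster-info ConfigMap so that the
	// discovery step has something to read without a control plane.
	client := fake.NewSimpleClientset(&corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "cluster-info",
			Namespace: "kube-public",
		},
		Data: map[string]string{
			// in a real cluster this holds a kubeconfig blob; placeholder here.
			"kubeconfig": "apiVersion: v1\nkind: Config\n",
		},
	})

	cm, err := client.CoreV1().ConfigMaps("kube-public").Get(
		context.TODO(), "cluster-info", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("fetched cluster-info from the fake client:\n%s", cm.Data["kubeconfig"])
}
```

as noted above, going this route likely also means skipping the usual validation of the fetched ConfigMap.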

SataQiu commented 2 years ago

Emm... It doesn't look like the kubeadm join --dry-run command is using a fake client at all. https://github.com/kubernetes/kubernetes/blob/7380fc735aca591325ae1fabf8dab194b40367de/cmd/kubeadm/app/cmd/join.go#L551-L563

https://github.com/kubernetes/kubernetes/blob/7380fc735aca591325ae1fabf8dab194b40367de/cmd/kubeadm/app/discovery/token/token.go#L53

https://github.com/kubernetes/kubernetes/blob/7380fc735aca591325ae1fabf8dab194b40367de/cmd/kubeadm/app/discovery/token/token.go#L204-L209

...

Maybe it requires a major refactor to use a fake client for kubeadm join --dry-run :(

neolit123 commented 2 years ago

i think the refactor is doable. we need to rename the "init" dry-run client to be a generic one and use it for "join" as well. it's probably not that much work, but i haven't looked at all the details.

we can merge the current PR, but keep this issue open until we can do that in k/k after code freeze for 1.24.

neolit123 commented 2 years ago

test job is passing https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-dryrun-latest

neolit123 commented 2 years ago

@SataQiu looks like the current e2e tests are flaky.

https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-dryrun-latest

the error is:

[preflight] Some fatal errors occurred: [ERROR CRI]: container runtime is not running: output: time="2022-04-20T18:56:52Z" level=fatal msg="connect: connect endpoint 'unix:///var/run/containerd/containerd.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded" , error: exit status 1

my guess is that we are running kubeadm ... inside the nodes before the container runtime has started. i can't remember if kinder's kubeadm-init action has a "wait for CRI" of sorts, but likely it does. one option would be to add e.g. sleep 10 before the first kubeadm init in task-03-init-dryrun.

EDIT: unclear if sleep is in the node images, possibly yes.
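for illustration, a "wait for CRI" check (instead of a fixed sleep) could look roughly like this in Go; a sketch assuming the default containerd socket path, not kinder's actual implementation:

```go
// a sketch of waiting for the container runtime socket to accept connections
// before running kubeadm. the socket path and timeout are assumptions.
package main

import (
	"fmt"
	"net"
	"time"
)

func waitForCRI(socketPath string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		conn, err := net.DialTimeout("unix", socketPath, 2*time.Second)
		if err == nil {
			conn.Close()
			return nil // the runtime socket is up and accepting connections
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("container runtime socket %q not ready after %v", socketPath, timeout)
}

func main() {
	if err := waitForCRI("/var/run/containerd/containerd.sock", 60*time.Second); err != nil {
		panic(err)
	}
	fmt.Println("container runtime is up")
}
```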

alternatively this could be a weird bug where containerd in the nodes is simply refusing to start for some reason, despite:

I0420 20:57:18.894578 107 initconfiguration.go:117] detected and using CRI socket: unix:///var/run/containerd/containerd.sock

SataQiu commented 2 years ago

> @SataQiu looks like the current e2e tests are flaky. [...]

It looks like the kubeadm-kinder-dryrun job was deleted by https://github.com/kubernetes/test-infra/commit/c69405228b92b7037662b1f7e89db34350f5832f :sweat:

neolit123 commented 2 years ago

Oh, looks like @RA489's PR deleted it in the 1.24 updates and i didn't see that.

Can you please send it again?

SataQiu commented 2 years ago

> Can you please send it again?

Sure!

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

RA489 commented 2 years ago

/remove-lifecycle stale

neolit123 commented 2 years ago

Yeah, we can skip testing join in this PR but keep the related issue open. Once k/k code freeze is lifted we should make the change in k/k and add the join test. I don't think it's that complicated, but I haven't looked in more detail.

neolit123 commented 2 days ago

reopening to update e2e test after https://github.com/kubernetes/kubernetes/pull/126776