@kubernetes-sigs/cluster-api-release-team These flakes are very disruptive to the test signal right now. It would be great if someone could prioritize investigating and fixing them ahead of the releases.
/triage accepted
/help
@killianmuldoon: This request has been marked as needing help from a contributor.
Please ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
Note that each branch has a different number of variants of this test, enumerated below, which may be responsible for some unevenness in the signal:
- release-1.4: 7
- release-1.5: 6
- main: 5

I am looking into this one.
I will be pairing up with @adilGhaffarDev on this one since it is happening more frequently.
/assign @adilGhaffarDev
Adding a bit more explanation regarding the failures. We have three failures in clusterctl upgrade:

1. `exec.ExitError`: this one happens at "Applying the cluster template yaml to the cluster". I opened a PR against release-1.4 and changed `KubectlApply` to a controller-runtime `Create`, also adding handling to ignore `AlreadyExists` errors, as @killianmuldoon suggested. So far I haven't seen this failure on my 1.4 PR (ref: https://github.com/kubernetes-sigs/cluster-api/pull/9731). It still fails, but not at apply, so I think changing `kubectlApply` to `Create` and ignoring `AlreadyExists` fixes this one. I will create a PR on `main` too (a sketch of the change follows this list).

2. `failed to discovery ownerGraph types`: this one happens at "Running Post-upgrade steps against the management cluster". I have looked into the logs and I am seeing this error:
{"ts":1700405055471.4797,"caller":"builder/webhook.go:184","msg":"controller-runtime/builder: Conversion webhook enabled","v":0,"GVK":"infrastructure.cluster.x-k8s.io/v1beta1, Kind=DockerClusterTemplate"}
{"ts":1700405055471.7637,"caller":"builder/webhook.go:139","msg":"controller-runtime/builder: skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","v":0,"GVK":"infrastructure.cluster.x-k8s.io/v1beta1, Kind=DockerMachinePool"}
{"ts":1700405055472.0557,"caller":"builder/webhook.go:168","msg":"controller-runtime/builder: skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called","v":0,"GVK":"infrastructure.cluster.x-k8s.io/v1beta1, Kind=DockerMachinePool"}
It might be something related to DockerMachinePool; we might need to backport the recent fixes related to DockerMachinePool. Another interesting thing: I don't see this failure on `main`, it is only happening on v1.4 and v1.5.
3. `failed to find releases`: this one happens at `clusterctl init`. I am still looking into this one.
{"ts":1700405055471.4797,"caller":"builder/webhook.go:184","msg":"controller-runtime/builder: Conversion webhook enabled","v":0,"GVK":"infrastructure.cluster.x-k8s.io/v1beta1, Kind=DockerClusterTemplate"} {"ts":1700405055471.7637,"caller":"builder/webhook.go:139","msg":"controller-runtime/builder: skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","v":0,"GVK":"infrastructure.cluster.x-k8s.io/v1beta1, Kind=DockerMachinePool"} {"ts":1700405055472.0557,"caller":"builder/webhook.go:168","msg":"controller-runtime/builder: skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called","v":0,"GVK":"infrastructure.cluster.x-k8s.io/v1beta1, Kind=DockerMachinePool"}
This is not an error. These are just info messages that surface that we are calling `ctrl.NewWebhookManagedBy(mgr).For(c).Complete()` for an object that has no validating or defaulting webhooks (we still get the same on main, as we should).
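For illustration, the registration pattern those info messages come from looks roughly like this (a sketch; `mgr`, `setupLog`, and the `infrav1` import path are assumed names, not copied from the CAPD main.go):

```go
import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Registering a type that implements neither admission.Defaulter nor
// admission.Validator is valid: the builder just emits the
// "skip registering a mutating/validating webhook" info messages seen above,
// and still wires up the conversion webhook if the type is convertible.
if err := ctrl.NewWebhookManagedBy(mgr).
	For(&infrav1.DockerMachinePool{}). // no Default()/Validate*() implemented
	Complete(); err != nil {
	setupLog.Error(err, "unable to create webhook", "webhook", "DockerMachinePool")
	os.Exit(1)
}
```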
Update on this issue. I am not seeing the following flakes anymore:
- `exec.ExitError`
- `failed to find releases`

The `failed to discovery ownerGraph types` flake is still happening, but only when upgrading from (v0.4=>current).
@adilGhaffarDev So the clusterctl upgrade test is 100% stable apart from "failed to discovery ownerGraph types flake is still happening but only when upgrading from (v0.4=>current)"?
The link is not showing anything for me.
> @adilGhaffarDev So the clusterctl upgrade test is 100% stable apart from "failed to discovery ownerGraph types flake is still happening but only when upgrading from (v0.4=>current)"?

Sorry for the bad link, here is a more persistent link: https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&test=clusterctl%20upgrades%20&xjob=.*-provider-.*
Maybe not 100% stable; there are very minor flakes that happen sometimes. But `failed to find releases` and `exec.ExitError` are not happening anymore.
@adilGhaffarDev `exec.ExitError` does not occur anymore because I improved the error output here: https://github.com/kubernetes-sigs/cluster-api/blob/adce02023c22d8681eb4ff5e0ae8df9eee5b8420/test/framework/cluster_proxy.go#L258 (https://github.com/kubernetes-sigs/cluster-api/pull/9737/files)
That doesn't mean the underlying errors are fixed, unfortunately.
> @adilGhaffarDev `exec.ExitError` does not occur anymore because I improved the error output here:

`exec.ExitError` was happening at the step "INFO: Applying the cluster template yaml to the cluster". I don't see any failure happening at the same step where `exec.ExitError` was happening. Do you see any failure on triage that is related to that? I am unable to find it.
Sounds good! Nope I didn't see any. Just wanted to clarify that the errors would look different now. But if the same step works now, it should be fine.
Just not sure what changed as I don't remember fixing/changing anything there.
> Just not sure what changed as I don't remember fixing/changing anything there.
This is the new error that was happening after your PR; it seems to have stopped happening after 07-12-2023: https://storage.googleapis.com/k8s-triage/index.html?date=2023-12-10&job=.*-cluster-api-.*&xjob=.*-provider-.*#6710a9c85a9bbdb4d278
The only PR merged on 07-12-2023 that might have fixed this seems to be https://github.com/kubernetes-sigs/cluster-api/pull/9819, but I am not sure.
So this is the error we get there:

```
[FAILED] Expected success, but got an error:
    <*errors.fundamental | 0xc000912948>:
    exit status 1: stderr:
    {
        msg: "exit status 1: stderr: ",
        stack: [0x1f3507a, 0x2010aa2, 0x84e4db, 0x862a98, 0x4725a1],
    }
```
This is the corresponding output (under "open stdout"):

```
Running kubectl apply --kubeconfig /tmp/e2e-kubeconfig3133952171 -f -
stderr:
Unable to connect to the server: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
stdout:
```
So it looks like the mgmt cluster was not reachable.
Thx for digging into this. I would say let's ignore this error for now as it's not occurring anymore. Good enough for me to know the issue stopped happening (I assumed it might still be there and just look different).
A little more explanation on the clusterctl upgrade failure. Now we are seeing only one flake, when upgrading from 0.4->1.4 or 0.4->1.5, as mentioned before. It's failing with the following error:

```
failed to discovery ownerGraph types: action failed after 9 attempts: failed to list "infrastructure.cluster.x-k8s.io/v1beta1, Kind=DockerCluster" resources: conversion webhook for infrastructure.cluster.x-k8s.io/v1alpha4, Kind=DockerCluster failed: Post "https://capd-webhook-service.capd-system.svc:443/convert?timeout=30s": x509: certificate signed by unknown authority
```
This failure happens in the post-upgrade step where we are calling `ValidateOwnerReferencesOnUpdate`. We have this post-upgrade step only when upgrading from v1alpha to v1beta. I believe @killianmuldoon you have worked on it; can you check this when you get time?
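To make the failure mode concrete: the API server validates the conversion webhook's serving certificate against the caBundle set on the CRD's conversion config, so an old caBundle combined with a re-issued serving certificate produces exactly this x509 error. A hedged debugging sketch (not part of the test suite; the CRD name is the CAPD one from the error above, everything else is illustrative):

```go
import (
	"context"
	"fmt"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// dumpConversionCABundle prints the size of the CA bundle currently set on the
// CRD's conversion webhook config; during the flake window this is the stale
// CA that no longer matches the webhook's re-issued serving certificate.
// Assumes the apiextensions types are registered on the client's scheme.
func dumpConversionCABundle(ctx context.Context, c client.Client) error {
	crd := &apiextensionsv1.CustomResourceDefinition{}
	if err := c.Get(ctx, client.ObjectKey{Name: "dockerclusters.infrastructure.cluster.x-k8s.io"}, crd); err != nil {
		return err
	}
	if conv := crd.Spec.Conversion; conv != nil && conv.Webhook != nil && conv.Webhook.ClientConfig != nil {
		fmt.Printf("conversion caBundle bytes: %d\n", len(conv.Webhook.ClientConfig.CABundle))
	}
	return nil
}
```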
🤔 : may be helpful to collect cert-manager resources + logs to analyse this. Or is this locally reproducible?
I haven't been able to reproduce it locally. I have run it multiple times.
Some observations via #10193:

```
I0223 19:20:31.564028 1 controller.go:162] "re-queuing item due to optimistic locking on resource" logger="cert-manager.certificates-key-manager" key="capd-system/capd-serving-cert" error="Operation cannot be fulfilled on certificates.cert-manager.io \"capd-serving-cert\": the object has been modified; please apply your changes to the latest version and try again"
```
Maybe related cert-manager issue: https://github.com/cert-manager/cert-manager/issues/6464
Edit: updated #10193 to now hopefully collect the cert-manager CRs. Maybe we can implement something which waits for the certificates to be ready, or similar; see the sketch below.
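Something along these lines could work (a sketch only, assuming cert-manager's Go API types are on the client's scheme; the helper name and timeouts are illustrative):

```go
import (
	"context"
	"time"

	certmanagerv1 "github.com/cert-manager/cert-manager/pkg/apis/certmanager/v1"
	cmmeta "github.com/cert-manager/cert-manager/pkg/apis/meta/v1"
	. "github.com/onsi/gomega"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForCertificateReady blocks until the given cert-manager Certificate
// reports Ready=True, so the test does not proceed while the serving
// certificate / CA injection is still in flight.
func waitForCertificateReady(ctx context.Context, c client.Client, key client.ObjectKey) {
	Eventually(func() bool {
		cert := &certmanagerv1.Certificate{}
		if err := c.Get(ctx, key, cert); err != nil {
			return false
		}
		for _, cond := range cert.Status.Conditions {
			if cond.Type == certmanagerv1.CertificateConditionReady {
				return cond.Status == cmmeta.ConditionTrue
			}
		}
		return false
	}, 3*time.Minute, 10*time.Second).Should(BeTrue(), "Certificate %s should become Ready", key.String())
}
```

For CAPD, the Certificate in question would be the capd-system/capd-serving-cert seen in the log line above.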
Maybe it's also worth taking a look at whether there could be an improvement to the `clusterctl upgrade` flow. The question is whether the upgrade re-issues the `Certificate` and `CertificateSigningRequest`, so that the CRD contains an old CA and we need to wait for cert-manager for the new Certificate to be issued and for the ca-injector to inject the new certificate.

The fix for the `failed to discovery ownerGraph types` error is here: this should catch all `x509: certificate signed by unknown authority` / `failed to discovery ownerGraph types` errors which occur in `clusterctl_upgrade` tests related to conversion webhooks. It should get cherry-picked to all supported branches.
Here is a link for all x509 errors, to check occurrences of the flake and confirm it is fixed.
@chrischdi thank you for working on it; we are not seeing this flake much anymore, nice work. On k8s triage I can see that the ownerGraph flake is now only happening in (v0.4=>v1.6=>current) tests; the other flakes seem to be fixed or are much less flaky.
ref: https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&xjob=.*-provider-.*#4f4c67c927112191922f
Note: this is a different flake, not directly ownergraph but similar. It happens at a different place though.
We could probably also ignore the x509 errors here and ensure that the last try in `Consistently` succeeded (by storing and checking the last error outside of `Consistently`).
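A minimal sketch of that idea; `mgmtClient`, `ctx`, and `namespace` are illustrative stand-ins for the e2e framework's management cluster client and fixtures, not the actual test code:

```go
// Tolerate the known transient x509 error inside Consistently, but record the
// last result and require the final attempt to have actually succeeded.
var lastErr error
Consistently(func() error {
	lastErr = mgmtClient.List(ctx, &clusterv1.ClusterList{}, client.InNamespace(namespace))
	if lastErr != nil && strings.Contains(lastErr.Error(), "x509: certificate signed by unknown authority") {
		// Known transient error while the webhook CA is rotated: ignore it here.
		return nil
	}
	return lastErr
}, 30*time.Second, 5*time.Second).Should(Succeed())
// The last attempt must have succeeded, not merely been tolerated.
Expect(lastErr).ToNot(HaveOccurred(), "the last List call should have succeeded")
```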
We could also add an Eventually before, to wait until the List call works, and then keep the Consistently the same (a sketch follows below).
Btw, thx folks, really nice work on this issue!
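That variant could look roughly like this (a sketch with the same illustrative names as above):

```go
// First wait until List works at all, covering the window where the
// conversion webhook certificate is being re-issued and injected...
Eventually(func() error {
	return mgmtClient.List(ctx, &clusterv1.ClusterList{}, client.InNamespace(namespace))
}, 3*time.Minute, 10*time.Second).Should(Succeed(), "List should eventually succeed once the new CA is injected")

// ...then keep the existing Consistently assertion unchanged.
Consistently(func() error {
	return mgmtClient.List(ctx, &clusterv1.ClusterList{}, client.InNamespace(namespace))
}, 30*time.Second, 5*time.Second).Should(Succeed(), "List should keep succeeding")
```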
> We could also add an Eventually before to wait until the List call works and then keep the Consistently the same
I will open a PR with your suggestion
> (v0.4=>v1.6=>current) tests

I will try to reproduce it locally.

/priority important-soon
I implemented a fix at #10469 which should fix the situation.
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Deprioritize it with /priority important-longterm or /priority backlog
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted
Let's also consider closing this one and opening a new one with the current state.
> Let's also consider closing this one and opening a new one with the current state.
Agreed, I think a new issue would be helpful; the incoming release CI team can prioritize this. @chandankumar4 @adilGhaffarDev @Sunnatillo is there a summary of where we stand? If not, I'll take a shot at refreshing the investigation and can open the new issue.
Seems like we do have flakes on main with a few different patterns shown for today: https://storage.googleapis.com/k8s-triage/index.html?date=2024-08-17&job=.*periodic-cluster-api-e2e.*&test=.*clusterctl%20upgrades.*
From my observation, there are two main flakes occurring in the clusterctl upgrade tests:

```
[FAILED] Expected success, but got an error:
    <errors.aggregate | len:3, cap:4>:
    [Internal error occurred: failed calling webhook "default.dockercluster.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capd-webhook-service.capd-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta1-dockercluster?timeout=10s": dial tcp 10.96.44.105:443: connect: connection refused, Internal error occurred: failed calling webhook "validation.dockermachinetemplate.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capd-webhook-service.capd-system.svc:443/validate-infrastructure-cluster-x-k8s-io-v1beta1-dockermachinetemplate?timeout=10s": dial tcp 10.96.44.105:443: connect: connection refused]
```

and

```
[FAILED] Timed out after 300.001s.
Timed out waiting for all Machines to exist
Expected
    <int64>: 0
to equal
    <int64>: 2
In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:500
```
The first flake happens more often, and when upgrading from the latest versions; the second flake happens mostly when upgrading from older releases.
I agree that we should close this issue and open a new one for each flake separately.
@chrischdi was looking into some of these issues and is about to write an update here. Let's wait for that before closing this issue.
Sorry folks, took longer than expected.
According to the aggregated failures of the last two weeks, we still have some flakiness in our clusterctl upgrade tests.
But it looks like none of them are the ones in the initial post:
- 36 failures: Timed out waiting for all Machines to exist
- 16 failures: Failed to create kind cluster
- 14 failures: Internal error occurred: failed calling webhook [...] connect: connection refused
- 7 failures: x509: certificate signed by unknown authority
- 5 failures: Timed out waiting for Machine Deployment clusterctl-upgrade/clusterctl-upgrade-workload-... to have 2 replicas
- 2 failures: Timed out waiting for Cluster clusterctl-upgrade/clusterctl-upgrade-workload-... to provision
Link to check if messages changed or we have new flakes on clusterctl upgrade tests: here
Thank you for putting this together @chrischdi. Do you mind if I copy-paste this refreshed summary into a new issue and close the current one?
Feel free to go ahead with that
Doesn't hurt to start with a clean slate to reduce confusion :)
/close
in favor of https://github.com/kubernetes-sigs/cluster-api/issues/11133
@cahillsf: Closing this issue.
The clusterctl upgrade tests have been significantly flaky in the last couple of weeks, with flakes occurring on main, release-1.4 and release-1.5.
The flakes are occurring across many forms of the clusterctl upgrade tests, including v0.4=>current, v1.3=>current and v1.0=>current.
The failures take a number of forms, including but not limited to:
- exec.ExitError: https://storage.googleapis.com/k8s-triage/index.html?date=2023-11-08&job=.*-cluster-api-.*&xjob=.*-provider-.*#f5ccd02ae151196a4bf1
- failed to find releases: https://storage.googleapis.com/k8s-triage/index.html?date=2023-11-08&job=.*-cluster-api-.*&test=.*clusterctl%20upgrades.*&xjob=.*-provider-.*#983e849a73bad197d73b
- failed to discovery ownerGraph types: https://storage.googleapis.com/k8s-triage/index.html?date=2023-11-08&job=.*-cluster-api-.*&test=.*clusterctl%20upgrades.*&xjob=.*-provider-.*#176363ebfcd19172c1ac

There's an overall triage for tests with clusterctl upgrades in the name here: https://storage.googleapis.com/k8s-triage/index.html?date=2023-11-08&job=.*-cluster-api-.*&test=.*clusterctl%20upgrades.*&xjob=.*-provider-.*
/kind flake