Open mnaser opened 1 year ago
I just noticed over here that it's actually trying to pull v1alpha6:
I1014 02:13:28.108463 1 reconcile_state.go:284] "Patching OpenStackCluster/kube-cmd33-4k46x" controller="topology/cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="magnum-system/kube-cmd33" namespace="magnum-system" name="kube-cmd33" reconcileID=2e8e2bc8-cc63-4b35-8c25-b83181bbd883 resource={Group:infrastructure.cluster.x-k8s.io Version:v1alpha6 Resource:OpenStackCluster} OpenStackCluster="magnum-system/kube-cmd33-4k46x"
After hours of troubleshooting I put up an issue and then figured it out: it was the version from the ClusterClass. It seems the mismatch happens when there's an old version in the ClusterClass and a new version installed on the cluster, so it constantly reconciles.
Kinda feels related to #9384, which I ran into when trying to troubleshoot.
Thanks for reporting this and following up with your troubleshooting.
Just so I'm clear - you have a v1alpha6 Infrastructure cluster referenced under ClusterClass .spec.infrastructure.ref? The result of this is continuous reconciliation of the Cluster.
Is there an error being returned, or is it just the Patching... line that gets repeated? Generally we wouldn't expect continuous reconciliation here, so there might be a bug there.
We could also possibly mitigate the core issue either by automatically updating the references in the ClusterClass to the latest storage version, which is what we do for Clusters and is the subject of #9384, or we could at least return a webhook warning when the version referenced is not the latest storage version.
As in #9384 I think the real solution to this issue will be to remove the apiVersion from these object references in a future version of CAPI, but that requires an API revision.
To follow up - I wasn't able to reproduce this simply in CAPD. I set the apiVersion to v1alpha4 and the Cluster came up without issue and there doesn't seem to be any runaway reconcile.
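For reference, and assuming the apiVersion was changed on the ClusterClass infrastructure template ref, the reference used for that repro attempt would look roughly like this trimmed sketch (names are illustrative):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: quick-start
spec:
  infrastructure:
    ref:
      # Deliberately older than the current CAPD storage version, to mimic the report.
      apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
      kind: DockerClusterTemplate
      name: quick-start-cluster
```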
It would be really useful to get the actual logs from this instance to get a better understanding of what might be going wrong and how to reproduce it in a test case.
/triage accepted
It looks like the APIServer does produce warnings, at least when the apiVersion is marked as deprecated, so at least we don't have to introduce that on the CAPI side.
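For context, those warnings come from the CRD itself: when a served version is marked deprecated (optionally with a deprecationWarning message), the API server emits a warning whenever that version is used. A minimal excerpt of a CRD versions list as a sketch (the message text is illustrative):

```yaml
spec:
  versions:
  - name: v1alpha6
    served: true
    storage: false
    deprecated: true
    # Optional; without it the API server emits a default deprecation warning.
    deprecationWarning: "infrastructure.cluster.x-k8s.io/v1alpha6 is deprecated; use v1alpha7"
  - name: v1alpha7
    served: true
    storage: true
```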
cc @sbueringer
Thanks for reporting this and following up with your troubleshooting.
Just so I'm clear - you have a v1alpha6 Infrastructure cluster referenced under ClusterClass .spec.infrastructure.ref? The result of this is continuous reconciliation of the Cluster.
Sorry if my report was a bit all over the place, let me try and tl;dr it:
- The ClusterClass references an OpenStackClusterTemplate with apiVersion: infrastructure.cluster.x-k8s.io/v1alpha6.
- So the topology controller generates an OpenStackCluster with an older apiVersion.
- The existing OpenStackCluster is at v1alpha7 (because the Cluster resource had infrastructureRef.apiVersion set to v1alpha7).
- The controller compares v1alpha6 (generated) vs v1alpha7 (retrieved): every time it patches with v1alpha6, it gets a v1alpha7 back in response, and this kept looping and looping non-stop.

The key thing here would be that the referenced template would have an older version than the one in the Cluster (which is automatically 'bumped' by CAPI when it sees a new version available).
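To make the mismatch concrete, here is a trimmed sketch; the Cluster and OpenStackCluster names are taken from the log above, while the ClusterClass and template names are hypothetical:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: kube-cmd33-class
  namespace: magnum-system
spec:
  infrastructure:
    ref:
      # Old version left behind in the ClusterClass.
      apiVersion: infrastructure.cluster.x-k8s.io/v1alpha6
      kind: OpenStackClusterTemplate
      name: kube-cmd33-template
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: kube-cmd33
  namespace: magnum-system
spec:
  infrastructureRef:
    # Automatically bumped by CAPI to the newest available apiVersion.
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha7
    kind: OpenStackCluster
    name: kube-cmd33-4k46x
```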
Hm. I'm wondering if there is something wrong in the conversion webhook of the OpenStack cluster.
The controller is not simply comparing v1alpha6 (generated) vs v1alpha7 (retrieved). It is running through a few SSA dry-runs and then it compares if re-applying generated would lead to a diff. In general I would have expected that this leads to no diffs as the v1alpha6 generated object should go through conversion, defaulting, ... . Obviously looks like it's not enough.
@mnaser Do you think there is any way that I can reproduce this locally? I'm thinking about deploying core CAPI + CAPO via our tilt env and then deploying a bunch of YAMLs for the OpenStack cluster. I think to reproduce this effect it's not relevant if the cluster actually comes up. WDYT, would it be possible to provide the YAMLs for the OpenStackCluster?
Q: Do you know if the OpenStack cluster controller is writing the OpenStack cluster object?
I'm not sure if I'll be able to debug this otherwise, this is one of the most complicated areas of Cluster API.
@sbueringer I think that @mdbooth might have better insights about this.
Otherwise, if you'd like, I can spin up an environment against our public cloud to save you time, if you don't easily have access to an OpenStack cloud (or I can provide credentials).
Something like this would be helpful. If possible, an env + the YAMLs you're using to hit this issue would be great. I probably won't get to it before KubeCon though.
I've been thinking about this since I first raised it a few months ago. As implemented, the apiVersion field is a status field, not a spec field, because it can't be specified. I think the simplest solution until we can fix it in an API bump would be to allow it to be unset. The controller will populate it anyway so it's still there for any consumers of it, but it doesn't break SSA for the client because there's no owner fight.
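Under that proposal, a ClusterClass infrastructure reference could be written without the apiVersion, roughly like this sketch (not valid with the current API; names are hypothetical):

```yaml
spec:
  infrastructure:
    ref:
      kind: OpenStackClusterTemplate
      name: example-template
      # apiVersion deliberately left unset by the client; the controller would
      # populate it, so there is no SSA field-ownership fight over the value.
```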
/priority important-soon
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- /triage accepted (org members only)
- /priority important-longterm or /priority backlog
- /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted
What steps did you take and what happened?
When using CAPO, I noticed that I had a cluster that was reconciling non-stop and eating up a ton of CPU. Upon further troubleshooting, I noticed that the reconciler doesn't seem to actually grab the latest version of the CRD when making the request (my guess is that in the DB it's still using v1alpha6 but presenting v1alpha7 to the user).
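To illustrate what that guess means (this is a sketch, not output from the affected cluster): a CRD can serve multiple versions while storing only one, and objects written before an upgrade remain stored as the older version until they are rewritten, even though the API can present the newer one. The versions list of the openstackclusters CRD would look roughly like this when v1alpha7 is the storage version:

```yaml
spec:
  versions:
  - name: v1alpha6
    served: true
    storage: false
  - name: v1alpha7
    served: true
    storage: true
```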
You can see that v1alpha7 is the newest version:
The Cluster resource agrees with this too:
However, when bringing the verbosity all the way up, you can see the request it makes when it tries to update; snipped this from the logs:
And because of that, it almost always 'notices' a change and loops endlessly. I tried to make a diff with the info that it is sending...
So because there was a change in the OpenStackCluster, and it's pulling v1alpha6 (somehow) while v1alpha7 is the real expected version, it's just looping. I feel like there's a spot here where it misses pulling the up-to-date version of the infrastructureRef.
What did you expect to happen?
No loops and none of this to happen:
looping.. non stop...
Cluster API version
Cluster API 1.5.1 + CAPO 0.8.0
Kubernetes version
No response
Anything else you would like to add?
No response
Label(s) to be applied
/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.