kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

Cluster observedGeneration is updated without changing conditions on topology upgrades #11292

Open dkoshkin opened 4 days ago

dkoshkin commented 4 days ago

What steps did you take and what happened?

Followed the Quick Start for Docker with Kubernetes v1.31.0, then updated the Cluster object to change the version to v1.31.1.

The observedGeneration got updated without any of the conditions changing.

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"cluster.x-k8s.io/v1beta1","kind":"Cluster","metadata":{"annotations":{},"name":"capi-quickstart","namespace":"default"},"spec":{"clusterNetwork":{"pods":{"cidrBlocks":["192.168.0.0/16"]},"serviceDomain":"k8s.test","services":{"cidrBlocks":["10.96.0.0/12"]}},"topology":{"class":"quick-start","controlPlane":{"metadata":{},"replicas":1},"variables":[{"name":"imageRepository","value":""},{"name":"etcdImageTag","value":""},{"name":"coreDNSImageTag","value":""},{"name":"podSecurityStandard","value":{"audit":"restricted","enabled":false,"enforce":"baseline","warn":"restricted"}}],"version":"v1.31.0","workers":{"machineDeployments":[{"class":"default-worker","name":"md-0","replicas":1}],"machinePools":[{"class":"default-worker","name":"mp-0","replicas":1}]}}}}
  creationTimestamp: "2024-10-15T19:42:12Z"
  finalizers:
  - cluster.cluster.x-k8s.io
  generation: 4
  labels:
    cluster.x-k8s.io/cluster-name: capi-quickstart
    topology.cluster.x-k8s.io/owned: ""
  name: capi-quickstart
  namespace: default
  resourceVersion: "1936"
  uid: d5602238-28ad-4b63-b785-d9766362e561
status:
  conditions:
  - lastTransitionTime: "2024-10-15T19:42:45Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-10-15T19:42:45Z"
    status: "True"
    type: ControlPlaneInitialized
  - lastTransitionTime: "2024-10-15T19:42:45Z"
    status: "True"
    type: ControlPlaneReady
  - lastTransitionTime: "2024-10-15T19:42:13Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2024-10-15T19:42:15Z"
    status: "True"
    type: TopologyReconciled
  infrastructureReady: true
  observedGeneration: 3
  phase: Provisioned

Notice that observedGeneration is now 4, with no change to the conditions or to any other status field.

status:
  conditions:
  - lastTransitionTime: "2024-10-15T19:42:45Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-10-15T19:42:45Z"
    status: "True"
    type: ControlPlaneInitialized
  - lastTransitionTime: "2024-10-15T19:42:45Z"
    status: "True"
    type: ControlPlaneReady
  - lastTransitionTime: "2024-10-15T19:42:13Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2024-10-15T19:42:15Z"
    status: "True"
    type: TopologyReconciled
  infrastructureReady: true
  observedGeneration: 4
  phase: Provisioned

What did you expect to happen?

To determine when a cluster has completed an upgrade, we first wait for observedGeneration to equal generation and then for the conditions on the Cluster to be True.

In this case, though, observedGeneration is updated without any change in the conditions. This creates a race condition: our wait returns prematurely, before the upgrade has even started.

I would expect some status to change along with observedGeneration to indicate that the spec has changed and needs to be processed.
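For illustration, here is a minimal sketch of the kind of check described above, written against plain dicts shaped like the Cluster YAML in this issue. It is not our actual code: `upgrade_looks_complete` and `make_cluster` are hypothetical names, and only the fields shown in the YAML are modeled.

```python
# Sketch only: plain dicts stand in for the Cluster object shown in this issue.
# upgrade_looks_complete and make_cluster are hypothetical helper names.

CONDITION_TYPES = [
    "Ready",
    "ControlPlaneInitialized",
    "ControlPlaneReady",
    "InfrastructureReady",
    "TopologyReconciled",
]


def upgrade_looks_complete(cluster: dict) -> bool:
    """True when observedGeneration equals generation and every condition is True."""
    generation = cluster["metadata"]["generation"]
    status = cluster.get("status", {})
    if status.get("observedGeneration") != generation:
        return False
    conditions = status.get("conditions", [])
    # An empty condition list is treated as "not complete".
    return bool(conditions) and all(c["status"] == "True" for c in conditions)


def make_cluster(generation: int, observed_generation: int) -> dict:
    """Build a Cluster-like dict with all five conditions set to True."""
    return {
        "metadata": {"generation": generation},
        "status": {
            "observedGeneration": observed_generation,
            "conditions": [{"type": t, "status": "True"} for t in CONDITION_TYPES],
        },
    }


# Right after the version bump: generation is 4, observedGeneration is still 3.
before = make_cluster(4, 3)
# Shortly after: observedGeneration is 4, but no condition has changed.
after = make_cluster(4, 4)

print(upgrade_looks_complete(before))  # False
print(upgrade_looks_complete(after))   # True, even though the upgrade has not started
```

The second result is the premature return: with the status from this issue, nothing in the check can distinguish "spec observed, upgrade pending" from "upgrade finished".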

Cluster API version

$ clusterctl version
clusterctl version: &version.Info{Major:"", Minor:"", GitVersion:"1.8.4", GitCommit:"brew", GitTreeState:"clean", BuildDate:"2024-10-08T05:24:23Z", GoVersion:"go1.23.2", Compiler:"gc", Platform:"darwin/arm64"}

Kubernetes version

v1.31.0 > v1.31.1

Anything else you would like to add?

We're currently working around this with a sleep between changing the resource and starting the wait, but would appreciate guidance from others who have seen this and have other ideas.

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

k8s-ci-robot commented 4 days ago

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

sbueringer commented 4 days ago

Which of the conditions would you have expected to change?

That being said, we're working on new conditions for the next release (#11291). I think this will probably be solved through the new additional conditions, which also contain observedGeneration.
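As a rough sketch of how a client could use a per-condition observedGeneration: the check below assumes each condition carries an `observedGeneration` field, as the standard Kubernetes `metav1.Condition` type does. The exact shape of the new Cluster conditions is an assumption here, and `conditions_current` is a hypothetical helper name.

```python
# Sketch only: assumes each condition has an observedGeneration field,
# as in the standard metav1.Condition type. conditions_current is hypothetical.

def conditions_current(cluster: dict, required_types: list) -> bool:
    """True when every required condition is True for the current generation."""
    generation = cluster["metadata"]["generation"]
    conditions = cluster.get("status", {}).get("conditions", [])
    by_type = {c["type"]: c for c in conditions}
    for cond_type in required_types:
        cond = by_type.get(cond_type)
        if cond is None:
            return False  # condition not reported yet
        if cond.get("observedGeneration") != generation:
            return False  # condition was computed against an older spec
        if cond.get("status") != "True":
            return False
    return True


stale = {
    "metadata": {"generation": 4},
    "status": {
        "conditions": [
            {"type": "TopologyReconciled", "status": "True", "observedGeneration": 3},
        ],
    },
}
fresh = {
    "metadata": {"generation": 4},
    "status": {
        "conditions": [
            {"type": "TopologyReconciled", "status": "True", "observedGeneration": 4},
        ],
    },
}

print(conditions_current(stale, ["TopologyReconciled"]))  # False
print(conditions_current(fresh, ["TopologyReconciled"]))  # True
```

Whether this closes the race depends on when the controller updates each condition's observedGeneration relative to its status, which this sketch does not model.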