nutanix-cloud-native / cluster-api-runtime-extensions-nutanix

https://nutanix-cloud-native.github.io/cluster-api-runtime-extensions-nutanix/
Apache License 2.0

fix: Deterministic ordering of lifecycle hooks #847

Closed jimmidyson closed 1 month ago

jimmidyson commented 1 month ago

This commit introduces wrapper handlers that allow multiple lifecycle hooks to be called, in order, via a single registered hook. This ensures that hooks run in a defined order, fixing issues that have arisen requiring the CNI hook to be applied before the MetalLB hook.

dlipovetsky commented 1 month ago

With one handler for a specific hook type running "sub-handlers" sequentially, we'll need to be careful to make sure the handler responds within the CAPI deadline of 30s.

We talked about running "sub-handlers" in parallel, and we can do that in follow-up PR.

dkoshkin commented 1 month ago

On main I see that the timeout is 10 seconds, am I looking at the correct place?

apiVersion: runtime.cluster.x-k8s.io/v1alpha1
kind: ExtensionConfig
metadata:
  annotations:
    runtime.cluster.x-k8s.io/inject-ca-from-secret: caren-system/cluster-api-runtime-extensions-nutanix-runtimehooks-tls
  creationTimestamp: "2024-08-08T22:21:32Z"
  generation: 2
  labels:
    cluster.x-k8s.io/provider: runtime-extension-caren
    clusterctl.cluster.x-k8s.io: ""
  name: cluster-api-runtime-extensions-nutanix
  resourceVersion: "4431"
  uid: 28e66cd5-b136-4f4b-b818-4c0037e6930f
status:
  - failurePolicy: Fail
    name: serviceloadbalancerhandler-acpi.cluster-api-runtime-extensions-nutanix
    requestHook:
      apiVersion: hooks.runtime.cluster.x-k8s.io/v1alpha1
      hook: AfterControlPlaneInitialized
    timeoutSeconds: 10
...
dlipovetsky commented 1 month ago

On main I see that the timeout is 10 seconds, am I looking at the correct place?

Good catch. Those 10s timeouts make sense for individual handlers, but we should increase the timeout, perhaps to 30s, for the single aggregate handler.
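Concretely, that would mean changing the `timeoutSeconds` on the aggregate handler's entry in the ExtensionConfig shown above. The fragment below is illustrative only, assuming the 30s ceiling mentioned earlier in the thread:

```yaml
# Hypothetical: the handler entry from the ExtensionConfig status above,
# with timeoutSeconds raised to match the CAPI deadline discussed here.
- failurePolicy: Fail
  name: serviceloadbalancerhandler-acpi.cluster-api-runtime-extensions-nutanix
  requestHook:
    apiVersion: hooks.runtime.cluster.x-k8s.io/v1alpha1
    hook: AfterControlPlaneInitialized
  timeoutSeconds: 30
```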

dlipovetsky commented 1 month ago

On second thought, if a sub-handler fails, no other sub-handlers will run, so shortening the timeout isn't useful. But it will be necessary if/when we execute sub-handlers in parallel.

dlipovetsky commented 1 month ago

I'll reduce the MetalLB timeout to 10s, since the handler is already limited to 10s by CAPI's default. We'll want to raise the handler timeout to 30s, but in a way that fits with the library, and that'll need some thought.