Description
When starting KLM, we can see a delay between the Pod running and KLM actually picking up work (starting the reconciliation). We need to investigate the reason for this delay and whether we can prevent it.
To Reproduce (observable on a local deployment as well as in KCPs):
Get a new KLM Pod created. Observe that directly after startup we see log entries like the ones below (starting servers, attempting to acquire the leader lease, checking stored versions, ...). However, the Pod then sits there for about 2-3 minutes without doing anything; only after that do we see reconciliation logs and things happening on the clusters.
One hint to look at is the acquisition of the leader lease. There is a log entry showing that the manager is attempting to acquire the lease, but the confirmation that it got the lease only appears about 2-3 minutes later (a small sketch for inspecting the Lease object follows the log excerpt below).
kubectl logs deploy/klm-controller-manager -n kcp-system
{"level":"INFO","date":"2024-09-06T07:01:01.476464588Z","logger":"setup","caller":"cmd/main.go:104","msg":"starting Lifecycle-Manager version: from_dockerfile","context":{}}
{"level":"INFO","date":"2024-09-06T07:01:01.675367088Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:186","msg":"skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","context":{"GVK":"operator.kyma-project.io/v1beta2, Kind=ModuleTemplate"}}
{"level":"INFO","date":"2024-09-06T07:01:01.675943505Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:202","msg":"Registering a validating webhook","context":{"GVK":"operator.kyma-project.io/v1beta2, Kind=ModuleTemplate","path":"/validate-operator-kyma-project-io-v1beta2-moduletemplate"}}
{"level":"INFO","date":"2024-09-06T07:01:01.677310963Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:183","msg":"Registering webhook","context":{"path":"/validate-operator-kyma-project-io-v1beta2-moduletemplate"}}
{"level":"INFO","date":"2024-09-06T07:01:01.67917138Z","logger":"controller-runtime.metrics","caller":"server/server.go:208","msg":"Starting metrics server","context":{}}
{"level":"INFO","date":"2024-09-06T07:01:01.679687047Z","logger":"controller-runtime.metrics","caller":"server/server.go:247","msg":"Serving metrics server","context":{"bindAddress":":8080","secure":false}}
{"level":"INFO","date":"2024-09-06T07:01:01.68032863Z","caller":"manager/server.go:83","msg":"starting server","context":{"name":"health probe","addr":"[::]:8081"}}
{"level":"INFO","date":"2024-09-06T07:01:01.681180005Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:191","msg":"Starting webhook server","context":{}}
{"level":"INFO","date":"2024-09-06T07:01:01.684769838Z","logger":"controller-runtime.certwatcher","caller":"certwatcher/certwatcher.go:161","msg":"Updated current TLS certificate","context":{}}
{"level":"INFO","date":"2024-09-06T07:01:01.685888172Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:242","msg":"Serving webhook server","context":{"host":"","port":9443}}
I0906 07:01:01.687028 1 leaderelection.go:254] attempting to acquire leader lease kcp-system/893110f7.kyma-project.io...
{"level":"LEVEL(-2)","date":"2024-09-06T07:01:01.689037588Z","logger":"storage-version-migration","caller":"crd/storage_version_dropper.go:21","msg":"Handling dropping stored versions for, map[Kyma:v1beta1 Manifest:v1beta1 ModuleTemplate:v1beta1 Watcher:v1beta1]","context":{}}
{"level":"INFO","date":"2024-09-06T07:01:01.692686922Z","logger":"controller-runtime.certwatcher","caller":"certwatcher/certwatcher.go:115","msg":"Starting certificate watcher","context":{}}
{"level":"LEVEL(-2)","date":"2024-09-06T07:01:01.693148797Z","logger":"setup","caller":"cmd/main.go:255","msg":"scheduled job for cleaning up metrics","context":{}}
{"level":"DEBUG","date":"2024-09-06T07:01:02.169369714Z","logger":"storage-version-migration","caller":"crd/storage_version_dropper.go:33","msg":"Checking the storedVersions for Manifest crd","context":{}}
{"level":"DEBUG","date":"2024-09-06T07:01:02.169440755Z","logger":"storage-version-migration","caller":"crd/storage_version_dropper.go:42","msg":"The new storedVersions are [v1beta2]","context":{}}
{"level":"DEBUG","date":"2024-09-06T07:01:02.18142938Z","logger":"storage-version-migration","caller":"crd/storage_version_dropper.go:33","msg":"Checking the storedVersions for ModuleTemplate crd","context":{}}
{"level":"DEBUG","date":"2024-09-06T07:01:02.18145963Z","logger":"storage-version-migration","caller":"crd/storage_version_dropper.go:42","msg":"The new storedVersions are [v1beta2]","context":{}}
{"level":"DEBUG","date":"2024-09-06T07:01:02.186495547Z","logger":"storage-version-migration","caller":"crd/storage_version_dropper.go:33","msg":"Checking the storedVersions for Watcher crd","context":{}}
{"level":"DEBUG","date":"2024-09-06T07:01:02.186523839Z","logger":"storage-version-migration","caller":"crd/storage_version_dropper.go:42","msg":"The new storedVersions are [v1beta2]","context":{}}
{"level":"DEBUG","date":"2024-09-06T07:01:02.194104589Z","logger":"storage-version-migration","caller":"crd/storage_version_dropper.go:33","msg":"Checking the storedVersions for Kyma crd","context":{}}
{"level":"DEBUG","date":"2024-09-06T07:01:02.194134672Z","logger":"storage-version-migration","caller":"crd/storage_version_dropper.go:42","msg":"The new storedVersions are [v1beta2]","context":{}}
We also noticed that this delay only seems to happen when an existing KLM Pod is replaced by a new one. In a fresh cluster with a fresh KLM deployment, the delay does not seem to occur.
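A mechanism that would fit this pattern is leader election: if the old Pod does not release its Lease on shutdown (for example because it is killed hard, or because release-on-cancel is not enabled), the replacement Pod has to wait until the lease expires before it can become leader and start reconciling. The relevant knobs live on the controller-runtime manager. The sketch below is illustrative and not KLM's actual main.go; the lease ID and namespace are taken from the log above, and the durations shown are the controller-runtime defaults (15s/10s/2s), so a multi-minute wait would point at non-default values or at the lease not being handed over cleanly.

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Leader-election timings (controller-runtime defaults, written out explicitly).
	leaseDuration := 15 * time.Second // how long an un-renewed lease stays valid before takeover
	renewDeadline := 10 * time.Second // how long the current leader keeps trying to renew
	retryPeriod := 2 * time.Second    // how often a candidate retries acquiring the lease

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "893110f7.kyma-project.io", // matches the lease seen in the logs
		LeaderElectionNamespace: "kcp-system",
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
		// Release the lease on graceful shutdown so a replacement Pod does not
		// have to wait for the old lease to expire before taking over.
		LeaderElectionReleaseOnCancel: true,
	})
	if err != nil {
		panic(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

Comparing the Lease's acquire timestamp with the old Pod's termination time should show whether the 2-3 minutes are exactly this waiting period.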
(Screenshot: 2-3 mins pause before the first reconciliation logs)
Investigation task timeboxed for: 2d
Reasons
Speed up local testing and avoid disruptions in production (e.g., when rolling out changes such as the certificates).
Acceptance Criteria
Feature Testing
No response
Testing approach
No response
Attachments
No response