YuriiVarvynets opened 1 year ago
Thanks for opening the issue @YuriiVarvynets. We do use the lock manager in the controller to make sure only one controller is running at a time: https://github.com/pulumi/pulumi-kubernetes-operator/blob/master/cmd/manager/main.go#L123-L130.
Perhaps you will need to bump up the graceful shutdown timeout on your pods if your stack updates take a long time?
Would you be able to dump some logs from the operator around the upgrade? You might also want to enable debug logging (--zap-level=debug) and provide the logs for the replacement pod. cc @squaremo in case I am missing something obvious.
Thank you, @viveklak for taking a look at this. I enabled debug, but it takes time to reproduce the issue.
> We do use the lock manager in the controller to make sure only one controller is running at a time: https://github.com/pulumi/pulumi-kubernetes-operator/blob/master/cmd/manager/main.go#L123-L130.
How can I increase leaseDurationSeconds for this lock? The default of 15 seconds is too low for my use case.
```
# kubectl get leases.coordination.k8s.io pulumi-kubernetes-operator-lock -o yaml -n pulumi
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
...
spec:
  leaseDurationSeconds: 15
...
```
> Perhaps you will need to bump up the graceful shutdown timeout on your pods if your stack updates take a long time?
I have 5 stacks with basic S3 apps: https://github.com/pulumi/examples/tree/master/aws-py-s3-folder. Graceful shutdown is set to the default of 5 minutes, which should be enough for such a simple app to complete. However, if a stack takes longer than 5 minutes to process and the operator restarts during that time, we will end up with an unreleased lock, am I right? What other options are there to avoid such a situation?
From the log, with debug log level enabled:

```
E1114 11:27:21.821634 1 leaderelection.go:325] error retrieving resource lock pulumi/pulumi-kubernetes-operator-lock: Get "https://172.16.0.1:443/apis/coordination.k8s.io/v1/namespaces/pulumi/leases/pulumi-kubernetes-operator-lock": context deadline exceeded
I1114 11:27:21.821759 1 leaderelection.go:278] failed to renew lease pulumi/pulumi-kubernetes-operator-lock: timed out waiting for the condition
{"level":"debug","ts":"2022-11-14T11:27:21.821Z","logger":"controller-runtime.manager.events","msg":"Normal","object":{"kind":"ConfigMap","namespace":"pulumi","name":"pulumi-kubernetes-operator-lock","uid":"3df0dd70-932c-4a6e-ab8d-3ba09c372966","apiVersion":"v1","resourceVersion":"327159499"},"reason":"LeaderElection","message":"pulumi-kubernetes-operator-d5cea0e0-69d96cd79b-wtrvm_5a56dd4a-385d-4276-89f3-305db266afa6 stopped leading"}
{"level":"error","ts":"2022-11-14T11:27:21.821Z","logger":"cmd","msg":"Manager exited non-zero","error":"leader election lost","stacktrace":"main.main\n\t/home/runner/work/pulumi-kubernetes-operator/pulumi-kubernetes-operator/cmd/manager/main.go:179\nruntime.main\n\t/opt/hostedtoolcache/go/1.19.2/x64/src/runtime/proc.go:250"}
{"level":"debug","ts":"2022-11-14T11:27:21.821Z","logger":"controller-runtime.manager.events","msg":"Normal","object":{"kind":"Lease","apiVersion":"coordination.k8s.io/v1"},"reason":"LeaderElection","message":"pulumi-kubernetes-operator-d5cea0e0-69d96cd79b-wtrvm_5a56dd4a-385d-4276-89f3-305db266afa6 stopped leading"}
{"level":"info","ts":"2022-11-14T11:27:21.821Z","logger":"controller-runtime.manager.controller.stack-controller","msg":"Shutdown signal received, waiting for all workers to finish"}
```
When this happens, the Pulumi backend lock may not be released.
Thanks for the details @YuriiVarvynets. Adding @squaremo for thoughts.
> However, if a stack takes longer than 5 minutes to process and the operator restarts during that time, we will end up with an unreleased lock, am I right?
Yes, that is correct. I wouldn't consider myself a controller-runtime locking expert, but it appears that failing to renew the lease might cause the graceful shutdown period to be skipped: https://github.com/kubernetes-sigs/controller-runtime/blob/8da9760581ed4f43eee9c2f63764c1cbe0cd3104/pkg/manager/internal.go#L632
I can think of a couple of things we should probably do:

1. Make leaseDurationSeconds configurable.
2. Run pulumi cancel as a prerequisite for a stack if necessary? I do think this could be risky and we would need to consider some safety valves here.

pulumi cancel is risky and should be optional. However, it makes sense for me and for others, as I was thinking of implementing an automatic tool that would clean up locked stacks anyway.
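As a sketch of such a cleanup tool: Pulumi's Go Automation API exposes Stack.Cancel, the programmatic equivalent of pulumi cancel, which releases the backend lock left by an interrupted update. The stack name and project directory below are illustrative placeholders, and this is only safe to run once you are certain no update is actually in progress:

```go
package main

import (
	"context"
	"log"

	"github.com/pulumi/pulumi/sdk/v3/go/auto"
)

func main() {
	ctx := context.Background()

	// Select the stuck stack from its project directory
	// ("dev" and the path are illustrative placeholders).
	stack, err := auto.SelectStackLocalSource(ctx, "dev", "/path/to/project")
	if err != nil {
		log.Fatal(err)
	}

	// Cancel releases the lock left behind by the crashed operator pod.
	// Dangerous if an update is still running: cancelling a live update
	// can leave resources in an inconsistent state.
	if err := stack.Cancel(ctx); err != nil {
		log.Fatal(err)
	}
	log.Println("backend lock released")
}
```

(Running this requires access to the same Pulumi backend and stack credentials the operator uses.)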
I see these factors:
The first item is a problem by itself, and the second makes it worse.
But I am missing why the lease fails to be renewed. @YuriiVarvynets can you explain why "default of 15 seconds is too low for my use case", and if that's the cause of the failed lease renewal -- e.g., are you running the operator in an environment where making an update every 15 seconds is unrealistic?
To be clear, I think it would be OK to have a longer leaseDurationSeconds. It means less sensitivity to failures (it takes longer for a failure to be detected and for another leader to take over), but since stacks typically deploy in O(minutes), it would not be unreasonable to also take minutes to recover.
Updating every 15 seconds is okay from an infrastructure/load point of view. Right now, I see the issue several times a week, which does not allow me to use the Pulumi operator in production.
If adding leaseDurationSeconds and pulumi cancel as configurable options would minimize the possibility of the issue, that would be great.
Are there any estimates on resolving this issue?
I am using an Azure repo and the Pulumi operator, but after deploying the CRD I am getting an error message. Do you have any proper documentation on the Azure operator?
What happened?
If the Pulumi operator pod is restarted for any reason (for example, during the operator upgrade process), it does not unlock a stack that was being processed during the restart.
Where pod pulumi-kubernetes-operator-6677a05e-5955496654-8x2mh does not exist anymore.

Configuration:
Steps to reproduce
Expected Behavior
Locks are released before pod restart.
Actual Behavior
Locks are NOT released and the next operator leader cannot perform any actions with Stacks.
Output of pulumi about

No response
Additional context
No response
Contributing
Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).