pulumi / pulumi-kubernetes-operator

A Kubernetes Operator that automates the deployment of Pulumi Stacks

Handle existing workspace directories better #552

Closed JDTX0 closed 3 months ago

JDTX0 commented 4 months ago

Proposed changes

I've changed the stack reconciliation code to clean up existing workspace directories when it finds them, instead of treating them as a lock and failing permanently. We've repeatedly seen the operator leave a workspace directory behind for reasons we couldn't pin down, and when that happens the stack fails to reconcile indefinitely. The only way to resolve the issue is to remove the directory manually or restart the entire operator pod.

In my opinion, the operator shouldn't treat directories as locks. Given that stacks are processed by one thread at a time and Pulumi has its own state locking (for both the SaaS and self-hosted backends), an existing directory shouldn't block reconciliation or cause a failure that never resolves itself.
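
For illustration, here's a minimal Go sketch of the behavior this PR proposes (not the actual operator code; the function name, path, and permissions are mine): if a workspace directory is already present, remove it and recreate it rather than returning an error.

```go
package main

import (
	"fmt"
	"os"
)

// ensureWorkspaceDir removes any leftover directory at path (e.g. one left
// behind by a previous pod restart) and recreates it, rather than treating
// its existence as a lock.
func ensureWorkspaceDir(path string) error {
	if _, err := os.Stat(path); err == nil {
		// Stale directory from an earlier run: clean it up instead of failing.
		if err := os.RemoveAll(path); err != nil {
			return fmt.Errorf("removing stale workspace dir %q: %w", path, err)
		}
	} else if !os.IsNotExist(err) {
		return fmt.Errorf("checking workspace dir %q: %w", path, err)
	}
	// Recreate a fresh, empty workspace directory.
	return os.MkdirAll(path, 0o700)
}

func main() {
	if err := ensureWorkspaceDir("/tmp/pulumi-workdir-example"); err != nil {
		panic(err)
	}
}
```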

Also, I fixed a few typos. Let me know if there's anything I've missed or if there are any concerns with this approach.

github-actions[bot] commented 4 months ago

PR is now waiting for a maintainer to run the acceptance tests. This PR will only perform build and linting. Note for the maintainer: To run the acceptance tests, please comment /run-acceptance-tests on the PR

rquitales commented 3 months ago

@JDTX0, thank you for your contribution, and I apologize for the delay in reviewing this PR. Upon further examination of the codebase, I agree with your assessment: only one thread/process can work on a given stack at a time when the operator deployment runs a single replica. Furthermore, even if a user were to increase the replica count in their operator deployment manifests, they would likely encounter other issues, since horizontal scaling isn't natively supported. As such, it seems reasonable to remove the rudimentary lock check based on the existence of the work directory.

However, I'm still concerned about why these directories aren't being cleaned up in the first place. If you have any insight into why this occurs in your deployments, that would be a helpful starting point for my investigation.

rquitales commented 3 months ago

/run-acceptance-tests

JDTX0 commented 3 months ago

@rquitales Thanks for the reply and review! We found that the operator was losing leader election because lease renewals were timing out. The renewal timeouts turned out to be caused by degraded etcd performance.

The election failure caused the container to restart, leaving behind workspace directories, which leads to this issue. So it's not directly a problem in the operator code.
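
For anyone hitting the same symptom: assuming the operator uses controller-runtime's leader election (the ID and timing values below are illustrative, not the operator's defaults), lengthening the lease timings gives the leader more slack to renew when etcd is slow, at the cost of slower failover. A hedged sketch:

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	leaseDuration := 60 * time.Second // how long a lease is valid before it expires
	renewDeadline := 45 * time.Second // how long the leader keeps retrying renewal
	retryPeriod := 10 * time.Second   // how often candidates retry acquiring the lease

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "pulumi-kubernetes-operator-lock", // illustrative ID
		LeaseDuration:    &leaseDuration,
		RenewDeadline:    &renewDeadline,
		RetryPeriod:      &retryPeriod,
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```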

Regardless, I'd still very much like this change merged as it makes the operator more resilient to a myriad of possible problems, lease-related or otherwise.