Closed by toastwaffle 1 year ago
I believe it should be possible to set up persistent volumes using the new DeploymentRuntimeConfig and the environment variable XP_TF_DIR to make the controller use such a volume for the workspaces, so I think this is actually fixed (by #197).
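For anyone finding this later, here is a minimal sketch of what that could look like. The DeploymentRuntimeConfig kind and the package-runtime container name are standard Crossplane, but the mount path, the volume/claim names, and the assumption that XP_TF_DIR simply relocates the workspace directory are mine, so adjust to your setup:

```yaml
# Sketch only: give the provider pod a persistent volume and point XP_TF_DIR
# at it so Terraform workspaces (and downloaded providers) survive restarts.
# The claim name "tf-workdir" and mount path "/tf" are placeholders.
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: provider-terraform
spec:
  deploymentTemplate:
    spec:
      selector: {}
      template:
        spec:
          containers:
            - name: package-runtime        # Crossplane's default runtime container name
              env:
                - name: XP_TF_DIR          # where provider-terraform keeps its workspaces
                  value: /tf
              volumeMounts:
                - name: tf-workdir
                  mountPath: /tf
          volumes:
            - name: tf-workdir
              persistentVolumeClaim:
                claimName: tf-workdir      # placeholder PVC
```

The provider package would then point at this via its runtimeConfigRef.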
What happened?
I found a nice footgun, and proceeded to use it :)
We have the plugin cache disabled, we're using both the google and google-beta providers, and we managed to cause a provider version upgrade. Initially, we didn't have -upgrade=true in our initArgs, so the provider was repeatedly failing to reconcile. Once we added that in, provider-terraform was able to upgrade the providers and reconcile the workspaces.

At this point, provider-terraform started running very slowly, to the point that it was consistently at its --max-reconcile-rate limit, and workspaces were being re-queued after the poll interval before the queue had emptied.
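For context, our ProviderConfig looks roughly like the sketch below. It is written from memory: the apiVersion and especially the pluginCache field name are assumptions about the schema rather than something I have re-checked, and the configuration block is trimmed down:

```yaml
# Rough sketch of the setup described above: plugin cache disabled, with both
# the google and google-beta providers pulled in by the workspaces.
# apiVersion and the pluginCache field are from memory and may not match
# your provider-terraform version exactly.
apiVersion: tf.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  pluginCache: false        # assumed field name for disabling the TF plugin cache
  configuration: |
    provider "google" {}
    provider "google-beta" {}
```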
During this time, exec'ing into the pod and running top and iostat showed low CPU/RAM usage, but a very large number of disk operations. This was reflected in the GCP monitoring for the GKE node the pod was running on.

My hypothesis is that computing the md5sums over the TF provider binaries caused too much disk usage, leading to throttling. After restarting the pod (which threw away the old binaries), provider-terraform was able to clear the queue.
I know there are a couple of mitigation options that we can try:
However, I think there are a couple of possible improvements that could be made to provider-terraform:
How can we reproduce it?
initArgs: ['upgrade=true'] in workspaces
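For completeness, here is roughly where that goes on a Workspace. Again a sketch from memory: the apiVersion, the Remote source, and the module URL are placeholders, and the flag is spelled -upgrade=true as mentioned above:

```yaml
# Sketch: pass -upgrade=true to terraform init via the workspace's initArgs.
# apiVersion, source and module are placeholders; initArgs is the relevant part.
apiVersion: tf.upbound.io/v1beta1
kind: Workspace
metadata:
  name: example
spec:
  providerConfigRef:
    name: default
  forProvider:
    source: Remote
    module: git::https://example.com/org/example-module.git   # placeholder
    initArgs:
      - "-upgrade=true"
```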
What environment did it happen in?