upbound / provider-terraform

A Crossplane provider for Terraform
https://marketplace.upbound.io/providers/upbound/provider-terraform/
Apache License 2.0
150 stars, 59 forks

provider-terraform thrashes disk computing checksums with plugin cache disabled #196

Closed: toastwaffle closed this issue 1 year ago

toastwaffle commented 1 year ago

What happened?

I found a nice footgun, and proceeded to use it :)

We have the plugin cache disabled, we're using both the google and google-beta Terraform providers, and we triggered a provider version upgrade. Initially, we didn't have -upgrade=true in our initArgs, so the workspaces were repeatedly failing to reconcile. Once we added it, provider-terraform was able to upgrade the providers and reconcile the workspaces.
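For reference, the setup described above looks roughly like the sketch below. This is a minimal illustration, not our actual config: the resource names and the inline module are placeholders, and credentials and other ProviderConfig fields are left out.

```yaml
# ProviderConfig with the Terraform plugin cache disabled.
# Credentials and other fields are omitted from this sketch.
apiVersion: tf.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  pluginCache: false
---
# Workspace that passes -upgrade=true to terraform init.
# The name and the inline module are hypothetical placeholders.
apiVersion: tf.upbound.io/v1beta1
kind: Workspace
metadata:
  name: example-workspace
spec:
  providerConfigRef:
    name: default
  forProvider:
    source: Inline
    module: |
      output "example" {
        value = "hello"
      }
    initArgs:
      - -upgrade=true
```

With the plugin cache disabled, each workspace directory keeps its own copy of the provider binaries, which is what makes the disk usage described below possible.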

At this point, provider-terraform started running very slowly, to the point that it was consistently at its --max-reconcile-rate limit, and workspaces were appearing back on the queue after the poll interval before the queue emptied.

During this time, exec'ing into the pod and running top and iostat showed low CPU and RAM usage, but a very large number of disk operations. This was also reflected in the GCP monitoring for the GKE node the pod was running on.

My hypothesis is that computing the md5sums over the TF provider binaries caused too much disk usage, leading to throttling. After restarting the pod (which threw away the old binaries), provider-terraform was able to clear the queue.

I know there are a couple of mitigation options that we can try:

However, I think there are a couple of possible improvements to provider-terraform:

How can we reproduce it?

What environment did it happen in?

toastwaffle commented 1 year ago

I believe it should be possible to set up persistent volumes using the new DeploymentRuntimeConfig, and the environment variable XP_TF_DIR to make the controller use such a volume for the workspaces, so I think this is actually fixed (by #197).
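A rough sketch of what that could look like is below. I haven't verified this end to end, so treat it as an assumption-laden starting point: the config name, the mount path, and the PersistentVolumeClaim name are all hypothetical, and the PVC is assumed to already exist in the namespace the provider runs in.

```yaml
# DeploymentRuntimeConfig that mounts a persistent volume and points
# XP_TF_DIR at it. All names and paths here are hypothetical.
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: terraform-persistent
spec:
  deploymentTemplate:
    spec:
      selector: {}
      template:
        spec:
          containers:
            # package-runtime is the name Crossplane uses for the provider container.
            - name: package-runtime
              env:
                - name: XP_TF_DIR
                  value: /tf-workspaces
              volumeMounts:
                - name: tf-workspaces
                  mountPath: /tf-workspaces
          volumes:
            - name: tf-workspaces
              persistentVolumeClaim:
                # Assumed to be created separately.
                claimName: tf-workspaces-pvc
```

The Provider object would then reference this config through spec.runtimeConfigRef.name so that the controller pod picks up the volume and the environment variable.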