@toastwaffle - are you running with `--max-reconcile-rate` set to a value larger than 1? By default the provider runs with that parameter set to 1, which means it is essentially single-threaded. The provider uses the terraform CLI, so it needs a CPU for each thread (potentially). If you set `max-reconcile-rate` to 5, it could use up to 5 CPUs to run 5 concurrent reconciliations. See https://github.com/upbound/provider-terraform/blob/main/README.md#known-limitations
@bobh66 yes we are, but if you look at the graphs above we're not CPU-bound. We currently have `--max-reconcile-rate=50`, with 4 CPUs requested and a limit of 7.5. I did wonder if this was related (see this comment on a different issue), but if we were being limited to our CPU count I'd expect reconciles to still be making progress instead of getting completely stuck.
I wonder if the large difference between `max-reconcile-rate` and the available CPU count is a factor? It would be interesting to see what happens if you set them both to the same value (4 in this case). If you still see the problem then I have to assume the locking is not working as expected. I'll take a look at the locking code again and see if anything obvious jumps out.
There may be a quirk to the way the locking works that is problematic. If I'm reading the code correctly:
This can result in a long-running RLock process blocking the acquisition of new RLocks if there is also a Lock pending.
The Lock request will wait for the long-running RLock process to finish before it runs, and meanwhile it will block all other RLock requests, so we end up single-threaded until the long-running RLock thread is finished, at which time the Lock process runs and then unblocks all of the other RLock requests.
Practically this means that any time a new Workspace is created, regardless of its content, it will block the reconciliation of all other Workspaces while it waits for any current RLock processes to finish, because it needs to run `terraform init`.
This is all conjecture for the moment but I think it sounds plausible. Figuring out how to solve it is the next problem.
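For reference, here is a minimal, standalone Go sketch of the `sync.RWMutex` behaviour described above (writer preference); it is not the provider's code, just an illustration of how a pending `Lock()` blocks new `RLock()` calls while an existing reader is still running:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	// Long-running reader: stands in for a slow apply/destroy holding RLock.
	mu.RLock()
	go func() {
		time.Sleep(3 * time.Second)
		mu.RUnlock()
	}()

	// Pending writer: stands in for a new Workspace's init waiting on Lock.
	go func() {
		start := time.Now()
		mu.Lock()
		fmt.Printf("writer acquired Lock after %v\n", time.Since(start))
		mu.Unlock()
	}()

	time.Sleep(100 * time.Millisecond) // give the writer time to start waiting

	// New reader: another reconciliation asking for RLock. Because a writer is
	// already queued, this call blocks until the writer has run, even though
	// only readers currently hold the lock.
	start := time.Now()
	mu.RLock()
	fmt.Printf("second reader acquired RLock after %v\n", time.Since(start))
	mu.RUnlock()
}
```

Running this prints roughly 3 seconds for both the writer and the second reader, which matches the "single-threaded until the long-running RLock finishes" behaviour.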
Oh, yeah, that makes sense, and explains why in the example above a few things did get processed while the first create was running, until the second create entered the metaphorical "queue" and called Lock().
One potential workaround: can we avoid locking entirely if the plugin cache is disabled? Currently the lock is used regardless of whether the plugin cache is enabled.
Just thinking out loud - do we need to be RLock'ing for `terraform plan`? If it fails due to some race condition we'll just retry it on the next reconciliation anyway. `init` and `apply` definitely need to Lock/RLock, and I'm not sure about `delete`. But it seems like we could mitigate much of the problem if we didn't lock on `plan` and just allowed the reconciliation to try again if it fails due to some race condition. @ytsarev @negz ?
> One potential workaround: can we avoid locking entirely if the plugin cache is disabled? Currently the lock is used regardless of whether the plugin cache is enabled

That's a possibility too - the only reason for locking is to prevent cache corruption issues, so if the cache is disabled there is no need to lock. I can make a PR for that.
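A rough sketch of what "only lock when the plugin cache is enabled" could look like; the type and method names here are hypothetical, not the provider's actual code:

```go
package workspace

import (
	"context"
	"sync"
)

// runner is an illustrative stand-in for whatever owns the shared-cache mutex.
type runner struct {
	cacheMu            sync.RWMutex
	pluginCacheEnabled bool
}

// runInit wraps a terraform init invocation. The RWMutex only exists to
// protect the shared plugin cache (init takes Lock, apply/destroy take RLock),
// so when the cache is disabled there is nothing to protect and the lock can
// be skipped entirely.
func (r *runner) runInit(ctx context.Context, run func(context.Context) error) error {
	if r.pluginCacheEnabled {
		r.cacheMu.Lock()
		defer r.cacheMu.Unlock()
	}
	return run(ctx)
}
```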
Not locking for a plan operation would also help, but we'd still potentially underutilise concurrent reconciles. Challenging to fix without risking starvation though :/
Yes - new resource creation during long-running applies/deletes will still cause throttling, but at least the other reconciliations could still run assuming no changes in the `terraform plan` output.
A better overall solution may be to replace the in-memory storage of terraform workspaces with a persistent volume that can be sized as needed. That would allow sufficient storage space to run with caching disabled without using excessive memory, and it also allows pod updates to not have to re-run terraform init on all Workspaces since the workspaces are persistent. I'm pretty sure we could do this today using a `DeploymentRuntimeConfig` mounted on `/tf`.

It doesn't even need to be mounted on `/tf`; the `XP_TF_DIR` env var can make it use a different location.
@bobh66 we intentionally avoid the lock on `plan` https://github.com/upbound/provider-terraform/blob/main/internal/terraform/terraform.go#L503 so your idea sounds safe.

~~I think 503 is the creation of the Command object, but it doesn't get executed until https://github.com/upbound/provider-terraform/blob/main/internal/terraform/terraform.go#L514, which is after the `RLock()` call.~~
I think @ytsarev was referring to the `-lock=false` flag?
Should I remove the locking for plans in #240, or is that best done separately?
> I think @ytsarev was referring to the `-lock=false` flag?
You're right - my mistake. Apologies @ytsarev
> Should I remove the locking for plans in https://github.com/upbound/provider-terraform/pull/240, or is that best done separately?
I think it makes sense to do it in #240
Just to report back on this - I just upgraded our production cluster to 0.14.1 successfully, and discovered a small side effect of not locking for plans. As the provider was starting up and recreating all of the workspaces (hopefully for the last time as we added a persistent volume at the same time), we saw the same plugin cache conflict errors that we saw previously:
```
Error: Failed to install provider

Error while installing hashicorp/google-beta v5.0.0: open
/tf/plugin-cache/registry.terraform.io/hashicorp/google-beta/5.0.0/linux_amd64/terraform-provider-google-beta_v5.0.0_x5:
text file busy
```
What I think is happening here is that while we do only have one `terraform init` running, the fact that `terraform plan` operations can run at the same time causes a conflict. I didn't reproduce that in my local testing because I wasn't creating Workspaces fast enough to trigger it, and I already had the persistent volume.
The backoff and retries resolved this eventually - on startup it took about 15 minutes to re-initialise our 150 workspaces (with `--max-reconcile-rate=50`; our maximum CPU usage was 2.5 CPUs). However, those errors aren't ideal, so maybe we want to reinstate the locking for `terraform plan`?
I saw the same behavior but I'm not sure it merits reinstating the lock for `plan`. My next step is to create a PVC and mount it to the pod so that the `/tf` directory is persistent across reboots. That should eliminate the need to run `init` on all Workspaces on a restart, and also preserve the cache content. That seems like a more stable solution in general, especially for installations with large numbers of Workspaces.
Ack, the PV has definitely helped us. Looking at our graphs today with lots of Workspaces being created and destroyed, I'm confident that this is now fixed.
Thank you Bob for figuring out the root cause and helping get the fix reviewed and released!
Ah, sorry for all the issues with the locks. The terraform cache is really, really picky. I feel like the execution model for the terraform plugin is just poor in general. A PVC would help. I think sharding would also help greatly if you have a lot of terraform workspaces. @toastwaffle can you share your Grafana graph JSON by chance? I would like to analyze our pipelines similarly.
We're using some recording rules and custom metrics, but the raw PromQL queries are:
```
sum by (controller)(controller_runtime_max_concurrent_reconciles{controller="managed/workspace.tf.upbound.io"})
sum by (controller)(controller_runtime_active_workers{controller="managed/workspace.tf.upbound.io"})
sum by (controller)(rate(controller_runtime_reconcile_total{controller="managed/workspace.tf.upbound.io"}[5m]))
sum by ()(rate(container_cpu_usage_seconds_total{container!="",namespace="crossplane-system",pod=~"upbound-provider-terraform.*"}[5m]))
sum by ()(container_memory_working_set_bytes{container!="",namespace="crossplane-system",pod=~"upbound-provider-terraform.*"})
sum by ()(container_memory_usage_bytes{container!="",namespace="crossplane-system",pod=~"upbound-provider-terraform.*"})
```
I am facing the issue even with the PV enabled.
```
Warning: Unable to open CLI configuration file
The CLI configuration file at "./.terraformrc" does not exist.

Error: Failed to install provider
Error while installing hashicorp/aws v5.44.0: open /tf/plugin-cache/registry.terraform.io/hashicorp/aws/5.44.0/linux_arm64/terraform-provider-aws_v5.44.0_x5: text file busy

Error: Failed to install provider
Error while installing hashicorp/kubernetes v2.30.0: open /tf/plugin-cache/registry.terraform.io/hashicorp/kubernetes/2.30.0/linux_arm64/terraform-provider-kubernetes_v2.30.0_x5: text file busy
```
@balu-ce I guess you might need to completely disable the TF provider cache (maybe by using some environment variable or by putting a configuration value into the /.terraformrc). It looks like no one mentioned this, but the fix implies both:

- mounting a PV to /tf, and
- disabling the plugin cache.

After that it makes perfect sense that you don't get any of those "text file busy" errors.
@project-administrator Mounting a PV to /tf is already done. Disabling the plugin cache is fine, but then the providers are downloaded every time, which affects the NAT cost. We have around 300+ workspaces.

Also, this did not happen at version 0.7.
@balu-ce We're still running the TF provider version 0.12 because of this issue, and I thought the /tf PV mount and disabling the TF cache could mitigate the issue... But I have not considered the NAT costs and TF downloading the providers for each workspace every hour or so (depending on how many workspaces you have and your concurrency settings).

If you have an issue with the latest TF provider version, you might try asking about it here: https://github.com/upbound/provider-terraform/issues/234

Also, it looks like the provider still does the "terraform init" on every reconciliation: https://github.com/upbound/provider-terraform/issues/230 I guess with the plugin-cache disabled, terraform will be downloading the providers until this one is fixed.

I'm also interested in getting it to work properly, but from your experience, it sounds like it's not working properly yet even with the PV mounted to /tf.
Note that the providers won't be re-downloaded for every reconcile - they're only downloaded when the workspace is created, or if the contents change to require new/different providers.
When we refer to 'the plugin cache', that's about sharing providers between workspaces; without it, the providers are still saved locally inside each workspace's directory under /tf/. The reason I suggest using the PV when not using the plugin cache is so that the per-workspace saved providers are not re-downloaded if the pod restarts.
@toastwaffle I can see that inside the running provider-terraform pod it regularly deletes the workspace directory under the "/tf" directory, then re-creates it empty, populates it from the git repository, and after that runs "terraform init". Do you mean in this case it does not download the providers while it's running "terraform init"? By "regularly deletes the workspace directory under '/tf'" I mean that under the /tf directory all directories have a creation time of less than an hour ago. I.e. if I do "cd /tf/fb37694c-15e5-4aaf-ad3b-47d705cace8c", then wait for 30-60 min and run "ls -la" there, I get an error that the directory does not exist, and I can get back into a newly recreated directory by running "cd /tf/fb37694c-15e5-4aaf-ad3b-47d705cace8c" again. So I don't get how the PV fixes the fact that each workspace directory gets recreated constantly.
That's very surprising - I've only ever used inline workspaces, which don't get recreated, so it must be something about how the git-backed workspaces are managed. Probably worth creating a separate issue for that, as it seems very inefficient to clone the git repository from scratch for every reconcile.
The provider does remove the current workspace and recreate it every time for remote repos: https://github.com/upbound/provider-terraform/blob/main/internal/controller/workspace/workspace.go#L233
It is using go-getter which has issues when retrieving a git repo into an existing directory (https://github.com/hashicorp/go-getter/issues/114), so the workaround is to remove the directory and re-pull the repo. Definitely not very efficient.
It might be better to use go-git to clone the remote repo and then use `fetch` to determine if the repo has been updated, and only run `terraform init` when `fetch` returns new commits.
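A minimal sketch of that "clone once, then fetch" idea using go-git (github.com/go-git/go-git/v5); the paths, function name, and error handling are illustrative, not the provider's actual implementation:

```go
package workspace

import (
	"errors"

	git "github.com/go-git/go-git/v5"
)

// syncRepo clones the remote repo into dir on first use and only fetches
// afterwards. It reports whether new commits were retrieved, so the caller
// could skip `terraform init` when nothing has changed. Note that Fetch only
// updates the remote refs; the caller would still need to check out the new
// commit into the working tree.
func syncRepo(dir, url string) (updated bool, err error) {
	repo, err := git.PlainOpen(dir)
	if errors.Is(err, git.ErrRepositoryNotExists) {
		_, err = git.PlainClone(dir, false, &git.CloneOptions{URL: url})
		return true, err
	}
	if err != nil {
		return false, err
	}
	err = repo.Fetch(&git.FetchOptions{})
	if errors.Is(err, git.NoErrAlreadyUpToDate) {
		return false, nil
	}
	return err == nil, err
}
```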
OK, so if I'm getting it right, currently there is no reliable way to use "--max-reconcile-rate" greater than 1 with a remote repo?
If the plugin cache is enabled, then you get some "text file busy" randomly. If the plugin cache is disabled, then you get lots of traffic re-downloading TF providers on each reconciliation.
> If the plugin cache is disabled, then you get lots of traffic re-downloading TF providers on each reconciliation.
Well it's pretty reliably doing that, but yes - the current implementation of remote repos seems to limit the concurrency to 1.
I wonder if we could use TF_DATA_DIR (https://developer.hashicorp.com/terraform/cli/config/environment-variables#tf_data_dir) to store the `.terraform` directory somewhere else so that it's not removed when the download directory is removed; then presumably if the remote content hasn't changed the checksum would be the same and it would not be necessary to run `terraform init`.

I don't know if this is directly related to #230 but it seems like it might be similar. If we move the `.terraform` directory for all Workspaces to a separate path it might solve both problems.
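As a sketch of that idea (the helper name and arguments here are hypothetical, not the provider's code), `terraform init` could be invoked with TF_DATA_DIR pointing at a persistent per-workspace path outside the checkout directory:

```go
package workspace

import (
	"context"
	"os"
	"os/exec"
	"path/filepath"
)

// initWithDataDir runs `terraform init` in the (possibly re-created) checkout
// directory while keeping the .terraform data under a separate persistent
// root via TF_DATA_DIR, so it survives when the checkout is removed.
func initWithDataDir(ctx context.Context, checkoutDir, dataRoot, workspaceID string) ([]byte, error) {
	cmd := exec.CommandContext(ctx, "terraform", "init", "-input=false")
	cmd.Dir = checkoutDir
	cmd.Env = append(os.Environ(), "TF_DATA_DIR="+filepath.Join(dataRoot, workspaceID))
	return cmd.CombinedOutput()
}
```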
What happened?
Fairly regularly, we're seeing reconciliations back up behind long running terraform apply/destroy operations. We have ~180 workspaces of two types: those which provision GKE clusters (which take 10-15 minutes to apply/destroy), and those which provision DNS records (which take 10-20 seconds to apply/destroy). On average there are twice as many DNS workspaces as GKE workspaces.
I wrote a tool to parse the provider debug logs to illustrate this; the results are in this spreadsheet. The results show each individual reconciliation, ordered by the time at which they started/finished. For confidentiality reasons the workspace names have been replaced with hashes (the first 8 bytes of a sha256 hash).
There are multiple examples of the errant behaviour, but to pick one let's look at row 3469 of the 'Ordered by Start Time' sheet. The `did init` field on this row is `unknown` because there is no logging when a Workspace's checksum is blank. What this looks like in the logs is:
This behaviour can be seen in the metrics - each time we have a GKE cluster which needs creating/deleting, the number of active workers starts to grow and the reconcile rate drops to 0. Once it finishes, we see a big spike in the reconcile rate as all the backed-up reconciles finish at once.
Hypotheses and debugging
- The long-running operations take an `RLock`, which should allow for multiple concurrent operations

What environment did it happen in?