@toastwaffle - are you running with `--max-reconcile-rate` set to a value larger than 1? By default the provider runs with that parameter set to 1, which means it is essentially single-threaded. The provider uses the terraform CLI, so it needs a CPU for each thread (potentially). If you set `max-reconcile-rate` to 5, it could use up to 5 CPUs to run 5 concurrent reconciliations. See https://github.com/upbound/provider-terraform/blob/main/README.md#known-limitations
@bobh66 yes we are, but if you look at the graphs above we're not CPU-bound. We currently have `--max-reconcile-rate=50`, with 4 CPUs requested and a limit of 7.5. I did wonder if this was related (see this comment on a different issue), but if we were being limited to our CPU count I'd expect reconciles to still be making progress instead of getting completely stuck.
I wonder if the large difference between `max-reconcile-rate` and the available CPU count is a factor? It would be interesting to see what happens if you set them both to the same value (4 in this case). If you still see the problem then I have to assume the locking is not working as expected. I'll take a look at the locking code again and see if anything obvious jumps out.
There may be a quirk to the way the locking works that is problematic. If I'm reading the code correctly:
This can result in a long-running RLock process blocking the acquisition of new RLocks if there is also a Lock pending.
The Lock request will wait for the long-running RLock process to finish before it runs, and meanwhile it will block all other RLock requests, so we end up single-threaded until the long-running RLock thread is finished, at which time the Lock process runs and then unblocks all of the other RLock requests.
Practically this means that any time a new Workspace is created, regardless of its content, it will block the reconciliation of all other Workspaces while it waits for any current RLock processes to finish, because it needs to run `terraform init`.
This is all conjecture for the moment but I think it sounds plausible. Figuring out how to solve it is the next problem.
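For reference, here is a minimal, standalone Go sketch of the `sync.RWMutex` behaviour described above (writer preference); it is not the provider's code, just an illustration of how a pending `Lock()` blocks new `RLock()` calls while an existing reader is still running:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	// Long-running reader: stands in for a slow apply/destroy holding RLock.
	mu.RLock()
	go func() {
		time.Sleep(3 * time.Second)
		mu.RUnlock()
	}()

	// Pending writer: stands in for a new Workspace's init waiting on Lock.
	go func() {
		start := time.Now()
		mu.Lock()
		fmt.Printf("writer acquired Lock after %v\n", time.Since(start))
		mu.Unlock()
	}()

	time.Sleep(100 * time.Millisecond) // give the writer time to start waiting

	// New reader: another reconciliation asking for RLock. Because a writer is
	// already queued, this call blocks until the writer has run, even though
	// only readers currently hold the lock.
	start := time.Now()
	mu.RLock()
	fmt.Printf("second reader acquired RLock after %v\n", time.Since(start))
	mu.RUnlock()
}
```

Running this prints roughly 3 seconds for both the writer and the second reader, which matches the "single-threaded until the long-running RLock finishes" behaviour.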
Oh, yeah, that makes sense, and explains why in the example above a few things did get processed while the first create was running, until the second create entered the metaphorical "queue" and called Lock().
One potential workaround: can we avoid locking entirely if the plugin cache is disabled? Currently the lock is used regardless of whether the plugin cache is enabled.
Just thinking out loud - do we need to be RLock'ing for `terraform plan`? If it fails due to some race condition we'll just retry it on the next reconciliation anyway. `init` and `apply` definitely need to Lock/RLock, and I'm not sure about `delete`. But it seems like we could mitigate much of the problem if we didn't lock on `plan` and just allowed the reconciliation to try again if it fails due to some race condition. @ytsarev @negz ?
> One potential workaround: can we avoid locking entirely if the plugin cache is disabled? Currently the lock is used regardless of whether the plugin cache is enabled

That's a possibility too - the only reason for locking is to prevent cache corruption issues, so if the cache is disabled there is no need to lock. I can make a PR for that.
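A rough sketch of what "only lock when the plugin cache is enabled" could look like; the type and method names here are hypothetical, not the provider's actual code:

```go
package workspace

import (
	"context"
	"sync"
)

// runner is an illustrative stand-in for whatever owns the shared-cache mutex.
type runner struct {
	cacheMu            sync.RWMutex
	pluginCacheEnabled bool
}

// runInit wraps a terraform init invocation. The RWMutex only exists to
// protect the shared plugin cache (init takes Lock, apply/destroy take RLock),
// so when the cache is disabled there is nothing to protect and the lock can
// be skipped entirely.
func (r *runner) runInit(ctx context.Context, run func(context.Context) error) error {
	if r.pluginCacheEnabled {
		r.cacheMu.Lock()
		defer r.cacheMu.Unlock()
	}
	return run(ctx)
}
```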
Not locking for a plan operation would also help, but we'd still potentially underutilise concurrent reconciles. Challenging to fix without risking starvation though :/
Yes - new resource creation during long-running applies/deletes will still cause throttling, but at least the other reconciliations could still run assuming no changes in the `terraform plan` output.
A better overall solution may be to replace the in-memory storage of terraform workspaces with a persistent volume that can be sized as needed. That would allow sufficient storage space to run with caching disabled without using excessive memory, and it also allows pod updates to not have to re-run terraform init on all Workspaces since the workspaces are persistent. I'm pretty sure we could do this today using a `DeploymentRuntimeConfig` mounted on `/tf`.

It doesn't even need to be mounted on `/tf`; the `XP_TF_DIR` env var can make it use a different location.
@bobh66 we intentionally avoid the lock on `plan` https://github.com/upbound/provider-terraform/blob/main/internal/terraform/terraform.go#L503 so your idea sounds safe.

~~I think 503 is the creation of the Command object, but it doesn't get executed until https://github.com/upbound/provider-terraform/blob/main/internal/terraform/terraform.go#L514, which is after the `RLock()` call.~~
I think @ytsarev was referring to the `-lock=false` flag?
Should I remove the locking for plans in #240, or is that best done separately?
> I think @ytsarev was referring to the `-lock=false` flag?
You're right - my mistake. Apologies @ytsarev
> Should I remove the locking for plans in https://github.com/upbound/provider-terraform/pull/240, or is that best done separately?
I think it makes sense to do it in #240
Just to report back on this - I just upgraded our production cluster to 0.14.1 successfully, and discovered a small side effect of not locking for plans. As the provider was starting up and recreating all of the workspaces (hopefully for the last time as we added a persistent volume at the same time), we saw the same plugin cache conflict errors that we saw previously:
```
Error: Failed to install provider

Error while installing hashicorp/google-beta v5.0.0: open
/tf/plugin-cache/registry.terraform.io/hashicorp/google-beta/5.0.0/linux_amd64/terraform-provider-google-beta_v5.0.0_x5:
text file busy
```
What I think is happening here is that while we do only have one `terraform init` running, the fact that `terraform plan` operations can run at the same time causes a conflict. I didn't reproduce that in my local testing because I wasn't creating Workspaces fast enough to trigger it, and I already had the persistent volume.
The backoff and retries resolved this eventually - on startup it took about 15 minutes to re-initialise our 150 workspaces (with `--max-reconcile-rate=50`; our maximum CPU usage was 2.5 CPUs). However, those errors aren't ideal, so maybe we want to reinstate the locking for `terraform plan`?
I saw the same behavior but I'm not sure it merits reinstating the lock for `plan`. My next step is to create a PVC and mount it to the pod so that the `/tf` directory is persistent across reboots. That should eliminate the need to run `init` on all Workspaces on a restart, and also preserve the cache content. That seems like a more stable solution in general, especially for installations with large numbers of Workspaces.
Ack, the PV has definitely helped us. Looking at our graphs today with lots of Workspaces being created and destroyed, I'm confident that this is now fixed.
Thank you Bob for figuring out the root cause and helping get the fix reviewed and released!
Ah, sorry for all the issues with the locks. The terraform cache is really, really picky. I feel like the execution model for the terraform plugin is just poor in general. A PVC would help. I think sharding would also help greatly if you have a lot of terraform workspaces. @toastwaffle can you share your Grafana graph JSON by chance? I would like to analyze our pipelines similarly.
We're using some recording rules and custom metrics, but the raw PromQL queries are:
```
sum by (controller)(controller_runtime_max_concurrent_reconciles{controller="managed/workspace.tf.upbound.io"})
sum by (controller)(controller_runtime_active_workers{controller="managed/workspace.tf.upbound.io"})
sum by (controller)(rate(controller_runtime_reconcile_total{controller="managed/workspace.tf.upbound.io"}[5m]))
sum by ()(rate(container_cpu_usage_seconds_total{container!="",namespace="crossplane-system",pod=~"upbound-provider-terraform.*"}[5m]))
sum by ()(container_memory_working_set_bytes{container!="",namespace="crossplane-system",pod=~"upbound-provider-terraform.*"})
sum by ()(container_memory_usage_bytes{container!="",namespace="crossplane-system",pod=~"upbound-provider-terraform.*"})
```
I am facing the issue even with the PV enabled.
```
Warning: Unable to open CLI configuration file
The CLI configuration file at "./.terraformrc" does not exist.

Error: Failed to install provider
Error while installing hashicorp/aws v5.44.0: open /tf/plugin-cache/registry.terraform.io/hashicorp/aws/5.44.0/linux_arm64/terraform-provider-aws_v5.44.0_x5: text file busy

Error: Failed to install provider
Error while installing hashicorp/kubernetes v2.30.0: open /tf/plugin-cache/registry.terraform.io/hashicorp/kubernetes/2.30.0/linux_arm64/terraform-provider-kubernetes_v2.30.0_x5: text file busy
```
@balu-ce I guess you might need to completely disable the TF provider cache (maybe by using some environment variable or by putting a configuration value into the /.terraformrc). It looks like no one mentioned this, but the fix implies both:

- mounting a PV to /tf, and
- disabling the plugin cache.

After that it makes perfect sense that you don't get any of those "text file busy" errors.
@project-administrator Mounting a PV to /tf is already done. Disabling the plugin cache is fine, but then the providers are downloaded every time, which affects the NAT cost. We have around 300+ workspaces.

Also, this did not happen at version 0.7.
@balu-ce We're still running the TF provider version 0.12 because of this issue, and I thought the /tf PV mount and disabling the TF cache could mitigate the issue... But I have not considered the NAT costs and TF downloading the providers for each workspace every hour or so (depending on how many workspaces you have and your concurrency settings).

If you have an issue with the latest TF provider version, you might try asking about it here: https://github.com/upbound/provider-terraform/issues/234

Also, it looks like the provider still does the "terraform init" on every reconciliation: https://github.com/upbound/provider-terraform/issues/230 I guess with the plugin-cache disabled, terraform will be downloading the providers until this one is fixed.

I'm also interested in getting it to work properly, but from your experience, it sounds like it's not working properly yet even with the PV mounted to /tf.
Note that the providers won't be re-downloaded for every reconcile - they're only downloaded when the workspace is created, or if the contents change to require new/different providers.
When we refer to 'the plugin cache', that's about sharing providers between workspaces; without it, the providers are still saved locally inside each workspace's directory under /tf/. The reason I suggest using the PV when not using the plugin cache is so that the per-workspace saved providers are not re-downloaded if the pod restarts.
@toastwaffle I can see that inside the running provider-terraform pod it regularly deletes the workspace directory under the "/tf" directory, then re-creates it empty, populates it from the git repository, and after that runs "terraform init". Do you mean in this case it does not download the providers while it's running "terraform init"? By "regularly deletes the workspace directory under '/tf'" I mean that under the /tf directory all directories have a creation time of less than an hour ago. I.e. if I do "cd /tf/fb37694c-15e5-4aaf-ad3b-47d705cace8c", then wait for 30-60 min and run "ls -la" there, I get an error that the directory does not exist, and I can get back into a newly recreated directory by running "cd /tf/fb37694c-15e5-4aaf-ad3b-47d705cace8c" again. So I don't get how the PV fixes the fact that each workspace directory gets recreated constantly.
That's very surprising - I've only ever used inline workspaces, which don't get recreated, so it must be something about how the git-backed workspaces are managed. Probably worth creating a separate issue for that, as it seems very inefficient to clone the git repository from scratch for every reconcile.
The provider does remove the current workspace and recreate it every time for remote repos: https://github.com/upbound/provider-terraform/blob/main/internal/controller/workspace/workspace.go#L233
It is using go-getter which has issues when retrieving a git repo into an existing directory (https://github.com/hashicorp/go-getter/issues/114), so the workaround is to remove the directory and re-pull the repo. Definitely not very efficient.
It might be better to use go-git to clone the remote repo and then use `fetch` to determine if the repo has been updated, and only run `terraform init` when `fetch` returns new commits.
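A minimal sketch of that "clone once, then fetch" idea using go-git (github.com/go-git/go-git/v5); the paths, function name, and error handling are illustrative, not the provider's actual implementation:

```go
package workspace

import (
	"errors"

	git "github.com/go-git/go-git/v5"
)

// syncRepo clones the remote repo into dir on first use and only fetches
// afterwards. It reports whether new commits were retrieved, so the caller
// could skip `terraform init` when nothing has changed. Note that Fetch only
// updates the remote refs; the caller would still need to check out the new
// commit into the working tree.
func syncRepo(dir, url string) (updated bool, err error) {
	repo, err := git.PlainOpen(dir)
	if errors.Is(err, git.ErrRepositoryNotExists) {
		_, err = git.PlainClone(dir, false, &git.CloneOptions{URL: url})
		return true, err
	}
	if err != nil {
		return false, err
	}
	err = repo.Fetch(&git.FetchOptions{})
	if errors.Is(err, git.NoErrAlreadyUpToDate) {
		return false, nil
	}
	return err == nil, err
}
```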
OK, so if I'm getting it right, currently there is no reliable way to use "--max-reconcile-rate" greater than 1 with a remote repo?
If the plugin cache is enabled, then you get some "text file busy" randomly. If the plugin cache is disabled, then you get lots of traffic re-downloading TF providers on each reconciliation.
> If the plugin cache is disabled, then you get lots of traffic re-downloading TF providers on each reconciliation.
Well it's pretty reliably doing that, but yes - the current implementation of remote repos seems to limit the concurrency to 1.
I wonder if we could use TF_DATA_DIR (https://developer.hashicorp.com/terraform/cli/config/environment-variables#tf_data_dir) to store the `.terraform` directory somewhere else so that it's not removed when the download directory is removed; then presumably if the remote content hasn't changed the checksum would be the same and it would not be necessary to run `terraform init`.

I don't know if this is directly related to #230 but it seems like it might be similar. If we move the `.terraform` directory for all Workspaces to a separate path it might solve both problems.
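As a sketch of that idea (the helper name and arguments here are hypothetical, not the provider's code), `terraform init` could be invoked with TF_DATA_DIR pointing at a persistent per-workspace path outside the checkout directory:

```go
package workspace

import (
	"context"
	"os"
	"os/exec"
	"path/filepath"
)

// initWithDataDir runs `terraform init` in the (possibly re-created) checkout
// directory while keeping the .terraform data under a separate persistent
// root via TF_DATA_DIR, so it survives when the checkout is removed.
func initWithDataDir(ctx context.Context, checkoutDir, dataRoot, workspaceID string) ([]byte, error) {
	cmd := exec.CommandContext(ctx, "terraform", "init", "-input=false")
	cmd.Dir = checkoutDir
	cmd.Env = append(os.Environ(), "TF_DATA_DIR="+filepath.Join(dataRoot, workspaceID))
	return cmd.CombinedOutput()
}
```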
What happened?
Fairly regularly, we're seeing reconciliations back up behind long running terraform apply/destroy operations. We have ~180 workspaces of two types: those which provision GKE clusters (which take 10-15 minutes to apply/destroy), and those which provision DNS records (which take 10-20 seconds to apply/destroy). On average there are twice as many DNS workspaces as GKE workspaces.
I wrote a tool to parse the provider debug logs to illustrate this; the results are in this spreadsheet. The results show each individual reconciliation, ordered by the time at which they started/finished. For confidentiality reasons the workspace names have been replaced with hashes (the first 8 bytes of a sha256 hash).
There are multiple examples of the errant behaviour, but to pick one let's look at row 3469 of the 'Ordered by Start Time' sheet. The `did init` field on this row is `unknown` because there is no logging when a Workspace's checksum is blank. What this looks like in the logs is:
This behaviour can be seen in the metrics - each time we have a GKE cluster which needs creating/deleting, the number of active workers starts to grow and the reconcile rate drops to 0. Once it finishes, we see a big spike in the reconcile rate as all the backed-up reconciles finish at once.
Hypotheses and debugging
- The long-running operations take an `RLock`, which should allow for multiple concurrent operations

What environment did it happen in?