gruntwork-io / terragrunt

Terragrunt is a flexible orchestration tool that allows Infrastructure as Code written in OpenTofu/Terraform to scale.
https://terragrunt.gruntwork.io/
MIT License

Standalone provider cache server #3363

Open erpel opened 2 months ago

erpel commented 2 months ago

Summary

Building on the interest in #3231, this issue makes the case for a terragrunt subcommand that starts a standalone provider cache server. In that discussion, creating an RFC was suggested.

Having a single cache server instance serving many parallel terragrunt invocations would improve efficiency on Terraform Automation & Collaboration Software (TACOS) platforms like Atlantis.

Motivation

Organisations running self-hosted TACOS platforms, or similar services like Atlantis for merge request workflow integration, often find themselves with systems that run terragrunt plan/apply and other commands frequently and in parallel. Existing provider caching proved lacking for such situations, which led to Terragrunt adding its own provider caching functionality.

For these systems, a single cache storage location is the most efficient way to use the cache. Past issues have shown that spinning up many cache servers in parallel, all pointing to the same directories, can lead to locking issues, among other problems.

Running a single permanent cache server on the automation host allows for the most efficient use of the cache. All terragrunt processes launched could connect to the same cache server, which would serve providers from a unified cache location without locking issues.

Proposal

Introduce a new subcommand such as terragrunt cache-server. This command would do nothing but start a cache server and would not return unless a fatal error is encountered or a stop signal is received. Server parameters could be set using established terragrunt configuration methods such as command line arguments or environment variables. These parameters include the cache directory location, registry configuration, listening host/port, and the authentication token.
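As a rough sketch, an invocation might look like the following. The subcommand name and every flag shown are illustrative assumptions made for this proposal, not an existing interface:

```shell
# Hypothetical invocation -- subcommand and flags are illustrative only.
# The server would block until a fatal error occurs or a stop signal
# is received.
terragrunt cache-server \
  --provider-cache-dir /var/cache/terragrunt/providers \
  --host 127.0.0.1 \
  --port 5758 \
  --token "$CACHE_SERVER_TOKEN"
```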

Users of a system like Atlantis could add the cache server process to the host/pod running Atlantis and extend the Atlantis configuration to invoke terragrunt in pipelines with the settings needed to connect to the standalone cache server. This would include enabling caching and providing a server URI and authentication token. How these options are added would differ depending on the actual TACOS used.
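Terragrunt's existing provider cache settings are already exposed through environment variables, so a pipeline wrapper might look roughly like this. Note that today these variables configure Terragrunt's embedded cache server; reusing them (or close variants) to point at an external standalone server is an assumption of this proposal:

```shell
# Sketch of client-side settings for a CI job -- reusing the existing
# TERRAGRUNT_PROVIDER_CACHE* variables to reach an external server is
# an assumption of this proposal, not current behavior.
export TERRAGRUNT_PROVIDER_CACHE=1
export TERRAGRUNT_PROVIDER_CACHE_HOSTNAME=127.0.0.1
export TERRAGRUNT_PROVIDER_CACHE_PORT=5758
export TERRAGRUNT_PROVIDER_CACHE_TOKEN="$CACHE_SERVER_TOKEN"
terragrunt plan
```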

Technical Details

Press Release

Standalone provider cache server for efficient TACOS hosting

Terragrunt introduces the ability to run a standalone cache server, giving TACOS operators more control to ensure efficient reuse of downloaded providers.

A single cache server process supports effectively unlimited parallelism, allowing operators to scale workflow automation efficiently with minimal overhead.

The standalone cache server is available as of [RELEASE]. To learn more about how to integrate it with your self-hosted TACOS, check the documentation.

Drawbacks

Operating the cache server as an additional component increases overhead for teams providing TACOS. This includes, but is not limited to, keeping it up to date and monitoring its availability.

Terragrunt might need to improve how it handles situations where a cache server is supposed to be used but can't be reached; the added complexity will complicate troubleshooting in some scenarios.

This could enable sharing a cache server with untrusted entities, which might introduce security issues such as cache poisoning into the setup.

A long-running cache server shared across many terragrunt invocations may increase the requirements for the cache server implementation itself compared to running a server for a shorter time with a limited scope.

Alternatives

Migration Strategy

None required

Unresolved Questions

Are there other use cases for this outside of hosting systems that run terragrunt as a service integrated into team workflows (TACOS)?

Do other systems bring additional requirements to be able to integrate a standalone server?

Would running a central server on a network location be a useful scenario? This might introduce many additional security considerations compared to running via localhost only.

References

Proof of Concept Pull Request

No response

Support Level

Customer Name

No response

yhakbar commented 2 months ago

Deleted a comment that was likely an attempt to get folks to download malware. Reported the user to GitHub.

yhakbar commented 2 months ago

How do you imagine this external cache server is hosted, @erpel ? As a separate container/server with a file system mount to allow for the cached providers to be accessed?

Can you explain why multiple Terragrunt processes are running instead of a run-all invocation? If I'm understanding right, that would result in one Terragrunt process spinning up one goroutine for the server and multiple goroutines for the underlying OpenTofu/Terraform executions, right?

erpel commented 2 months ago

Thanks for your interest.

In our situation I'd like to add the cache server as a separate container in the Atlantis pod, configure it to listen on localhost, and use an EFS file system mounted in both the main container and the cache container at the same path. The question about the file system made me realize that a cache server on a different host makes no sense, since both sides, cache server and client, require file system access.
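For illustration, that sidecar layout could be sketched like this. The image name, the cache-server subcommand, and its flags are all hypothetical; the key points are the shared EFS-backed cache path and the shared localhost network namespace:

```shell
# Hypothetical sidecar next to the Atlantis container -- the image and
# the cache-server subcommand do not exist today.
docker run -d --name terragrunt-cache \
  --network container:atlantis \
  -v /mnt/efs/terragrunt-cache:/var/cache/terragrunt/providers \
  example/terragrunt cache-server \
    --provider-cache-dir /var/cache/terragrunt/providers \
    --host 127.0.0.1 --port 5758
```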

Our setup with Atlantis has one instance covering several repositories and some are "monorepos" with many teams working on them. We're not using any of the run-all commands at the moment, our structure is not laid out in a way that makes that immediately useful. Even with that remedied, unrelated MRs would still be running as separate invocations, so terragrunt is likely to always be active multiple times in parallel.

gnuletik commented 1 month ago

We are also looking for this feature but for a different use-case.

When we are running multiple terragrunt commands like this:

terragrunt apply -target 'aws_s3_bucket_policy.bucket_policy[0]'
terragrunt apply -target 'aws_s3_bucket_policy.bucket_policy[1]'

We have to wait for the provider cache to start and stop for each command.

On projects with many providers, starting the provider cache can be slow (~30 seconds). It is quite cumbersome to wait for the provider cache to start between commands when it could be reused across them.

Using the provider cache can become impractical if you have a lot of these commands to run.
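With the standalone server proposed here, that sequence could pay the startup cost only once. As a sketch (the cache-server subcommand is hypothetical; the TERRAGRUNT_PROVIDER_CACHE variable exists today but only enables the embedded server):

```shell
# Hypothetical: start the server once, then reuse it for each command.
terragrunt cache-server --port 5758 &   # assumed subcommand
export TERRAGRUNT_PROVIDER_CACHE=1
terragrunt apply -target 'aws_s3_bucket_policy.bucket_policy[0]'
terragrunt apply -target 'aws_s3_bucket_policy.bucket_policy[1]'
```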

alonalmog82 commented 3 days ago

I +1 all of the above use cases - they all apply to our implementation.

I'll add that even without a TF collaboration framework, a single developer working on a dozen terragrunt modules, planning and testing multiple times, would benefit too: I'd enable a local cache server on my laptop just to avoid the 2 sec spin-up time on each of the 100 plans I'd run that day. It's annoying.

p5 commented 2 days ago

Throwing thoughts out there: could this just utilise existing object storage, rather than being an entirely new service to host? For example, configuration that sets the equivalent of something like:

plugin_cache_dir = "s3://my-cache-bucket/plugin-cache"

The administrator can then choose whether this is an S3 or MinIO bucket, or some K8s volume management service (Longhorn? Not too familiar with K8s)

alonalmog82 commented 9 hours ago

@p5, using an external object store defeats the purpose of using a local cache. Additionally, this would require setting up an additional object store and would force us to handle authentication with it.