brikis98 opened this issue 9 months ago
- Providers. By default, Terraform downloads providers from scratch every time you run `init`. This isn't a problem for a single module, but if you do `run-all` in a repo that has, say, 50 `terragrunt.hcl` files, each one runs `init` on a TF module, and each module downloads an average of, say, 10 providers, then that's 50 * 10 = 500 provider downloads—even if it's the exact same 10 providers across all 50 modules!

We can offer users TF features such as provider plugin caching, but there is a drawback: Terraform does not guarantee safe concurrency (Terraform issue #31964, Terragrunt issue #1875), so we cannot perform initialization in parallel for, say, 50 `terragrunt.hcl` files, and therefore the `--terragrunt-parallelism 1` flag is mandatory. The solution may be to implement our own provider loader, by reviewing the Terraform code and integrating similar logic into the Terragrunt code, taking into account the following (see the CLI-config sketch after this list):

- First scan the Terragrunt modules and create a list of the necessary Terraform providers.
- Download all providers in parallel into a shared cache directory with the structure that `provider_installation` expects: `HOSTNAME/NAMESPACE/TYPE/VERSION/TARGET`.
- Create `.terraform/providers/` directories with symbolic links for each module, or use `provider_installation`, which does effectively the same thing.
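For reference, a minimal sketch of what a shared provider cache wired up through `provider_installation` could look like in the Terraform CLI config; the cache path is an assumption, not an existing Terragrunt convention:

```hcl
# Illustrative ~/.terraformrc (terraform.rc on Windows); the mirror path is a placeholder.
provider_installation {
  # Providers already present in the shared cache are served from disk. The mirror
  # directory uses the unpacked layout HOSTNAME/NAMESPACE/TYPE/VERSION/TARGET.
  filesystem_mirror {
    path    = "/home/user/.terragrunt/provider-cache"
    include = ["registry.terraform.io/*/*"]
  }

  # Anything not covered by the mirror still comes straight from the origin registry.
  direct {
    exclude = ["registry.terraform.io/*/*"]
  }
}
```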
- Repos. When you set a `source` URL in TG, it downloads the whole repo into a `.terragrunt-cache` folder. If you have 50 `terragrunt.hcl` files with `source` URLs, and do `run-all`, it will download the repos 50 times—even if all 50 repos are the same!

The first thing that comes to mind is to create a common cache for all Terragrunt modules, but here we are faced with several issues:

- Since the `.terragrunt-cache` directories are where Terraform creates the `.terraform` directories, we again run into a concurrency issue.
- Terragrunt copies its `*.tf`/`*.hcl` files into the modules of these downloaded repositories. We need to implement a different approach in which the repositories downloaded into `.terragrunt-cache` are left in their original state, to avoid conflicts.
- Git clone. I think TG is doing a "full" Git clone for the code in the `source` URL. We should consider doing a shallow clone, as that would be much faster/smaller.

This one will be easy. I checked: a shallow clone is roughly a third to a half the size of a full clone.
- Modules. When you run `init`, Terraform downloads repos to a `.modules` folder. If you have 50 `terragrunt.hcl` files, each of which has a `source` URL pointing to TF code that contains, say, 10 modules, then when you do `run-all`, you'll end up downloading 50 * 10 = 500 repos—even if all the modules are exactly the same!

As far as I know, Terraform does not have a module caching feature. But we can implement it the same way as with providers, that is, download modules into a common cache directory and then create symbolic links. I checked: Terraform works fine with module dirs that refer to other dirs (see the sketch below). To summarize, I would suggest implementing something like `terraform get`, but as our own `run-all get` implementation based on the Terraform code.
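A minimal sketch of what that could look like from a module's point of view; the cache directory name is hypothetical, not an existing Terragrunt path:

```hcl
# Hypothetical: the module source is a relative path whose target directory is a
# symlink into a shared, pre-downloaded module cache (the directory name is an assumption).
module "vpc" {
  source = "./.terragrunt-module-cache/terraform-aws-vpc"
}
```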
- Ephemeral storage. Many users run TG in places with ephemeral storage—e.g., in a K8S cluster—where the disk is totally empty/fresh on each run. So all the stuff you downloaded in previous runs isn't even available, and you have to download everything from scratch each time.
I don't know how to solve this issue. Any suggestions?
I just took a quick look at the Terraform code, and unfortunately the code we need is located in the `internal` directory, so we cannot use it as a Go package without copying it into our codebase.

Of course, the obvious disadvantage is that if they suddenly change module and provider loading radically, we will also need to update our code. But given that this can happen, we could deliver the new caching feature to users not as a default option, but as a deliberate, explicit choice. In other words, they would explicitly run `run-all init`/`run-all get`/`run-all cache`/... (I'm not sure what the best command name for this feature would be), which creates the shared cache.
> Providers. By default, Terraform downloads providers from scratch every time you run `init`. This isn't a problem for a single module, but if you do `run-all` in a repo that has, say, 50 `terragrunt.hcl` files, each one runs `init` on a TF module, each module downloads an average of, say, 10 providers, then that's 50 * 10 = 500 provider downloads—even if it's the exact same 10 providers across all 50 modules!
>
> We can offer users TF features such as provider plugin caching, but there is a drawback: Terraform does not guarantee safe concurrency (Terraform issue #31964, Terragrunt issue #1875), so we cannot perform initialization in parallel for, say, 50 `terragrunt.hcl` files, and therefore the `--terragrunt-parallelism 1` flag is mandatory. The solution may be to implement our own provider loader, by reviewing the Terraform code and integrating similar logic into the Terragrunt code, taking into account:
>
> - First scan the Terragrunt modules and create a list of the necessary Terraform providers.
> - Download all providers in parallel into a shared cache directory with the structure that `provider_installation` expects: `HOSTNAME/NAMESPACE/TYPE/VERSION/TARGET`.
> - Create `.terraform/providers/` directories with symbolic links for each module, or use `provider_installation`, which does effectively the same thing.
I'm a bit worried about duplicating much of Terraform's own logic for discovering and downloading providers. Most of that is internal logic and not part of a public API with compatibility guarantees, which may make it tough to keep up to date as Terraform and OpenTofu change.
Here's a bit of a zany idea that leverages their public API: in the `provider_installation` configuration, you can specify a `network_mirror`. What if, when you run Terragrunt, it:

1. Fires up a web server listening on localhost that implements the provider mirror network protocol (maybe there's even open source Go code that does this already).
   - Actually, we might first ping the URL to see if a TG server is already running there, and if so, just use that one. This handles the case where you have multiple instances of TG running concurrently.
   - The server should only listen on `localhost` (not `0.0.0.0`) so it's accessible only from the local computer.
   - We should also configure randomly-generated credentials for the server so random websites you open on your computer can't make requests to `localhost`.
2. Configures Terraform to use that localhost URL as its `network_mirror` (a sketch of the generated CLI config follows this list).
3. When Terraform queries the localhost server for a provider, the server, in memory, can maintain locks on a per-provider basis:
   - If no one has the lock for this provider already, Terragrunt checks the local disk (in a predictable file path) to see if the provider is already there. If it is, it gives Terraform a local file path. If it's not already on disk, it tells Terraform to download the provider from whatever the original URL was to the predictable file path on disk (I think this is doable with the provider mirror network protocol, but not 100% sure).
   - If someone already has the lock, then Terragrunt has Terraform wait, and then looks up the provider from disk.

We'd probably make this feature opt-in, at least initially. Once you turn it on, you get provider caching automatically, in a way that should be concurrency safe.
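For concreteness, here is a sketch of the CLI config Terragrunt might generate for step (2). The port and URL are placeholders; note that Terraform's documentation requires network mirror URLs to use https, so the local server would need to serve TLS or the design would need to work around that requirement:

```hcl
# Hypothetical CLI config generated per Terragrunt run; the port is a placeholder.
provider_installation {
  network_mirror {
    url = "https://localhost:5758/providers/"
  }
}
# Randomly generated credentials for the mirror host could additionally be set up
# via a credentials block, so arbitrary local processes can't hit the server.
```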
> The first thing that comes to mind is to create a common cache for all Terragrunt modules, but here we are faced with several issues:
>
> - Since the `.terragrunt-cache` directories are where Terraform creates the `.terraform` directories, we again run into a concurrency issue.
> - Terragrunt copies its `*.tf`/`*.hcl` files into the modules of these downloaded repositories. We need to implement a different approach in which the repositories downloaded into `.terragrunt-cache` are left in their original state, to avoid conflicts.
Yea, both of these are valid issues. Any ideas on solutions? Do the suggestions in https://github.com/gruntwork-io/terragrunt/issues/2923, especially a content-addressable store similar to pnpm with symlinks, offer a potential solution?
> Git clone. I think TG is doing a "full" Git clone for the code in the `source` URL. We should consider doing a shallow clone, as that would be much faster/smaller.
>
> This one will be easy. I checked: a shallow clone is roughly a third to a half the size of a full clone.
It's easy for most things, but as I found out in https://github.com/gruntwork-io/terragrunt/pull/2893, one issue we hit with shallow clones is that the `catalog` command uses the Git repo to look up tags, which you can't do with a shallow clone. So we may need some conditional logic where we use shallow clones by default, but if something needs to do a look up in Git history, we swap to a full clone.
> Modules. When you run `init`, Terraform downloads repos to a `.modules` folder. If you have 50 `terragrunt.hcl` files, each of which has a `source` URL pointing to TF code that contains, say, 10 modules, then when you do `run-all`, you'll end up downloading 50 * 10 = 500 repos—even if all the modules are exactly the same!
>
> As far as I know, Terraform does not have a module caching feature. But we can implement it the same way as with providers, that is, download modules into a common cache directory and then create symbolic links. I checked: Terraform works fine with module dirs that refer to other dirs. To summarize, I would suggest implementing something like `terraform get`, but as our own `run-all get` implementation based on the Terraform code.
I'm a bit worried about duplicating much of Terraform's own logic for discovering and downloading modules. Most of that is internal logic and not part of a public API with compatibility guarantees, which may make it tough to keep up to date as Terraform and OpenTofu change.

Are there any hooks for downloading modules? For example, how does Terraform work in an air-gapped environment?
> Ephemeral storage. Many users run TG in places with ephemeral storage—e.g., in a K8S cluster—where the disk is totally empty/fresh on each run. So all the stuff you downloaded in previous runs isn't even available, and you have to download everything from scratch each time.
>
> I don't know how to solve this issue. Any suggestions?
This would mostly be about documenting how to persist data, such as a provider cache, in a K8S cluster: e.g., with persistent volumes.
> Providers. By default, Terraform downloads providers from scratch every time you run `init`. This isn't a problem for a single module, but if you do `run-all` in a repo that has, say, 50 `terragrunt.hcl` files, each one runs `init` on a TF module, each module downloads an average of, say, 10 providers, then that's 50 * 10 = 500 provider downloads—even if it's the exact same 10 providers across all 50 modules!
>
> I'm a bit worried about duplicating much of Terraform's own logic for discovering and downloading providers. Most of that is internal logic and not part of a public API with compatibility guarantees, which may make it tough to keep up to date as Terraform and OpenTofu change.
Agree.
> Here's a bit of a zany idea that leverages their public API: in the `provider_installation` configuration, you can specify a `network_mirror`. What if, when you run Terragrunt, it:
>
> 1. Fires up a web server listening on localhost that implements the provider mirror network protocol (maybe there's even open source Go code that does this already).
>    - Actually, we might first ping the URL to see if a TG server is already running there, and if so, just use that one. This handles the case where you have multiple instances of TG running concurrently.
>    - The server should only listen on `localhost` (not `0.0.0.0`) so it's accessible only from the local computer.
>    - We should also configure randomly-generated credentials for the server so random websites you open on your computer can't make requests to `localhost`.
> 2. Configures Terraform to use that localhost URL as its `network_mirror`.
> 3. When Terraform queries the localhost server for a provider, the server, in memory, can maintain locks on a per-provider basis:
>    - If no one has the lock for this provider already, Terragrunt checks the local disk (in a predictable file path) to see if the provider is already there. If it is, it gives Terraform a local file path. If it's not already on disk, it tells Terraform to download the provider from whatever the original URL was to the predictable file path on disk (I think this is doable with the provider mirror network protocol, but not 100% sure).
>    - If someone already has the lock, then Terragrunt has Terraform wait, and then looks up the provider from disk.
>
> We'd probably make this feature opt-in, at least initially. Once you turn it on, you get provider caching automatically, in a way that should be concurrency safe.
Great idea. Yes, there are; one of them, https://github.com/terralist/terralist, supports Module and Provider registries.

But I am not sure that this will solve all our issues. Yes, we can reduce the amount of traffic by reading an already-downloaded plugin from disk, but each Terraform process will still store the plugin it receives from the network mirror in its own `.terraform` directory. So the disk usage will be the same, plus one more copy for the (proxy) private registry.
> The first thing that comes to mind is to create a common cache for all Terragrunt modules, but here we are faced with several issues:
>
> - Since the `.terragrunt-cache` directories are where Terraform creates the `.terraform` directories, we again run into a concurrency issue.
> - Terragrunt copies its `*.tf`/`*.hcl` files into the modules of these downloaded repositories. We need to implement a different approach in which the repositories downloaded into `.terragrunt-cache` are left in their original state, to avoid conflicts.
>
> Yea, both of these are valid issues. Any ideas on solutions?
- We can store `.terraform` data separately from the repositories by changing the path with `TF_DATA_DIR`.
- I cannot say right now, but I'm sure there is a solution.

> Do the suggestions in #2923, especially a content-addressable store similar to pnpm with symlinks, offer a potential solution?

In the case of npm this is justified, since npm itself understands where to get which files from. In our case, Terraform needs to be presented with a regular file structure, and for that we would have to create hundreds or thousands of symlinks. Building such a store is itself a non-trivial task, and considering that modules are not as big as plugins, I'm not sure it's worth spending so much time and resources on it.
> Git clone. I think TG is doing a "full" Git clone for the code in the `source` URL. We should consider doing a shallow clone, as that would be much faster/smaller.
>
> This one will be easy. I checked: a shallow clone is roughly a third to a half the size of a full clone.
>
> It's easy for most things, but as I found out in #2893, one issue we hit with shallow clones is that the `catalog` command uses the Git repo to look up tags, which you can't do with a shallow clone. So we may need some conditional logic where we use shallow clones by default, but if something needs to do a look up in Git history, we swap to a full clone.
Ah, will keep this in mind, thanks.
> Modules. When you run `init`, Terraform downloads repos to a `.modules` folder. If you have 50 `terragrunt.hcl` files, each of which has a `source` URL pointing to TF code that contains, say, 10 modules, then when you do `run-all`, you'll end up downloading 50 * 10 = 500 repos—even if all the modules are exactly the same!
>
> Are there any hooks for downloading modules? For example, how does Terraform work in an air-gapped environment?
We can run a private registry locally, but then we need to change the module links to point to this private registry. Perhaps we could do this automatically, after cloning the repos into the cache directory, but this does not solve the disk usage issue, although compared to plugins this may not be so critical.
> Ephemeral storage. Many users run TG in places with ephemeral storage—e.g., in a K8S cluster—where the disk is totally empty/fresh on each run. So all the stuff you downloaded in previous runs isn't even available, and you have to download everything from scratch each time.
>
> I don't know how to solve this issue. Any suggestions?
>
> This would mostly be about documenting how to persist data, such as a provider cache, in a K8S cluster: e.g., with persistent volumes.
Ah, understood.
> Here's a bit of a zany idea that leverages their public API: in the `provider_installation` configuration, you can specify a `network_mirror`. What if, when you run Terragrunt, it:
>
> 1. Fires up a web server listening on localhost that implements the provider mirror network protocol (maybe there's even open source Go code that does this already).
>    - Actually, we might first ping the URL to see if a TG server is already running there, and if so, just use that one. This handles the case where you have multiple instances of TG running concurrently.
>    - The server should only listen on `localhost` (not `0.0.0.0`) so it's accessible only from the local computer.
>    - We should also configure randomly-generated credentials for the server so random websites you open on your computer can't make requests to `localhost`.
> 2. Configures Terraform to use that localhost URL as its `network_mirror`.
> 3. When Terraform queries the localhost server for a provider, the server, in memory, can maintain locks on a per-provider basis:
>    - If no one has the lock for this provider already, Terragrunt checks the local disk (in a predictable file path) to see if the provider is already there. If it is, it gives Terraform a local file path. If it's not already on disk, it tells Terraform to download the provider from whatever the original URL was to the predictable file path on disk (I think this is doable with the provider mirror network protocol, but not 100% sure).
>    - If someone already has the lock, then Terragrunt has Terraform wait, and then looks up the provider from disk.
>
> We'd probably make this feature opt-in, at least initially. Once you turn it on, you get provider caching automatically, in a way that should be concurrency safe.
>
> Great idea. Yes, there are; one of them, https://github.com/terralist/terralist, supports Module and Provider registries. But I am not sure that this will solve all our issues. Yes, we can reduce the amount of traffic by reading an already-downloaded plugin from disk, but each Terraform process will still store the plugin it receives from the network mirror in its own `.terraform` directory. So the disk usage will be the same, plus one more copy for the (proxy) private registry.
If we also enable the plugin cache (which TG could enable via env var automatically when executing `terraform`), I think TF will use a symlink to the cache, rather than copying the whole thing again. Can you think of a quick & dirty way to test out these hypotheses and see if this is a viable path forward?
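For reference, the plugin cache mentioned here can be enabled either via the `TF_PLUGIN_CACHE_DIR` environment variable or in the Terraform CLI config; the path below is illustrative:

```hcl
# Equivalent CLI-config form of TF_PLUGIN_CACHE_DIR; the directory is just an example.
plugin_cache_dir = "$HOME/.terraform.d/plugin-cache"
```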
> The first thing that comes to mind is to create a common cache for all Terragrunt modules, but here we are faced with several issues:
>
> - Since the `.terragrunt-cache` directories are where Terraform creates the `.terraform` directories, we again run into a concurrency issue.
> - Terragrunt copies its `*.tf`/`*.hcl` files into the modules of these downloaded repositories. We need to implement a different approach in which the repositories downloaded into `.terragrunt-cache` are left in their original state, to avoid conflicts.
>
> Yea, both of these are valid issues. Any ideas on solutions?
>
> - We can store `.terraform` data separately from the repositories by changing the path with `TF_DATA_DIR`.
> - I cannot say right now, but I'm sure there is a solution.
Alright, keep thinking about it in the background to see if you can come up with something. One thing I stumbled across recently that may be of use: https://github.com/hashicorp/terraform/issues/28309
> Do the suggestions in #2923, especially a content-addressable store similar to pnpm with symlinks, offer a potential solution?
>
> In the case of npm this is justified, since npm itself understands where to get which files from. In our case, Terraform needs to be presented with a regular file structure, and for that we would have to create hundreds or thousands of symlinks. Building such a store is itself a non-trivial task, and considering that modules are not as big as plugins, I'm not sure it's worth spending so much time and resources on it.
Fair enough.
> Modules. When you run `init`, Terraform downloads repos to a `.modules` folder. If you have 50 `terragrunt.hcl` files, each of which has a `source` URL pointing to TF code that contains, say, 10 modules, then when you do `run-all`, you'll end up downloading 50 * 10 = 500 repos—even if all the modules are exactly the same!
>
> Are there any hooks for downloading modules? For example, how does Terraform work in an air-gapped environment?
>
> We can run a private registry locally, but then we need to change the module links to point to this private registry. Perhaps we could do this automatically, after cloning the repos into the cache directory, but this does not solve the disk usage issue, although compared to plugins this may not be so critical.
I suspect disk space isn't as big of a concern with modules, as those are mostly text (whereas providers are binaries in the tens of MBs). The time spent re-downloading (re-cloning) things is probably the bigger concern there.
I'm not sure how useful this is as far as addressing the issue from within terragrunt, but figured I'd share in the sense that there are approaches users could take to address the issues within their own pipelines/workflows... It is complicated though, and also maybe abuses some implementation details of terraform. Definitely welcome the conversation and would appreciate any features within terragrunt that address these issues more directly!
One thing I started doing is maintaining a single terraform config of all the providers and modules that are in use across the whole project. I call it the `vendor` config. This `vendor` config provides no inputs to any module, and is used only for running `terraform init`. For example:
terraform {
required_version = "1.6.5" # terraform-version
required_providers {
aws = {
source = "hashicorp/aws"
version = "5.35.0"
}
}
}
module "foo" {
source = "git::https://url/to/foo/module?ref=1.0.0"
}
module "bar" {
source = "git::https://url/to/bar/module?ref=1.0.0"
}
Then in the "real" terragrunt and terraform configs, the module source
points to the relative path to the .terraform
directory that was initialized. E.g.
source = "../../vendor/.terraform/modules/bar"
source = "../../vendor/.terraform/modules/foo"
Before running any terragrunt commands, we run `terraform -chdir vendor init -backend=false -lock=false` to populate the provider and module cache. Combined with `TF_PLUGIN_CACHE_DIR`, this setup ensures the providers and modules are only downloaded the one time over the network. We also use this setup to manage all module versions in a single place.
(The provider versions in that `vendor` config also update the resulting `.terraform.lock.hcl`, which we use across all stacks in the project. I think it's a nice and clean way to manage the lock file, but I don't think that is quite as relevant to this particular issue.)
> Providers. By default, Terraform downloads providers from scratch every time you run `init`. This isn't a problem for a single module, but if you do `run-all` in a repo that has, say, 50 `terragrunt.hcl` files, each one runs `init` on a TF module, each module downloads an average of, say, 10 providers, then that's 50 * 10 = 500 provider downloads—even if it's the exact same 10 providers across all 50 modules!
>
> If we also enable the plugin cache (which TG could enable via env var automatically when executing `terraform`), I think TF will use a symlink to the cache, rather than copying the whole thing again.
I took a quick look at the terraform code: https://github.com/hashicorp/terraform/blob/main/internal/providercache/installer.go

Briefly, step by step, this is how the terraform plugin cache works:
// Step 1: Which providers might we need to fetch a new version of?
// This produces the subset of requirements we need to ask the provider
// source about. If we're in the normal (non-upgrade) mode then we'll
// just ask the source to confirm the continued existence of what
// was locked, or otherwise we'll find the newest version matching the
// configured version constraint.
// Step 2: Query the provider source for each of the providers we selected
// in the first step and select the latest available version that is
// in the set of acceptable versions.
//
// This produces a set of packages to install to our cache in the next step.
// Step 3: For each provider version we've decided we need to install,
// install its package into our target cache (possibly via the global cache).
So the idea with the locks might work, since Terraform first queries which versions exist in the registry, and then checks which exist in the cache. But there may be issues keeping the connection open: Terraform processes must wait on the private registry until a plugin is downloaded, so a timeout may occur when the user's internet speed is low and the plugin is large.
@lorengordon suggested an interesting idea. Thanks @lorengordon! On the one hand, we don't need any private registries, which eliminates a huge number of issues that we don't yet know about; on the other hand, we would have to implement logic that generates such a config on the fly and replaces the `source` in the other configs. But there is a drawback too: this will not work correctly with modules, since, unlike providers, only a specific version of a module is stored in `.terraform/modules`, and with 50 `terragrunt.hcl` files there may be a case where the module is the same but the versions are different. A solution could be: instead of replacing the `source` in configurations, we can create symlinks for each module in `.terraform/providers`, and thus the plugins will be shared, but the modules will be downloaded individually.
By the way, I don't know if the symlink approach is workable on Windows OS at all. Should I check it or does someone already know the answer? :)
> Can you think of a quick & dirty way to test out these hypotheses and see if this is a viable path forward?
Of course I can, but can you please confirm that this request is still relevant?
> Repos. When you set a `source` URL in TG, it downloads the whole repo into a `.terragrunt-cache` folder. If you have 50 `terragrunt.hcl` files with `source` URLs, and do `run-all`, it will download the repos 50 times—even if all 50 repos are the same!
>
> - We can store `.terraform` data separately from the repositories by changing the path with `TF_DATA_DIR`.
> - I cannot say right now, but I'm sure there is a solution.
>
> Alright, keep thinking about it in the background to see if you can come up with something.
Sure, will do.
> One thing I stumbled across recently that may be of use: hashicorp/terraform#28309
Ah, very interesting. I don't know yet whether this will be useful to us, but I will keep it in mind. Thanks.
> Modules. When you run `init`, Terraform downloads repos to a `.modules` folder. If you have 50 `terragrunt.hcl` files, each of which has a `source` URL pointing to TF code that contains, say, 10 modules, then when you do `run-all`, you'll end up downloading 50 * 10 = 500 repos—even if all the modules are exactly the same!
>
> We can run a private registry locally, but then we need to change the module links to point to this private registry. Perhaps we could do this automatically, after cloning the repos into the cache directory, but this does not solve the disk usage issue, although compared to plugins this may not be so critical.
>
> I suspect disk space isn't as big of a concern with modules, as those are mostly text (whereas providers are binaries in the tens of MBs). The time spent re-downloading (re-cloning) things is probably the bigger concern there.
Agree. What will be the decision?

- If we use a private registry for providers, should we also use a private registry for modules to reduce traffic?
- Just do nothing?
> But there is a drawback too: this will not work correctly with modules, since, unlike providers, only a specific version of a module is stored in `.terraform/modules`, and with 50 `terragrunt.hcl` files there may be a case where the module is the same but the versions are different.
While we see supporting a single version of a module as a bonus, if I had to support multiple versions, I would do it by changing the module label. That label is what maps to the path in the `.terraform/modules` directory. For example:
module "foo_1.0.0" {
source = "git::https://url/to/foo/module?ref=1.0.0"
}
module "foo_1.0.1" {
source = "git::https://url/to/foo/module?ref=1.0.1"
}
and then referencing paths like so:
source = "../../vendor/.terraform/modules/foo_1.0.0"
source = "../../vendor/.terraform/modules/foo_1.0.1"
One place I know of where my approach does fall over for modules, though, is nested modules. If a `vendor` module is itself referencing another remote module, there is no way I've yet figured out within terraform to capture and overwrite that nested remote `source`.
> While we see supporting a single version of a module as a bonus, if I had to support multiple versions, I would do it by changing the module label. That label is what maps to the path in the `.terraform/modules` directory.
Ah, right, this might work :) We need to weigh whether it's worth parsing all the configs to change the names of the modules and their sources, or accepting duplication of modules as a compromise.
> One place I know of where my approach does fall over for modules, though, is nested modules. If a `vendor` module is itself referencing another remote module, there is no way I know of within terraform to capture and overwrite that nested remote `source`.
Ah, right, this idea won't work with nested terraform modules, since each terraform module creates its own `.terraform` folder in its root.
> For example, how does Terraform work in an air-gapped environment?
I also support air-gapped environments. We only use modules that use `source = git::https://...`, and then we mirror the modules to an internally accessible host and use the git url "insteadOf" option to rewrite git urls in our shell configs.

For providers, we host an accessible provider mirror and use the `network_mirror` option in the `.terraformrc` file.
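A minimal sketch of that provider side, assuming a `.terraformrc` on each machine; the internal mirror hostname is a placeholder, not a real URL from this setup:

```hcl
# Illustrative ~/.terraformrc for an air-gapped host; the mirror URL is made up.
provider_installation {
  network_mirror {
    url = "https://tf-mirror.internal.example.com/providers/"
  }
}
```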
> Ah, right, this idea won't work with nested terraform modules, since each terraform module creates its own `.terraform` folder in its root.
Modules are tricky overall anyway, since the terraform network mirror only supports providers, not modules. You'd have to use something like the `host` option of `.terraformrc` as suggested earlier, and a localhost implementation of the module registry. That's probably the easiest. Otherwise, you're parsing through the `.terraform` directory for module blocks, swapping out each remote `source` for a filesystem location, running init, and recursively repeating that until everything is resolved to a local on-filesystem location.
@lorengordon, yeah, you are right! I think we shouldn’t bother so much with modules, since they usually take up several megabytes. I wouldn't touch them.
@brikis98, a suggestion on how to resolve the issue of duplicated providers, based on @lorengordon's suggestion:

- Parse all tf configs to create a single config of all providers.
- Run `terraform init` for this config.
- Run `run-all ...`.

This way, one terraform process downloads all the providers at once, eliminating the concurrency issue, and in the end we have a cache with all the necessary providers.
> Parse all tf configs to create a single config of all providers.
One sticking point with that step, even for providers, is any config that uses a module with a remote source. Remote modules may have provider requirements also. "Parsing all tf configs" to figure out all the providers in use and their version constraints necessarily involves retrieving all remote modules. And so we're now reinventing a lot of the plumbing around `terraform init`.

There may be an optimization available, though, if the `.terraform.lock.hcl` files are checked-in/available locally. Parse all of those for the provider requirements....
> One sticking point with that step, even for providers, is any config that uses a module with a remote source. Remote modules may have provider requirements also. "Parsing all tf configs" to figure out all the providers in use and their version constraints necessarily involves retrieving all remote modules. And so we're now reinventing a lot of the plumbing around `terraform init`.
Ok, but if we also include the modules in the single config, it will also download the providers of those modules, right? After that we can simply throw those modules away.
> Providers. By default, Terraform downloads providers from scratch every time you run `init`. This isn't a problem for a single module, but if you do `run-all` in a repo that has, say, 50 `terragrunt.hcl` files, each one runs `init` on a TF module, each module downloads an average of, say, 10 providers, then that's 50 * 10 = 500 provider downloads—even if it's the exact same 10 providers across all 50 modules!
>
> If we also enable the plugin cache (which TG could enable via env var automatically when executing `terraform`), I think TF will use a symlink to the cache, rather than copying the whole thing again.
>
> I took a quick look at the terraform code: https://github.com/hashicorp/terraform/blob/main/internal/providercache/installer.go
>
> Briefly, step by step, this is how the terraform plugin cache works:
>
> // Step 1: Which providers might we need to fetch a new version of?
> // This produces the subset of requirements we need to ask the provider
> // source about. If we're in the normal (non-upgrade) mode then we'll
> // just ask the source to confirm the continued existence of what
> // was locked, or otherwise we'll find the newest version matching the
> // configured version constraint.
> // Step 2: Query the provider source for each of the providers we selected
> // in the first step and select the latest available version that is
> // in the set of acceptable versions.
> //
> // This produces a set of packages to install to our cache in the next step.
> // Step 3: For each provider version we've decided we need to install,
> // install its package into our target cache (possibly via the global cache).
> So the idea with the locks might work, since Terraform first queries which versions exist in the registry, and then checks which exist in the cache. But there may be issues keeping the connection open: Terraform processes must wait on the private registry until a plugin is downloaded, so a timeout may occur when the user's internet speed is low and the plugin is large.
Did you actually test this out and see a timeout issue? Or are you just guessing that it might be an issue?
> By the way, I don't know if the symlink approach is workable on Windows OS at all. Should I check it or does someone already know the answer? :)
AFAIK, symlinks work more or less as you'd expect on Win 10/11.
> Can you think of a quick & dirty way to test out these hypotheses and see if this is a viable path forward?
>
> Of course I can, but can you please confirm that this request is still relevant?
I'll create a separate comment shortly to summarize the options on the table and address this there.
> Modules. When you run `init`, Terraform downloads repos to a `.modules` folder. If you have 50 `terragrunt.hcl` files, each of which has a `source` URL pointing to TF code that contains, say, 10 modules, then when you do `run-all`, you'll end up downloading 50 * 10 = 500 repos—even if all the modules are exactly the same!
>
> We can run a private registry locally, but then we need to change the module links to point to this private registry. Perhaps we could do this automatically, after cloning the repos into the cache directory, but this does not solve the disk usage issue, although compared to plugins this may not be so critical.
>
> I suspect disk space isn't as big of a concern with modules, as those are mostly text (whereas providers are binaries in the tens of MBs). The time spent re-downloading (re-cloning) things is probably the bigger concern there.
>
> Agree. What will be the decision?
>
> - If we use a private registry for providers, should we also use a private registry for modules to reduce traffic?
> - Just do nothing?
For now, let's gather all ideas, and then decide which ones to test out, and in which order. Reducing provider downloads is definitely a higher priority than the module stuff, so that should be the first thing to focus on.
> I'm not sure how useful this is as far as addressing the issue from within terragrunt, but figured I'd share in the sense that there are approaches users could take to address the issues within their own pipelines/workflows... It is complicated though, and also maybe abuses some implementation details of terraform. Definitely welcome the conversation and would appreciate any features within terragrunt that address these issues more directly!
>
> One thing I started doing is maintaining a single terraform config of all the providers and modules that are in use across the whole project. I call it the `vendor` config. This `vendor` config provides no inputs to any module, and is used only for running `terraform init`. For example:
>
> terraform {
>   required_version = "1.6.5" # terraform-version
>
>   required_providers {
>     aws = {
>       source  = "hashicorp/aws"
>       version = "5.35.0"
>     }
>   }
> }
>
> module "foo" {
>   source = "git::https://url/to/foo/module?ref=1.0.0"
> }
>
> module "bar" {
>   source = "git::https://url/to/bar/module?ref=1.0.0"
> }
>
> Then in the "real" terragrunt and terraform configs, the module `source` points to the relative path to the `.terraform` directory that was initialized. E.g.
>
> source = "../../vendor/.terraform/modules/bar"
> source = "../../vendor/.terraform/modules/foo"
>
> Before running any terragrunt commands, we run `terraform -chdir vendor init -backend=false -lock=false` to populate the provider and module cache. Combined with `TF_PLUGIN_CACHE_DIR`, this setup ensures the providers and modules are only downloaded the one time over the network. We also use this setup to manage all module versions in a single place.
>
> (The provider versions in that `vendor` config also update the resulting `.terraform.lock.hcl`, which we use across all stacks in the project. I think it's a nice and clean way to manage the lock file, but I don't think that is quite as relevant to this particular issue.)
Thanks for sharing this approach! Definitely a cool idea.
As pointed out in subsequent comments, this approach doesn't quite seem to handle nested modules... And we have a lot of those. So it feels promising, but not quite complete.
> So the idea with the locks might work, since Terraform first queries which versions exist in the registry, and then checks which exist in the cache. But there may be issues keeping the connection open: Terraform processes must wait on the private registry until a plugin is downloaded, so a timeout may occur when the user's internet speed is low and the plugin is large.
>
> Did you actually test this out and see a timeout issue? Or are you just guessing that it might be an issue?
So far, only guesses.
> Thanks for sharing this approach! Definitely a cool idea.
>
> As pointed out in subsequent comments, this approach doesn't quite seem to handle nested modules... And we have a lot of those. So it feels promising, but not quite complete.
I could be wrong, but doesn't `terraform init` ensure that all providers are downloaded, even for nested (or doubly nested) modules? (see comment)
OK, let me summarize the ideas on the table so far:

Reducing bandwidth and disk space usage with providers is the highest priority and should be the thing we focus on first.

Idea 1: network mirror running on localhost

As described here:

- TG runs a server on localhost.
- TG configures that server as a `network_mirror` for downloading providers.
- This server does in-memory locking to ensure there are no issues with downloading providers concurrently.
- TG also enables plugin caching. This ensures each plugin is only ever downloaded once.

There may be an issue here with timeouts related to step (4), so we'll have to test and see if this is workable.
Loosely based on @lorengordon's approach, as @levkohimins wrote up here:

- Parse all tf configs to create a single config of all providers.
- Run `terraform init` for this config.
- Run `run-all ...`.

This is promising, but I've crossed it out because this approach doesn't handle nested modules. That is, the parsing in step (1) would only find the top-level modules, but after running `init` on those, they may contain nested modules, which define further providers and other nested modules.
This is a slight tweak on idea 2:

- Parse all the Terraform code for `required_providers` and `module` blocks.
- Copy all `required_providers` blocks into a single `main.tf`.
- Copy all `module` blocks into that single `main.tf`, but (a) only copy the `source` and `version` parameters from within the body of a `module` block, ignoring all other parameters so we don't have to deal with variables, resources, etc., and (b) give each `module` block a unique ID, to ensure they don't clash.
- Run `terraform get`. This will just download all the modules, including nested modules, into the `.terraform` folder.
- Walk from the `main.tf` to the underlying code in `.terraform`, and then for each module in `.terraform`, repeat the process recursively to find all nested modules. As you do this walk, parse out all `required_providers` blocks and copy them into a totally new `main.tf`.
- Run `init` on that totally new `main.tf`.
- Run `run-all ...`

This seemed like an approach that would allow us to fix the weaknesses of idea 2, but as I wrote it out, I realized this approach also has problems:

- Not all providers are defined in `required_providers`. For providers in the TF registry, it's enough to include a `provider` block or any resource or data source, and Terraform will automatically figure out which provider you want and download it. So just extracting `required_providers` and `module` blocks is not enough!

When I realized problem 1, I thought it might be solvable by pulling all resources, data sources, `provider` blocks, etc. into our mega `main.tf`, but once I saw problem 2, I more or less gave up. This approach feels like a dead end. We'd be recreating so much TF logic that we're almost certain to have weird bugs and difficulty maintaining this code.
Unless anyone has other ideas to consider, I recommend that we build the smallest prototype of idea 1 that we can. In fact, perhaps we should build a tiny web server that just hangs indefinitely (doesn't actually download providers or do any locking or anything else) solely to see if the timeout thing is going to be a real problem. If it is, we'll need new ideas. If not, we can proceed with having the prototype actually do some work.
`source` URLs

The next priority is the `source` URLs in TG, which are downloaded multiple times, and for which we do a full clone. We should only invest time in this after making improvements to problem 1 above.

I'd recommend a shared cache, e.g., in `~/.terragrunt/cache`. In this cache, we would only ever download a single unique `source` URL just once. Then, when TG is running, what it puts into `~/.terragrunt-cache` is a bunch of symlinks pointing to `~/.terragrunt/cache`, plus any files it copies from the current working dir and any generated files. Generating a symlink for every file in the `source` URL is probably a bit tedious, but having to download each repo only once saves time and bandwidth, and I'm guessing the symlinks will save some disk space over newly downloaded copies.

`module` downloads

This is the next priority: Terraform re-downloading the same modules into `.terraform` folders over and over again. We should only invest time in this after making improvements to problems 1 and 2 above.

I haven't heard any working ideas for how to improve on this yet, so please toss out ideas.
The final priority is explaining best practices for using TG in a place with ephemeral storage, such as K8S. We should only invest time in this after making improvements to problems 1, 2, and 3 above.
I think this is mainly documenting the need to use persistent disk stores.
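To make that documentation concrete, here is a minimal sketch of giving a CI runner a persistent provider cache on K8S, written with the Terraform `kubernetes` provider; the names, size, and mount point are assumptions, not an agreed-upon design:

```hcl
# Sketch only: a persistent volume claim a runner pod could mount at the directory
# TF_PLUGIN_CACHE_DIR points to, so the provider cache survives pod restarts.
# All names and sizes below are illustrative assumptions.
terraform {
  required_providers {
    kubernetes = {
      source = "hashicorp/kubernetes"
    }
  }
}

resource "kubernetes_persistent_volume_claim" "tf_plugin_cache" {
  metadata {
    name = "tf-plugin-cache"
  }

  spec {
    access_modes = ["ReadWriteOnce"]

    resources {
      requests = {
        storage = "10Gi"
      }
    }
  }
}
```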
> doesn't `terraform init` ensure that all providers are downloaded for nested or doubly nested modules?
Yes, it does. The "single config" option, using what I called the `vendor` config, does retrieve all providers. And it will generate a lock file that contains all the provider constraints.

It also retrieves all modules, including nested ones. The problem with nested modules is specifically those with remote sources. If the source is local within the nested module, no problem, the local relative path is fine. But a remote source will re-download the remote module when `init` is executed in the "real" config.

However, one thing that just occurred to me to address that would be to pre-populate the `.terraform/modules` directory of the "real" config, using the content previously retrieved by the `vendor` config. Basically copy the modules from one directory to another, or maybe symlink if possible. Then `terraform init` in the "real" config would see that all the remote modules are already present. Unfortunately, lining up the module label names and the directory names would take quite a bit more parsing....
Regarding pre-fetching provider binaries, what if we just have an opt-in configuration (like an environment variable named `PRE_FETCH_BINARIES`) to support pre-fetching the provider binaries with a naive assumption that the `providercache` will never change, then create an RFC requesting that OpenTofu move the `providercache` package out of `internal`?

If OpenTofu accepts the RFC, we can switch to an opt-out configuration, and rely on the public package to handle the logic for pre-fetching provider binaries in OpenTofu, while using our naive custom logic for Terraform.

Would this handle your concerns regarding the potentially changing logic in the internal `providercache` package, @levkohimins?
@brikis98

Would it be a simpler first pass to only support concurrent pre-fetching for directories that contain a `.terraform.lock.hcl` file? I've seen the file lock providers for nested modules in addition to the top-most module. Those can be safely downloaded concurrently, as the `provider`s in the `.terraform.lock.hcl` file can be merged, then deduplicated, to provide a list of plugins to pre-fetch. For timeouts that may occur due to bandwidth issues, we can make the max concurrent plugin downloads and download retries configurable for low-bandwidth environments.

Directories without `.terraform.lock.hcl` files can be `init`-ed in series, which should be fast on repeat runs (and on runs after the initial `init` that downloads the latest of a given provider plugin) if `TF_PLUGIN_CACHE_DIR` is populated and the directory is persisted between runs.
> Would this handle your concerns regarding the potentially changing logic in the internal `providercache` package, @levkohimins?

@yhakbar, the concern is not only that the code we are interested in is located in the `internal/` directory, but also that if the provider loading logic changes, we will have to rewrite the code in Terragrunt as well in any case. I think such changes are unlikely to completely break provider loading, since the developers always try to maintain compatibility with older versions, but either way it puts more of a maintenance burden on us than simply running the `terraform init` command.
> Idea 1: network mirror running on localhost
>
> As described here:
>
> - TG runs a server on localhost.
> - TG configures that server as a `network_mirror` for downloading providers.
> - This server does in-memory locking to ensure there are no issues with downloading providers concurrently.
> - TG also enables plugin caching. This ensures each plugin is only ever downloaded once.
>
> There may be an issue here with timeouts related to step (4), so we'll have to test and see if this is workable.
@brikis98,

I found out that providers must be present in `.terraform.lock.hcl`, otherwise Terraform re-downloads them even when they are already present in the cache. This means that, one way or another, Terraform functionality must be partially implemented inside Terragrunt in order to generate this file, otherwise it simply will not work. Here is what a lock entry looks like:
provider "registry.terraform.io/hashicorp/aws" {
version = "5.36.0"
constraints = "5.36.0"
hashes = [
"h1:54QgAU2vY65WZsiZ9FligQfIf7hQUvwse4ezMwVMwgg=",
"zh:0da8409db879b2c400a7d9ed1311ba6d9eb1374ea08779eaf0c5ad0af00ac558",
"zh:1b7521567e1602bfff029f88ccd2a182cdf97861c9671478660866472c3333fa",
"zh:1cab4e6f3a1d008d01df44a52132a90141389e77dbb4ec4f6ac1119333242ecf",
"zh:1df9f73595594ce8293fb21287bcacf5583ae82b9f3a8e5d704109b8cf691646",
"zh:2b5909268db44b6be95ff6f9dc80d5f87ca8f63ba530fe66723c5fdeb17695fc",
"zh:37dd731eeb0bc1b20e3ec3a0cb5eb7a730edab425058ff40f2243438acc82830",
"zh:3e94c76a2b607a1174d10f5712aed16cb32216ac1c91bd6f21749d61a14045ac",
"zh:40e6ba3184d2d3bf283a07feed8b79c1bbc537a91215cac7b3521b9ccb3e503e",
"zh:67e52353fea47eb97825f6eb6fddd1935e0ff3b53a8861d23a70c2babf83ae51",
"zh:6d2e2f390e0c7b2cd2344b1d5d6eec8a1c11cf35d19f1d6f341286f2449e9e10",
"zh:7005483c43926800fad5bb18e27be883dac4339edb83a8f18ccdc7edf86fafc2",
"zh:7073fa7ccaa9b07c2cf7b24550a90e11f4880afd5c53afd51278eff0154692a0",
"zh:9b12af85486a96aedd8d7984b0ff811a4b42e3d88dad1a3fb4c0b580d04fa425",
"zh:a6d48620e526c766faec9aeb20c40a98c1810c69b6699168d725f721dfe44846",
"zh:e29b651b5f39324656f466cd24a54861795cc423a1b58372f4e1d2d2112d10a0",
]
}
About the connection timeout concern: Terraform terminates the connection if the registry does not respond within 10-15 seconds and exits with the error `Error: Failed to install provider`. So it turns out that the idea with locks is not feasible. In any case, the idea of a private registry was not very promising, since issues could occur with firewalls, and we would also have to check that the port the registry listens on is not already in use, etc.

A workaround could be to skip the private registry entirely and run `terraform init` non-parallel/sequentially for all terragrunt.hcl files before the target command, which can then run in parallel, but we still have to generate the `.terraform.lock.hcl` files.

🤷♂️ Honestly, I don't see any way other than implementing the fetching of providers and modules in Terragrunt itself to ensure maximum performance and predictable behavior.
@brikis98 The module situation is not as dire as you might expect, though there are some gross caveats that should be paid attention to.

> you'll end up downloading 50 * 10 = 500 repos—even if all the modules are exactly the same!

Terraform and OpenTofu are smart enough to only clone a module once during the same init step. The first time it encounters a source, it downloads it as expected. Every time after that (during that same init command), it copies from the first directory.

I'd also highly recommend reading https://github.com/opentofu/opentofu/issues/1086, as it explains how cursed the handling of remote sources in Terraform / OpenTofu is. The tl;dr is that modules are malleable after they are cloned and sometimes share the same directory, though it is impossible to know as a module author.
Hi, I definitely second this issue; it comes up when using TG at scale.

> A workaround could be to skip the private registry entirely and run terraform init non-parallel/sequentially for all terragrunt.hcl files before the main command, which can then run in parallel, but we still have to generate .terraform.lock.hcl files.

That's exactly what I tried, but it significantly increases TG deployment time...
> Idea 1: network mirror running on localhost
>
> As described here:
>
> - TG runs a server on localhost.
> - TG configures that server as a `network_mirror` for downloading providers.
> - This server does in-memory locking to ensure there are no issues with downloading providers concurrently.
> - TG also enables plugin caching. This ensures each plugin is only ever downloaded once.
>
> There may be an issue here with timeouts related to step (4), so we'll have to test and see if this is workable.
>
> @brikis98, I found out that providers must be present in `.terraform.lock.hcl`, otherwise Terraform re-downloads them even when they are already present in the cache. This means that, one way or another, Terraform functionality must be partially implemented inside Terragrunt in order to generate this file, otherwise it simply will not work. Here is what a lock entry looks like:
>
> provider "registry.terraform.io/hashicorp/aws" {
>   version     = "5.36.0"
>   constraints = "5.36.0"
>   hashes = [
>     "h1:54QgAU2vY65WZsiZ9FligQfIf7hQUvwse4ezMwVMwgg=",
>     "zh:0da8409db879b2c400a7d9ed1311ba6d9eb1374ea08779eaf0c5ad0af00ac558",
>     "zh:1b7521567e1602bfff029f88ccd2a182cdf97861c9671478660866472c3333fa",
>     "zh:1cab4e6f3a1d008d01df44a52132a90141389e77dbb4ec4f6ac1119333242ecf",
>     "zh:1df9f73595594ce8293fb21287bcacf5583ae82b9f3a8e5d704109b8cf691646",
>     "zh:2b5909268db44b6be95ff6f9dc80d5f87ca8f63ba530fe66723c5fdeb17695fc",
>     "zh:37dd731eeb0bc1b20e3ec3a0cb5eb7a730edab425058ff40f2243438acc82830",
>     "zh:3e94c76a2b607a1174d10f5712aed16cb32216ac1c91bd6f21749d61a14045ac",
>     "zh:40e6ba3184d2d3bf283a07feed8b79c1bbc537a91215cac7b3521b9ccb3e503e",
>     "zh:67e52353fea47eb97825f6eb6fddd1935e0ff3b53a8861d23a70c2babf83ae51",
>     "zh:6d2e2f390e0c7b2cd2344b1d5d6eec8a1c11cf35d19f1d6f341286f2449e9e10",
>     "zh:7005483c43926800fad5bb18e27be883dac4339edb83a8f18ccdc7edf86fafc2",
>     "zh:7073fa7ccaa9b07c2cf7b24550a90e11f4880afd5c53afd51278eff0154692a0",
>     "zh:9b12af85486a96aedd8d7984b0ff811a4b42e3d88dad1a3fb4c0b580d04fa425",
>     "zh:a6d48620e526c766faec9aeb20c40a98c1810c69b6699168d725f721dfe44846",
>     "zh:e29b651b5f39324656f466cd24a54861795cc423a1b58372f4e1d2d2112d10a0",
>   ]
> }
>
> About the connection timeout concern: Terraform terminates the connection if the registry does not respond within 10-15 seconds and exits with the error `Error: Failed to install provider`. So it turns out that the idea with locks is not feasible. In any case, the idea of a private registry was not very promising, since issues could occur with firewalls, and we would also have to check that the port the registry listens on is not already in use, etc.
>
> A workaround could be to skip the private registry entirely and run `terraform init` non-parallel/sequentially for all terragrunt.hcl files before the target command, which can then run in parallel, but we still have to generate the `.terraform.lock.hcl` files.
>
> 🤷♂️ Honestly, I don't see any way other than implementing the fetching of providers and modules in Terragrunt itself to ensure maximum performance and predictable behavior.
Thanks for looking into this.
Here's one more silly idea to try:
network_mirror
, as before.network_mirror
, we do not proxy the files, we just return a 4xx or 5xx. So Terraform will fail. That's OK, we can hide this failure message from the user.run-all
commands as necessary and everything should run from the cache.In short, we're letting Terraform figure out what providers it needs, and the network_mirror
is just there to get that information from Terraform. We can then use that to efficiently fetch all the providers we need, and then let Terraform run off the cache.
I'm skipping over a bunch of details, but at a high level, WDTY?
> Thanks for looking into this.
>
> Here's one more silly idea to try:
>
> - TG runs a server on localhost and configures it as a `network_mirror`, as before.
> - As that mirror gets requests, it forwards them to the real underlying registry and proxies through that registry's response. So responses are just as fast as normal, so we don't hit timeout issues.
> - When Terraform tries to download the actual providers from our localhost `network_mirror`, we do not proxy the files, we just return a 4xx or 5xx. So Terraform will fail. That's OK, we can hide this failure message from the user.
> - However, the localhost server has recorded all the providers that Terraform tried to download... So now we have the full list of all requested providers. We de-dupe the list, get the whole thing downloaded concurrently and added to the cache.
> - Now we can run the `run-all` commands as necessary and everything should run from the cache.
>
> In short, we're letting Terraform figure out what providers it needs, and the `network_mirror` is just there to get that information from Terraform. We can then use that to efficiently fetch all the providers we need, and then let Terraform run off the cache.
>
> I'm skipping over a bunch of details, but at a high level, WDYT?
Interesting idea! But we still have to generate the `.terraform.lock.hcl` file before running the terraform command, otherwise, if our `terragrunt.hcl` files have multiple identical providers, we will run into the following issues:

- It will not take into account already existing providers in the cache and will download them again and again.
- One instance overwrites an existing file in the cache while other instances may already be using it, and then an error like this occurs:
╷
│ Error: Failed to install provider from shared cache
│
│ Error while importing hashicorp/google v5.9.0 from the shared cache
│ directory: the provider cache at .terraform/providers has a copy of
│ registry.terraform.io/hashicorp/google 5.9.0 that doesn't match any of the
│ checksums recorded in the dependency lock file.
╵
This is why we must specify the `--terragrunt-parallelism 1` flag when using the terraform cache, at least for now.

To generate this `.terraform.lock.hcl` file, I think we need to port some terraform logic. Not sure how much, but I can figure it out. WDYT?
> Thanks for looking into this. Here's one more silly idea to try:
>
> - TG runs a server on localhost and configures it as a `network_mirror`, as before.
> - As that mirror gets requests, it forwards them to the real underlying registry and proxies through that registry's response. So responses are just as fast as normal, so we don't hit timeout issues.
> - When Terraform tries to download the actual providers from our localhost `network_mirror`, we do not proxy the files, we just return a 4xx or 5xx. So Terraform will fail. That's OK, we can hide this failure message from the user.
> - However, the localhost server has recorded all the providers that Terraform tried to download... So now we have the full list of all requested providers. We de-dupe the list, get the whole thing downloaded concurrently and added to the cache.
> - Now we can run the `run-all` commands as necessary and everything should run from the cache.
>
> In short, we're letting Terraform figure out what providers it needs, and the `network_mirror` is just there to get that information from Terraform. We can then use that to efficiently fetch all the providers we need, and then let Terraform run off the cache. I'm skipping over a bunch of details, but at a high level, WDYT?
>
> Interesting idea! But we still have to generate the `.terraform.lock.hcl` file before running the terraform command, otherwise, if our `terragrunt.hcl` files have multiple identical providers, we will run into the following issues:
>
> - It will not take into account already existing providers in the cache and will download them again and again.
> - One instance overwrites an existing file in the cache while other instances may already be using it, and then an error like this occurs:
>
> ╷
> │ Error: Failed to install provider from shared cache
> │
> │ Error while importing hashicorp/google v5.9.0 from the shared cache
> │ directory: the provider cache at .terraform/providers has a copy of
> │ registry.terraform.io/hashicorp/google 5.9.0 that doesn't match any of the
> │ checksums recorded in the dependency lock file.
> ╵
>
> This is why we must specify the `--terragrunt-parallelism 1` flag when using the terraform cache, at least for now.
>
> To generate this `.terraform.lock.hcl` file, I think we need to port some terraform logic. Not sure how much, but I can figure it out. WDYT?
Let's assume for now that for any module without a lock file, we run init
sequentially to generate it.
If the network_mirror
approach works at all, then perhaps we can generate the lock file as part of that same process.
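For what it's worth, a rough sketch of that sequential fallback (not actual Terragrunt code; the module list is hard-coded and the choice of terraform providers lock to produce the lock file is an assumption):

```go
// Sketch only: sequentially generate any missing .terraform.lock.hcl files
// before a parallel run, as suggested above. Module discovery and flag
// handling are deliberately simplified.
package main

import (
	"log"
	"os"
	"os/exec"
	"path/filepath"
)

func ensureLockFile(moduleDir string) error {
	lockPath := filepath.Join(moduleDir, ".terraform.lock.hcl")
	if _, err := os.Stat(lockPath); err == nil {
		return nil // lock file already present, nothing to do
	}
	// `terraform providers lock` writes .terraform.lock.hcl for this module.
	cmd := exec.Command("terraform", "providers", "lock")
	cmd.Dir = moduleDir
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Hypothetical list of module directories discovered from terragrunt.hcl files.
	modules := []string{"./vpc", "./iam", "./eks"}
	for _, dir := range modules {
		if err := ensureLockFile(dir); err != nil { // strictly one at a time
			log.Fatalf("generating lock file in %s: %v", dir, err)
		}
	}
}
```

The idea being that once every module has a lock file, the parallel run should no longer hit the checksum-mismatch error shown above.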
Resolved in v0.56.4 release. Make sure to read Provider Caching.
@levkohimins I think that only resolved the provider thing. There are many other tasks in this bug, so going to reopen.
Joining the party because we are after a solution to the problem described here.
Today I tested the latest Terragrunt (v0.57.2) running in an Atlantis setup and I'm having some mixed results, which I think are due to the cache server being spun up for each thread and potentially causing a race condition.
Have you considered offering the cache server as a standalone service that I can spin up on instance boot and share among all processes?
Thank you for working on this!
@amontalban, That's true. Each Terragrunt instance runs its own cache server. We use file locking to prevent conflicts when multiple Terragrunt instances try to cache the same provider. What do you mean by
I'm having some mixed results
Thinking out loud: for a standalone server, we would need connections (e.g., gRPC) between the Terragrunt instances and the Terragrunt Cache Server itself, so the instances can receive notifications from the cache server when the cache is ready.
@brikis98, I'd be interested in what you think about this.
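To make the "notify when the cache is ready" part concrete, here is a thinking-out-loud sketch of a standalone cache server, using plain HTTP long-polling instead of gRPC purely for brevity (the endpoint name, port, and wire format are all made up):

```go
// Sketch only: a shared cache server that lets Terragrunt instances block
// until a given provider has been written into the cache.
package main

import (
	"fmt"
	"net/http"
	"sync"
)

type cacheServer struct {
	mu    sync.Mutex
	ready map[string]chan struct{} // provider -> closed once cached
}

func (s *cacheServer) channelFor(provider string) chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch, ok := s.ready[provider]
	if !ok {
		ch = make(chan struct{})
		s.ready[provider] = ch
	}
	return ch
}

// markReady is called by the caching code once the provider is on disk.
func (s *cacheServer) markReady(provider string) {
	close(s.channelFor(provider))
}

// wait blocks the HTTP request until the requested provider has been cached.
func (s *cacheServer) wait(w http.ResponseWriter, r *http.Request) {
	provider := r.URL.Query().Get("provider")
	select {
	case <-s.channelFor(provider):
		fmt.Fprintln(w, "cached")
	case <-r.Context().Done():
		http.Error(w, "client went away", http.StatusRequestTimeout)
	}
}

func main() {
	s := &cacheServer{ready: map[string]chan struct{}{}}
	http.HandleFunc("/v1/wait", s.wait)
	go s.markReady("registry.terraform.io/hashicorp/aws/5.45.0") // simulate a cache fill
	_ = http.ListenAndServe("127.0.0.1:9090", nil)
}
```

Each Terragrunt instance would ask the shared server for a provider and block on /v1/wait until it is cached, instead of every instance spinning up its own server.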
What do you mean by
I'm having some mixed results
Hi @levkohimins!
Some of the plans work and some don't on the same Atlantis PR, and I think it is because all threads (we have Atlantis configured to run up to 10 in parallel) are trying to lock/download providers at the same time. For example, a working one:
time=2024-04-18T22:43:57Z level=info msg=Terragrunt Cache server is listening on 127.0.0.1:36425
time=2024-04-18T22:43:57Z level=info msg=Start Terragrunt Cache server
time=2024-04-18T22:43:59Z level=info msg=Downloading Terraform configurations from git::ssh://git@github.com/terraform-aws-modules/terraform-aws-iam.git?ref=v5.30.0 into /home/atlantis/.cache/terragrunt/modules/4eoLS_PnCDG--fz0b0bUcb6_sjY/Z_nexO2qqCg5RPmJa_gkAX4ynAY
time=2024-04-18T22:44:13Z level=info msg=Provider "registry.terraform.io/hashicorp/aws/5.45.0" is cached
Initializing the backend...
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
Initializing provider plugins...
- Finding hashicorp/aws versions matching ">= 5.43.0"...
A non-working one:
time=2024-04-18T23:16:12Z level=info msg=Terragrunt Cache server is listening on 127.0.0.1:39541
time=2024-04-18T23:16:12Z level=info msg=Start Terragrunt Cache server
time=2024-04-18T23:16:16Z level=info msg=Downloading Terraform configurations from git::ssh://git@github.com/terraform-aws-modules/terraform-aws-iam.git?ref=v5.37.1 into /home/atlantis/.cache/terragrunt/modules/CZJOkESJyj2a2n17Ph1lBtPy7p8/Z_nexO2qqCg5RPmJa_gkAX4ynAY prefix=[/home/atlantis/.atlantis/repos/ACME/terraform/838/provider_aws_dev__global_iam_roles_sre-role/provider/aws/security/_global/iam/policies/sre-assume-role]
time=2024-04-18T23:16:16Z level=info msg=Downloading Terraform configurations from git::ssh://git@github.com/ACME/tf-aws-iam-saml-provider.git?ref=v1.0.1 into /home/atlantis/.cache/terragrunt/modules/1-d3ZTASqfksjn_orsKPDUGD7ks/lPrDQ0wT1dtsjgNOnzZoq0oCtyE prefix=[/home/atlantis/.atlantis/repos/ACME/terraform/838/provider_aws_dev__global_iam_roles_sre-role/provider/aws/security/_global/iam/identity_providers/X]
╷
│ Error: Failed to query available provider packages
│
│ Could not retrieve the list of available versions for provider
│ hashicorp/aws: host registry.terraform.io rejected the given authentication
│ credentials
Another error:
time=2024-04-18T22:43:57Z level=info msg=Terragrunt Cache server is listening on 127.0.0.1:38393
time=2024-04-18T22:43:57Z level=info msg=Start Terragrunt Cache server
time=2024-04-18T22:43:58Z level=info msg=Downloading Terraform configurations from git::ssh://git@github.com/terraform-aws-modules/terraform-aws-iam.git?ref=v5.30.0 into /home/atlantis/.cache/terragrunt/modules/BBCsdcAtPBWyKxcKxrWGSmPnyMI/Z_nexO2qqCg5RPmJa_gkAX4ynAY
Error: Could not retrieve providers for locking
Terraform failed to fetch the requested providers for cache_provider in order
to calculate their checksums: some providers could not be installed:
- registry.terraform.io/hashicorp/aws: host registry.terraform.io rejected
the given authentication credentials.
And we have the following settings:
TERRAGRUNT_DOWNLOAD="$HOME/.cache/terragrunt/modules"
TERRAGRUNT_FETCH_DEPENDENCY_OUTPUT_FROM_STATE="true"
TERRAGRUNT_PROVIDER_CACHE=1
TERRAGRUNT_NON_INTERACTIVE="true"
TERRAGRUNT_INCLUDE_EXTERNAL_DEPENDENCIES="true"
Let me know if you want me to open an issue for this.
Thanks!
Hi @amontalban, thanks for the detailed explanation. Terragrunt Provider Cache is concurrency safe. Based on your log, I see an authentication issue.
rejected the given authentication credentials
Please create a new issue and include the terraform version and your CLI Configuration, and also check if you are using any credentials. Thanks.
Thanks, I will open an issue then.
Regarding the Terragrunt Provider Cache being concurrency safe: I understand it is if it is used in a single terragrunt process, like a terragrunt run-all plan/apply or terragrunt plan/apply, but what happens if I have multiple terragrunt processes using the same directory at the same time (this is what Atlantis does in the background)?
Thanks!
By safe concurrency I meant multiple Terragrunt processes running at the same time.
@levkohimins is it possible to mount a volume and share cache between multiple Kubernetes pods?
You can specify a different cache directory with --terragrunt-provider-cache-dir.
Does it mean that if I do that, I will have a problem? Should each job have its own cache?
The Terragrunt Provider Cache is concurrency safe, so you can run multiple Terragrunt processes with one shared cache directory. The only requirement is that the file system must support File locking.
If anyone like me is looking to use this with AWS EFS, it should work, since EFS supports flock.
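For anyone curious what that file locking amounts to in practice, here is a minimal POSIX-only sketch (not the actual Terragrunt implementation; the lock-file path and naming are made up) of guarding a shared cache write with flock:

```go
// Sketch only: advisory flock-based locking around a shared cache write.
// One process holds the exclusive lock while it writes a provider into the
// cache; other processes block until the lock is released.
package main

import (
	"log"
	"os"
	"syscall"
)

func withProviderLock(lockPath string, fn func() error) error {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()

	// Block until we hold an exclusive lock on the lock file.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)

	return fn()
}

func main() {
	err := withProviderLock("/tmp/terragrunt-cache/hashicorp-aws-5.45.0.lock", func() error {
		log.Println("downloading provider into the shared cache...")
		return nil // place the provider files here
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

This only works when the underlying filesystem honors flock, which is why the EFS note above matters.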
Hi @brikis98 @levkohimins, from TG 0.55.20 and 0.55.19 up to the latest version, we are having trouble in our terragrunt execution environment while trying to download the terraform source URLs.
While downloading the terraform source URLs, https:// is getting replaced by file:/// and the workflow fails to download the module zips.
It was working fine until TG 0.55.13. Because of this issue, we are not able to use any of the recently delivered features. Can you please look into this as a priority?
Hey there! I have a question regarding how to handle multiple platforms with lock files in order to reduce disk & bandwidth usage. It seems to me that all the caching functionality only works for your own platform.
Hi @RaagithaGummadi, this issue is not related to this topic. If the issue still exists, please let me know in #3141.
Hi @tomaaron, could you please describe in detail how you create lock files for multiple platforms in your workflow when you do not use the Terragrunt Provider Cache feature?
That's actually what I'm trying to figure out. So far I have unsuccessfully tried the following:
terragrunt run-all providers lock -platform=linux_amd64 -platform=darwin_arm64 --terragrunt-provider-cache
But this seems to download the providers over and over again.
Yeah it won't work. I'll look into what we can do to make this work through the Terragrunt Provider Cache.
The provider cache logic is working great for us in v0.59.3.
I wanted to quickly touch on an observation regarding modules. I understand the complexity regarding modules sourced from inside the tf code, but what about modules sourced in the terragrunt.hcl
via terraform.source
?
Let's say you are deploying many modules from the same repository (i.e., a central module repository your organization uses to manage all its IaC). This is the current folder structure generated by terragrunt after a run-all:
5.1M .terragrunt-cache/3rix7RH7iFg33ODrD1gLEVT22yo/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/3rix7RH7iFg33ODrD1gLEVT22yo
5.1M .terragrunt-cache/MmA0CPBHYpTye9HsWY4ZlZGLwFw/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/MmA0CPBHYpTye9HsWY4ZlZGLwFw
5.1M .terragrunt-cache/d81NMpsuvNM6qhA1rZ-uav4dFVk/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/d81NMpsuvNM6qhA1rZ-uav4dFVk
5.1M .terragrunt-cache/UgqUxk_-s8ObrE4DgusULaAIED0/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/UgqUxk_-s8ObrE4DgusULaAIED0
5.1M .terragrunt-cache/itgcNqUDI4z7w_EZhBbdGiEpuME/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/itgcNqUDI4z7w_EZhBbdGiEpuME
5.1M .terragrunt-cache/EUp73Y6F_pIUdcCjSz39JqPTf8Y/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/EUp73Y6F_pIUdcCjSz39JqPTf8Y
5.1M .terragrunt-cache/Nn6GCqZomEqW-I0kI-EhLgkuomY/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/Nn6GCqZomEqW-I0kI-EhLgkuomY
5.1M .terragrunt-cache/HZnOUpx70Z2weiSaofmUdPh0__s/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/HZnOUpx70Z2weiSaofmUdPh0__s
5.1M .terragrunt-cache/gzT7dpbdCN4CsF2-edwgdoqACuA/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/gzT7dpbdCN4CsF2-edwgdoqACuA
5.1M .terragrunt-cache/YRE-N4jGrdpJ0FtLnuUyVSo7Wzc/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/YRE-N4jGrdpJ0FtLnuUyVSo7Wzc
5.2M .terragrunt-cache/MWrb3OuAKhpDTESIuBhmlq3NSQA/rMGBvHP1LfFG1CysJakDOKD_N7E
5.2M .terragrunt-cache/MWrb3OuAKhpDTESIuBhmlq3NSQA
5.1M .terragrunt-cache/Qy1CR7BsXyAWUjoosWLEASaTxhs/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/Qy1CR7BsXyAWUjoosWLEASaTxhs
5.1M .terragrunt-cache/HDzArxE1BZrqfWzwUhAMR7oEgn0/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/HDzArxE1BZrqfWzwUhAMR7oEgn0
5.1M .terragrunt-cache/WC0gQMfp_QDsGXP4_f2JtXSkgzc/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/WC0gQMfp_QDsGXP4_f2JtXSkgzc
5.1M .terragrunt-cache/Yy_-DZCGX_UOfD-pbPG761V125w/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/Yy_-DZCGX_UOfD-pbPG761V125w
5.1M .terragrunt-cache/YCdKLNQqedXm9ub6jegJIFld5nk/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/YCdKLNQqedXm9ub6jegJIFld5nk
5.2M .terragrunt-cache/YtgIqn3gMfAYH6cYUcQxDJX0zDM/rMGBvHP1LfFG1CysJakDOKD_N7E
5.2M .terragrunt-cache/YtgIqn3gMfAYH6cYUcQxDJX0zDM
5.1M .terragrunt-cache/Mq8YwWL9cfsuq7Il2s1tx61d1uE/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/Mq8YwWL9cfsuq7Il2s1tx61d1uE
5.1M .terragrunt-cache/_IZun15G6UQ7ISTOOoIh03ssz94/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/_IZun15G6UQ7ISTOOoIh03ssz94
5.2M .terragrunt-cache/7sIBtlwoqWetYkSfAAz4WVri_Ww/rMGBvHP1LfFG1CysJakDOKD_N7E
5.2M .terragrunt-cache/7sIBtlwoqWetYkSfAAz4WVri_Ww
5.1M .terragrunt-cache/KjklxglTSshA5EobgBlPhS6nxqM/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/KjklxglTSshA5EobgBlPhS6nxqM
5.1M .terragrunt-cache/mkPxI_UU-AY8IzaeHCyy8wr7lUo/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/mkPxI_UU-AY8IzaeHCyy8wr7lUo
5.1M .terragrunt-cache/xq9QbIDfQA5Um36Gg64Xe65B1eE/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/xq9QbIDfQA5Um36Gg64Xe65B1eE
509K .terragrunt-cache/Oe4bddeyp6czI6a57wplAmL2UmE/VxjwqM7fVch8LuMtArj_n9dvQ4s
518K .terragrunt-cache/Oe4bddeyp6czI6a57wplAmL2UmE
5.2M .terragrunt-cache/HuEWN25ezv3gq4O4AnW32Q_cRrk/rMGBvHP1LfFG1CysJakDOKD_N7E
5.2M .terragrunt-cache/HuEWN25ezv3gq4O4AnW32Q_cRrk
5.1M .terragrunt-cache/FJ9XgfAqZ4CVhPoDACdfnneh5MQ/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/FJ9XgfAqZ4CVhPoDACdfnneh5MQ
5.1M .terragrunt-cache/MdEw_hom75zrGwZ6KF3fd6Nd2f8/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/MdEw_hom75zrGwZ6KF3fd6Nd2f8
5.1M .terragrunt-cache/I8u4zu5O3M7zvvJLapgf4H6oOlM/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/I8u4zu5O3M7zvvJLapgf4H6oOlM
5.1M .terragrunt-cache/bRqItD_kx05-uvA-wgJAZRxm-Ek/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/bRqItD_kx05-uvA-wgJAZRxm-Ek
5.2M .terragrunt-cache/mM9_RP1KrV_d332pAYmsbKJDGEc/rMGBvHP1LfFG1CysJakDOKD_N7E
5.2M .terragrunt-cache/mM9_RP1KrV_d332pAYmsbKJDGEc
5.2M .terragrunt-cache/kbVhkVgP5lz_GmIuhf6-E823wH4/rMGBvHP1LfFG1CysJakDOKD_N7E
5.2M .terragrunt-cache/kbVhkVgP5lz_GmIuhf6-E823wH4
5.1M .terragrunt-cache/Geu9JmX6UX-s3V5IVQKdepcm3ko/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/Geu9JmX6UX-s3V5IVQKdepcm3ko
5.1M .terragrunt-cache/9jvl4n7eKkxK-hfwAH6HTQ1Vf7s/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/9jvl4n7eKkxK-hfwAH6HTQ1Vf7s
5.1M .terragrunt-cache/GojBr0C5SSfzNk32R1JBn6oNA7o/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/GojBr0C5SSfzNk32R1JBn6oNA7o
5.1M .terragrunt-cache/Ikah_6N-NG1g9Iui7xvwBYOqMgw/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/Ikah_6N-NG1g9Iui7xvwBYOqMgw
5.1M .terragrunt-cache/vwB2UhcBaj8e7JyIf2fEoWV6A04/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/vwB2UhcBaj8e7JyIf2fEoWV6A04
5.1M .terragrunt-cache/yuR2rmGeXT7YhaYfT7K5jMENEpQ/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/yuR2rmGeXT7YhaYfT7K5jMENEpQ
5.1M .terragrunt-cache/Olx1u9kIH91FjbGeJAhj3O37jGo/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/Olx1u9kIH91FjbGeJAhj3O37jGo
5.2M .terragrunt-cache/ZRJLv7QHDp18T6iNvWkvyXf1DeM/rMGBvHP1LfFG1CysJakDOKD_N7E
5.2M .terragrunt-cache/ZRJLv7QHDp18T6iNvWkvyXf1DeM
5.1M .terragrunt-cache/qc8xOSXjw5krh-z65uktrADm26w/rMGBvHP1LfFG1CysJakDOKD_N7E
5.1M .terragrunt-cache/qc8xOSXjw5krh-z65uktrADm26w
199M .terragrunt-cache/
Notice the rMGBvHP1LfFG1CysJakDOKD_N7E
and its contents are duplicated many times (this represents the code from the remote repo). It seems like some caching logic of the terraform.source
string could be implemented fairly easily to prevent this duplication. Perhaps as an opt-in flag.
Since sourcing many modules from the same repo seems like the standard way to use terragrunt, this would be a big disk and network savings for anyone with more than a few modules.
Note that we tried to do this on our own by having terraform.source set to a run_cmd call that implements the module download and cloning in a cached manner by exchanging remote git URLs for file system paths. However, terragrunt still copies all the files from the local path for every module instead of linking them.
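To make the suggestion concrete, here is a rough sketch of the kind of shared source cache being proposed (this is not how Terragrunt works today; the cache location, the key derivation, and the use of symlinks are all assumptions, and it glosses over the fact that Terragrunt rewrites files inside the downloaded source):

```go
// Sketch only: cache each terraform.source repo once under a shared directory
// keyed by its URL, then symlink it into each unit's .terragrunt-cache instead
// of downloading and copying it per unit.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"log"
	"os"
	"os/exec"
	"path/filepath"
)

func cacheKey(sourceURL string) string {
	sum := sha256.Sum256([]byte(sourceURL))
	return hex.EncodeToString(sum[:8])
}

// ensureSource clones sourceURL into the shared cache once and links it into dest.
// It assumes dest does not exist yet; a real implementation would replace it.
func ensureSource(cacheRoot, sourceURL, dest string) error {
	cached := filepath.Join(cacheRoot, cacheKey(sourceURL))
	if _, err := os.Stat(cached); os.IsNotExist(err) {
		// A shallow clone keeps the cache small, as discussed earlier in the thread.
		cmd := exec.Command("git", "clone", "--depth=1", sourceURL, cached)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			return err
		}
	}
	if err := os.MkdirAll(filepath.Dir(dest), 0o755); err != nil {
		return err
	}
	// Link rather than copy, so many units pointing at the same repo share one copy.
	return os.Symlink(cached, dest)
}

func main() {
	err := ensureSource(
		"/home/user/.cache/terragrunt/sources",                       // hypothetical shared cache
		"https://github.com/terraform-aws-modules/terraform-aws-iam", // example repo
		".terragrunt-cache/example-unit/source",                      // per-unit link
	)
	if err != nil {
		log.Fatal(err)
	}
}
```

Because Terragrunt modifies files inside the downloaded source, linking like this would probably need to be opt-in, as the comment above suggests.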
Hi @fullykubed, thanks for the feedback! Your observations are quite convincing and perhaps we could also optimize the modules. I will think it over.