Don't make redundant copies of the same module package

hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.

https://www.terraform.io

Other

42.76k stars 9.56k forks source link

Don't make redundant copies of the same module package #29503

Open YuriGal opened 3 years ago

YuriGal commented 3 years ago

Current Terraform Version

1.0.5

Use-cases

We have a Github repository of modules with each module in its own subfolder. Module usage in a project looks something like this

module "alb" {
  source       = "git@github.com:org/repo.git//modules/alb_module?ref=1.52.0"
  name         = var.name
  subdomain    = local.subdomain
  idle_timeout = local.timeout
  environment  = module.environment
}

A project can reference quite a few of such modules from that repo. And for every referenced module a full package with full repo content is downloaded. This results in multiple identical copies of the repository being downloaded taking extra space and possibly causing other issues.

Proposal

I understand that this is a behavior by design, but it would be great if terraform could optimize it. Basically detect that multiple module references are pointing to the same source repository and same release version and download the package into one common location. Kinda like Yarn does with workspaces and common node_modules.

6017

-->

apparentlymart commented 3 years ago

Hi @YuriGal! Thanks for sharing this use-case.

The important design consideration here is that existing modules unfortunately expect to be able to create new files inside path.module, and thus each module must have its own separate directory so that multiple modules with the same source address won't conflict with one another.

We don't yet have a design that can address that need in a backward-compatible way, and thus we have so far retained the original design of having each module block generate its own local source directory. Any design to solve this will need to fit within the Terraform 1.0 Compatibility Promises.

An example of such a design might be for Terraform to analyze a module to determine whether it uses path.module and behave differently if so. However, that means that modules would be treated quite differently due to a relatively small detail about their implementation, and thus this could be potentially problematic for someone maintaining a module over time who inadvertently switches a module into or out of this special behavior.

This is also related to the existing issue #21308, which discusses a possible alternative to path.module where a module could write outside of its source directory, and thus avoid the problem I've described above. Had we been able to complete design and implementation of that prior to Terraform 1.0 then we could potentially have used it as a justification to deprecate and eventually break the pattern of writing into path.module, but that option is no longer available to us and thus even if we do implement something like the path.temp proposal there we would still be committed to preserve compatibility with modules that write files into path.module.

As we currently stand, Terraform's module installer does recognize situations where two modules have a trivially equal source address (where the syntax of the address is the same, as opposed to where the addresses could both refer to the same location with different syntax) and will copy the existing local directory rather than repeatedly downloading in that case.

That mechanism is intended to address the performance impact of repeatedly downloading the same package, but it doesn't address the cost of storing the module multiple times on local disk.

Terraform has historically assumed that local disk storage is cheap and thus there are various behaviors that spend more disk space than they technically need to in exchange for other benefits, such as a lightweight measure of isolation between configurations/modules.

We do recognize that there are situations where heavy use of local storage is problematic, such as in transient "scale to nothing" execution environments where the local filesystem is often transient and limited in size, but I do want to be clear that if you wish to use Terraform today you should design your use of it around the assumption that it will create a potentially large footprint in whatever filesystem contains your working directory, because these behaviors run deep and are unlikely to change in the near future.

fblgit commented 3 years ago

@apparentlymart What about creating the folder and instead of copying the files, on those OS+FS that allow it we just produce symlinks? Would that work ??

TBH, I'm not sure what kind of size we are talking about .. but AFAIK, if u follow best practices for both Terraform Modules & Git repositories..u shouldn't be on scenarios where the storage can become a problem?

joe-a-t commented 2 years ago

@apparentlymart regarding what I put in https://github.com/hashicorp/terraform/issues/31422, can you please elaborate on the concerns about compatibility? I'm unclear on how having a .terraform/sources directory that the .terraform/modules/modules.json points to would run into compatibility concerns when it seems to be basically the same runtime behavior as how Terraform handles local sources where you don't have a separate copy of the module source in the .terraform/modules/<name> directories.

apparentlymart commented 2 years ago

To be concrete about it, it is currently possible to write something like the following in a Terraform module and have it work, and so we are bound to keep this working even though I would assert that it's not a good idea to use the local filesystem for transient storage during a Terraform operation:

variable "content" {
  type = string
}

resource "local_file" "example" {
  content  = var.content
  filename = "${path.module}/temporary.txt"
}

data "local_file" "example" {
  filename = local_file.example.filename
}

output "content" {
  value = data.local_file.example.content
}

If the above were published as a shared module called from multiple separate module blocks in the same configuration today then each would get its own separate path.module and thus its own separate temporary.txt. If Terraform instead tried to make them all share the same directory, they would all race to write to and read from the same temporary.txt, causing unpredictable results.

The above example is of course contrived just to keep it simple but this applies to any situation where a module writes to a file inside its path.module and then either reads it itself or returns the path to a caller to be read. A more common real-world situation is the regrettably-designed archive_file data source, which creates files on disk during the planning phase which modules then pass to other resources, such as aws_lambda_function resources. I focused on local files above only because it's easier to play with the above example without needing cloud credentials to test with.

If we can find a design compromise that remains compatible with both the above and the more-intended use of path.module to read static files distributed as part of the module then we could potentially use it to meet this request, as long as it doesn't cause too much additional complexity or weird special cases into Terraform's design. I don't currently have such a design to propose, though.

joe-a-t commented 2 years ago

Ok, so sticking with the backwards compatibility on that, could we either:

Leave the .terraform/modules/... as a place where any files that Terraform writes can get written but the source is not download/copied into. Instead Terraform looks to see if a .terraform/modules/... exists for that module, and if so, read in both the contents of that .terraform/modules/... directory and the contents from the relevant .terraform/sources/... directory (as specified in a new part of the manifest.json) as if they all existed in the same directory. In this case, if someone does not write anything using path.module, the .terraform/modules would either not exist or be empty.
Create the existing .terraform/modules/... directory structure, but not copy over the source files and instead create a new file in the existing .terraform/modules/... directory that tells Terraform to also read in the contents of .terraform/sources/... when evaluating the directory.

Would either of those be an acceptable compromise and within the capabilities of Terraform without going too crazy?

apparentlymart commented 2 years ago

This (more reasonable, more common) pattern must also keep working:

resource "aws_s3_bucket_object" "example" {
  bucket = "example"
  key    = "anything"
  source = "${path.module}/bucket-content/anything"
}

In this case, bucket-content/anything is a file distributed as part of the module package (e.g. it's included in the git repository along with this .tf file).

The key design challenge here is that path.module today represents both "the directory where my source files are" and "a directory where I can write temporary files during my work". The second use-case was never an intended use of path.module, but one that existing modules have come to rely on and is therefore unfortunately de-facto covered by compatibility promises.

I'm not currently seeing any design that can allow path.module to refer to a shared directory for reads, but to separate directories for writes. I don't think your proposals here meet that requirement, although I might be misunderstanding; if so, it'd help if you can describe in more detail what exactly Terraform would be doing in each of these examples.

(There are some possible designs relying on OS-level features such as overlay filesystems, but Terraform does not typically run with suitable permissions to make use of those and they are not available in a standard way across all of the platforms we target.)

joe-a-t commented 2 years ago

So what I was thinking of for reads is that when Terraform sees path.module, it looks in .terraform/modules/.... If it does not find the desired directory/file, it then goes to the .terraform/sources/... directory and tries looking for it from there. So basically changing the current read from a "Try A, if not, then fail" to a "Try A, if not, Try B, if not, then fail".

apparentlymart commented 2 years ago

path.module is just a static string that gets interpolated into the path like any other, so Terraform must decide what path.module evaluates to once for the entire module, not separately for each use.

For example, consider that in my second case which contains an argument source = "${path.module}/bucket-content/anything" it is the AWS provider rather than Terraform Core that is actually reading the file, and indeed only the AWS provider knows that it's treating that string as a filename. Terraform Core's only contribution to this process is to replace ${path.module} with a string like .terraform/modules/foo, so that the AWS provider will see .terraform/modules/foo/bucket-content/anything and decide what to do with that string.

We know from the AWS provider documentation that the provider is going to treat that path as a filename to read from, but Terraform Core cannot know that: the provider could just send that path to a remote system and not treat it as a filename at all, or it might write something to that path.

joe-a-t commented 2 years ago

Ahhhh, I did not realize exactly where the boundary was there between Terraform Core and the providers. I did not realize that providers directly interact with the file system without going through Terraform Core. I thought maybe Terraform Core could handle the retry logic and implement it once for all providers, but it sounds like you're saying that all providers would need to implement the retry logic themselves (unless if maybe Terraform Core was intercepting those filesystem calls and able to handle the retry logic there? But that sounds like it would be introducing a ton of complexity that I completely understand wanting to avoid).

apparentlymart commented 2 years ago

Indeed... the only situation where Terraform Core is directly interacting with the filesystem is in function calls like file("${path.module}/foo.txt") and even in that case the interpolation of path.module as a fixed string happens before calling the function and so the function only sees the resulting path string, can't cannot tell that it was constructed from path.module.

joe-a-t commented 2 years ago

Just to clarify a bit further, the racing herd problem when writing files to path.module is an existing issue when/if people are using local paths for module sources. So anyone that is using path.module as part of write operations will likely see different behavior if they are doing so with local or remote source and using the module multiple times. I also want to note that even if people only have a single module block, but they have a for_each inside that block, they will also have this existing issue regardless of if they are using local or remote module source.

I actually don't see anything on https://www.terraform.io/language/expressions/references#filesystem-and-workspace-info warning about this behavior or advising against using path.module for write operations in a child module. I agree that any behavior change around this would break some modules that people have, but could also make an argument that they are relying on a bug to do something that Terraform has not promised to support since Terraform has not made any published promises about remote module sources intentionally behaving differently at runtime than local module source (AFAIK). Is that all an accurate and fair depiction of the current state?

joe-a-t commented 2 years ago

Also I opened https://github.com/hashicorp/terraform/issues/31441 as a documentation issue to add the risks with path.module that you've called out in this conversation to the documentation for path.module.

apparentlymart commented 2 years ago

Unfortunately we know that the existing uses are prevalent enough that we cannot use the lack of documentation as justification for the breaking change. The following language in our compatibility promises summarizes this compromise:

If the actual behavior of a feature differs from what we explicitly documented as the feature's behavior, we will usually treat that as a bug and change the feature to match the documentation, although we will avoid making such changes if they seem likely to cause broad compatibility problems. We cannot promise to always remain "bug-compatible" with previous releases, but we will consider such fixes carefully to minimize their impact.

This situation fails the above test in two ways:

We didn't explicitly document how Terraform would treat this situation, and so authors were left to make their own conclusions as to what the intended behavior was. Therefore in this case the implementation is what we are constrained by, rather than the documentation.
We know from prior research that there are many existing modules depending on this behavior, and so changing this would cause broad compatibility problems. (This goal of not copying a module for each call has been on the table since before Terraform v0.12, when we first tried to make the change requested here. The problem was so prevalent at that time that we were blocked from changing it even though the compatibility promises didn't apply yet, because of general pressure to reduce the amount of hurdles to upgrading from Terraform v0.11.)

You are right that there are already situations where using path.module in that way would not work. We are not compelled to make those work in future because they never worked in the past, but we are constrained by our compatibility promises to make the situations where it was working remain working, or at least to make the new behavior an opt-in so that existing modules can keep working the way they always did.

joe-a-t commented 2 years ago

Understood. Hopefully this refactor can make the list for v2.0.0 and I'm not sure what the timeline is for that, but hopefully it can come relatively soon (with relatively meaning before 2024). In the meantime we will probably have to do our own workarounds like I described as possibilities in https://github.com/hashicorp/terraform/issues/31422

gtmtech commented 2 years ago

@apparentlymart Is it possible to allow the user calling the module to specify the behaviour (symlink or copy) in the module block.

e.g.

module "foo" {
    source = "bar"
    download_method = "copy"   # or linked
}

This way its in the users hands whether to incur a copy at extra disk space, or linked with the disadvantages with files above.

In my case I am invoking the same module for every AWS account for every AWS region - thats about 800 invocations. It is adding up to quite a lot of disk space and quite slow to cleanup afterwards too, I'd love to be able to choose the behaviour

joe-a-t commented 2 years ago

If ^ was an option, would strongly prefer a global setting instead of needing to specify per module invocation.

apparentlymart commented 2 years ago

It seems to me that whether it's safe/correct to reuse the same directory for multiple instances of the same module is a property of the module itself rather than a property of the call to that module, and so if we were going to make this an opt-in sort of thing I expect it would be better for that opt-in to be inside the module being called rather than in the module block that calls the module.

It would essentially be a promise from the module developer that planning and applying the module will not cause any modifications to the module's source directory. We might consider helping to reinforce that by having Terraform mark the directory and the files inside as read-only.

However, I will say that one challenge with this approach which occurs to me from some initial thought experimenting is that there's a slight mismatch of concept here: the artifact Terraform is placing in the filesystem is the entire module package which contains the module being requested, and so it seems like a statement about this being read-only ought to be something that is scoped to the module package as a whole rather than to an individual module inside it. We don't have any precedent for metadata on module packages, so we'd need to figure out what is the best way to represent that.

sato-s commented 3 weeks ago

I'm using my own modules and I know what's inside.
I'm pretty sure there's no local file write and it's safe to re-use same module directory. But, Terraform makes 100 copies of same module.

If this happens, terraform plan takes intolerable amount of time in case of terraform cloud backend.

I really hope there's an option like https://github.com/hashicorp/terraform/issues/29503#issuecomment-1306062900

hashicorp / terraform

Don't make redundant copies of the same module package #29503

Current Terraform Version

Use-cases

Proposal

6017