hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

1.4.0+ breaks shared provider cache #32901

Open dbadrak opened 1 year ago

dbadrak commented 1 year ago

Terraform Version

% TF_CLI_CONFIG_FILE=../.tf-control.tfrc terraform_1.4.2 version
Terraform v1.4.2
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v4.48.0
+ provider registry.terraform.io/hashicorp/external v2.2.3
+ provider registry.terraform.io/hashicorp/local v2.2.3
+ provider registry.terraform.io/hashicorp/null v3.2.1
+ provider registry.terraform.io/hashicorp/random v3.4.3
+ provider registry.terraform.io/hashicorp/template v2.2.0
+ provider registry.terraform.io/hashicorp/time v0.9.1
+ provider registry.terraform.io/trevex/ldap v0.5.4

Terraform Configuration Files

# .tf-control.tfrc
plugin_cache_dir   = "/data/terraform/terraform.d/plugin-cache"
provider_installation {
  filesystem_mirror {
    path    = "/data/terraform/terraform.d/providers"
    include = [ "*/*/*" ]
  }
  direct {
    include = [ "*/*/*" ]
  }
}
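For context, a shared, group-writable cache like the one above is typically prepared with a setgid directory and default ACLs; a minimal sketch (paths and the ACL line are illustrative, not taken from the report):

```shell
# Sketch of a shared plugin-cache directory setup (illustrative paths).
# The setgid bit (2) makes new entries inherit the directory's group;
# a default ACL would additionally grant the group write access.
dir=$(mktemp -d)/plugin-cache     # stand-in for /data/terraform/terraform.d/plugin-cache
mkdir -p "$dir"
chmod 2775 "$dir"                 # rwxrwsr-x
# setfacl -d -m g::rwX "$dir"     # default ACL (requires the acl package)
stat -c '%a' "$dir"               # prints 2775
```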

Debug Output

Successful init upgrade (1.3.9)

% TF_CLI_CONFIG_FILE=../.tf-control.tfrc terraform_1.3.9 init -upgrade
Initializing the backend...

Initializing provider plugins...
- terraform.io/builtin/terraform is built in to Terraform
- Finding hashicorp/template versions matching ">= 1.0.0, >= 2.0.0"...
- Finding hashicorp/aws versions matching ">= 3.0.0, >= 3.66.0"...
- Finding hashicorp/external versions matching ">= 1.0.0, >= 1.1.0, >= 2.2.0"...
- Finding hashicorp/null versions matching ">= 1.0.0, >= 3.0.0"...
- Finding trevex/ldap versions matching ">= 0.5.4"...
- Finding latest version of hashicorp/time...
- Finding hashicorp/random versions matching ">= 1.0.0, >= 3.0.0"...
- Finding hashicorp/local versions matching ">= 1.0.0"...
- Using previously-installed hashicorp/external v2.3.1
- Using previously-installed hashicorp/null v3.2.1
- Using previously-installed trevex/ldap v0.5.4
- Using previously-installed hashicorp/time v0.9.1
- Using previously-installed hashicorp/random v3.4.3
- Using previously-installed hashicorp/local v2.4.0
- Using previously-installed hashicorp/template v2.2.0
- Using previously-installed hashicorp/aws v4.59.0

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

Failed init upgrade (1.4.0+)

% TF_CLI_CONFIG_FILE=../.tf-control.tfrc terraform_1.4.2 init -upgrade
Initializing the backend...
Upgrading modules...
Downloading ... # removed

Initializing provider plugins...
- terraform.io/builtin/terraform is built in to Terraform
- Finding hashicorp/external versions matching ">= 1.0.0, >= 1.1.0, >= 2.2.0"...
- Finding trevex/ldap versions matching ">= 0.5.4"...
- Finding hashicorp/null versions matching ">= 1.0.0, >= 3.0.0"...
- Finding hashicorp/local versions matching ">= 1.0.0"...
- Finding hashicorp/random versions matching ">= 1.0.0, >= 3.0.0"...
- Finding latest version of hashicorp/time...
- Finding hashicorp/template versions matching ">= 1.0.0, >= 2.0.0"...
- Finding hashicorp/aws versions matching ">= 3.0.0, >= 3.66.0"...
- Using previously-installed trevex/ldap v0.5.4
- Using previously-installed hashicorp/null v3.2.1
- Installing hashicorp/local v2.4.0...
- Using previously-installed hashicorp/random v3.4.3
- Using previously-installed hashicorp/time v0.9.1
- Using previously-installed hashicorp/template v2.2.0
- Installing hashicorp/aws v4.59.0...
- Installing hashicorp/external v2.3.1...
╷
│ Error: Failed to install provider
│
│ Error while installing hashicorp/local v2.4.0: chmod /data/terraform/terraform.d/plugin-cache/registry.terraform.io/hashicorp/local/2.4.0/linux_amd64/terraform-provider-local_v2.4.0_x5: operation not permitted
╵
╷
│ Error: Failed to install provider
│
│ Error while installing hashicorp/aws v4.59.0: open /data/terraform/terraform.d/plugin-cache/registry.terraform.io/hashicorp/aws/4.59.0/linux_amd64/terraform-provider-aws_v4.59.0_x5: permission denied

The first file is owned by a different user than the one running this script. For the second file, the user running the script does not have write access. However, the umask is 002, so group write access should have been set, allowing the ACL to offer the write capability.
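To illustrate the ownership point: POSIX chmod(2) succeeds only for the file's owner (or root), regardless of write permission, so group or ACL write access cannot help here. A small sketch (using a temp file we own; the EPERM case is described in comments):

```shell
# chmod(2) is owner-only: write permission (even via ACLs) is not enough.
f=$(mktemp)                # a file we own, so chmod succeeds
chmod 644 "$f"
stat -c '%a' "$f"          # prints 644
# For a cache file owned by another user, the same chmod call fails
# with EPERM even if "test -w" reports the file as writable.
rm -f "$f"
```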

Expected Behavior

Provider should install and continue.

Actual Behavior

Provider install fails, and we cannot continue.

Steps to Reproduce

  1. terraform init -upgrade

Additional Context

We use a shared plugin cache and provider tree, with permissions and ACLs set on the files and directories so that all our users (who belong to a specific group) can write into this directory. Beginning with 1.4.0 (and still present in 1.4.2), we get a failure on chmod() of the file, because the user writing the file is not the owner.

It appears the umask is not being honored on create. An strace shows Terraform setting the mode to 0755 (vs. 0775, which is what I would expect for an executable with a umask of 002). The chmod will fail if the owner of the file is not the user running it. If the owner IS the user, I would still expect a 775 permission rather than 755.

275964 12:13:48 fchmodat(AT_FDCWD, "/data/terraform/terraform.d/plugin-cache/registry.terraform.io/hashicorp/local/2.4.0/linux_amd64/terraform-provider-local_v2.4.0_x5", 0755 <unfinished ...>
275961 12:13:48 <... nanosleep resumed>NULL) = 0
275961 12:13:48 nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
275964 12:13:48 <... fchmodat resumed>) = -1 EPERM (Operation not permitted)

So, two issues, it seems:

  1. chmod() during provider installation does not honor the umask
  2. chmod() is attempted on files the current user does not own
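For reference, honoring a umask means clearing the masked bits from the requested mode; with umask 002 an executable extracted with mode 0777 would end up 0775 (group-writable), matching the expectation above. A quick sketch of the arithmetic:

```shell
# mode & ~umask: how a umask of 002 should affect an executable mode.
umask_bits=002
requested=0777
printf 'effective mode: %o\n' $(( requested & ~umask_bits ))   # prints: effective mode: 775
```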

References

No response

apparentlymart commented 1 year ago

Hi @dbadrak! Thanks for reporting this.

I don't think Terraform v1.4 should have changed the details of what file modes Terraform uses when extracting the plugin packages, but Terraform v1.4 does change some of the implementation details about when terraform init -upgrade would try to write new content into the cache directory.

It seems then that, as you noted, the root problem here is not really with the plugin cache directory behavior but rather that Terraform treats all of these directories as if they belong to and are used by only a single user, but in your case you are expecting to share the plugin cache directory between multiple users who should all be able to write into it and execute from it.

I suspect (but haven't yet verified) that the file mode here is being set by the upstream library Terraform uses for extracting the provider package zip files, which does some custom work related to umask: https://github.com/hashicorp/go-getter/blob/91e93376c7e9720dc7fd5031878c6ff0430b0631/get_file_copy.go#L50-L53

The caller to this in Terraform seems to be just setting that umask argument to all zeroes:

https://github.com/hashicorp/terraform/blob/bd75dade9c6e2e7ddfd943977e4875bb45091315/internal/providercache/package_install.go#L136-L139

...and so therefore the mode created on disk is presumably exactly the mode recorded in the extended attributes in the zip file, which is under the control of the provider developer who prepared the zip file.

I don't think this fully explains the problem because Terraform itself is also trying to refresh an existing cache to make it match the archive and so is effectively trying to chmod a file that was already present and in your case owned by a different user.

As currently designed the provider cache is not designed to be shared between different users on the same system, which extends from the idea that CLI configuration is also typically set on a per-user basis, but I agree it would be a good improvement to officially support this configuration. In the meantime though to get a working configuration you will need to use a separate cache directory per user who runs Terraform; that this was working on older versions was only by chance due to relying on some implementation details that have changed in v1.4 to ensure that Terraform is always able to update the dependency lock file correctly. Since the old implementation details were apparently working for your situation by coincidence, you may be able to work around this by using the temporary option to restore the previous implementation details, even though your reason for enabling it would not be related to the dependency lock file as that documentation assumes.

This also seems related to #31964, which is about making it safe to concurrently run multiple Terraform processes that might interact with the same cache directory. Given that the two enhancements would be made in essentially the same part of Terraform, it would probably make sense to work on them together as a single PR. Having multiple users sharing the same cache directory also seems like it would increase the likelihood of two Terraform processes trying to update the directory concurrently.
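The "temporary option" mentioned above is presumably the TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE environment variable (from https://github.com/hashicorp/terraform/pull/32494, also noted later in this thread); a usage sketch:

```shell
# Opt back into the pre-1.4 plugin-cache behavior, at the cost of the
# dependency-lock-file guarantees the new behavior was added for.
export TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE=1
# terraform init -upgrade
echo "$TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE"   # prints: 1
```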

dbadrak commented 1 year ago

> (quoted @apparentlymart's reply above in full)

Thanks. I don't think I read anything discouraging use of a global (shared) provider cache among multiple users. We've done this because the provider binaries are very large, and we've got a single system with upwards of 50 users using TF configurations across 40+ accounts, with lots of directories.

I suppose the short-term solution is to update the configuration to use a per-user plugin cache until some multi-user, concurrency-safe mechanism is available. Is such a thing a near-term possibility? My take from your references is that it's not trivial.

I've currently dropped back to 1.3.9, but I'll test out the per-user cache change.
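The per-user fallback amounts to pointing each user at a cache directory they own, either in the CLI config file or via the environment; a sketch:

```shell
# Per-user plugin cache: a directory owned by the invoking user,
# as an alternative to plugin_cache_dir in the CLI config file.
mkdir -p "$HOME/.terraform.d/plugin-cache"
export TF_PLUGIN_CACHE_DIR="$HOME/.terraform.d/plugin-cache"
echo "$TF_PLUGIN_CACHE_DIR"
```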

reegnz commented 1 year ago

I'm also seeing this error after upgrading to 1.4.2:

Initializing provider plugins...
- Finding hashicorp/aws versions matching ">= 4.9.0, ~> 4.9"...
- Installing hashicorp/aws v4.60.0...
╷
│ Error: Failed to install provider
│
│ Error while installing hashicorp/aws v4.60.0: chmod
│ /home/reegnz/.terraform.d/plugin-cache/registry.terraform.io/hashicorp/aws/4.60.0/linux_amd64/terraform-provider-aws_v4.60.0_x5: operation not permitted
╵

We're using a shared provider cache directory, and a system user is caching the providers daily. This used to work on older terraform versions.

My current terraform provider config is very basic:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.9"
    }
  }
}

I'm using vanilla terraform init, no flags at all. It definitely feels like a bug; even if it's an 'undocumented' feature that it worked in the past, I believe it's still a regression.

that this was working on older versions was only by chance due to relying on some implementation details that have changed in v1.4 to ensure that Terraform is always able to update the dependency lock file correctly.

This doesn't change the fact that 1.4.x broke backward compatibility in how plugin_cache_dir can be used. There were no docs discouraging use of a shared folder as the cache, so at the very least the docs should discourage that use-case.

In the end I also downgraded to 1.3.9.

daniel-anova commented 1 year ago

Experiencing something similar when using terraform 1.4 with terragrunt 0.45.0 and TF_PLUGIN_CACHE_DIR defined.

I get the following errors:

❯ terragrunt run-all init
INFO[0020] The stack at /home/<redacted> will be processed in the following order for command init:
Group 1
- Module /home/<redacted>

╷
│ Error: Required plugins are not installed
│
│ The installed provider plugins are not consistent with the packages
│ selected in the dependency lock file:
│   - registry.terraform.io/hashicorp/azurerm: the cached package for registry.terraform.io/hashicorp/azurerm 3.50.0 (in .terraform/providers) does not match any of the checksums recorded in the dependency lock file
│
│ Terraform uses external plugins to integrate with a variety of different
│ infrastructure services. To download the plugins required for this
│ configuration, run:
│   terraform init
╵
╷
│ Error: Required plugins are not installed
│
│ The installed provider plugins are not consistent with the packages
│ selected in the dependency lock file:
│   - registry.terraform.io/hashicorp/azurerm: the cached package for registry.terraform.io/hashicorp/azurerm 3.50.0 (in .terraform/providers) does not match any of the checksums recorded in the dependency lock file
│
│ Terraform uses external plugins to integrate with a variety of different
│ infrastructure services. To download the plugins required for this
│ configuration, run:
│   terraform init
╵
# ... repeating the same error a few more times until it fully fails

Note that in the errors it's trying to download the latest provider instead of azurerm 3.35, which is the version required by the module.

Unsetting TF_PLUGIN_CACHE_DIR before running terraform init allows it to work as it did before 1.4, using the correct versions.

❯ unset TF_PLUGIN_CACHE_DIR
❯ terragrunt run-all init
INFO[0021] The stack at /home/<redacted> will be processed in the following order for command init:
Group 1
- Module /home/<redacted>

Initializing the backend...

Successfully configured the backend "azurerm"! Terraform will automatically
use this backend unless the backend configuration changes.
Initializing modules...

So it doesn't seem to be a simple file-ownership issue, as I'm the only user.

lingwooc commented 1 year ago

+1. Terraform 1.4+ (via Terragrunt) completely breaks caching for us and we are constantly getting the above errors. Nothing is ever found in the cache.

We also find when we specify

    google = {
      source  = "hashicorp/google"
      version = "= 4.59.0"
    }

we get an intermittent checksum error for 4.60.0. I've not managed to pin down exactly what makes it happen or go away, but I am deleting caches and lock files.

This is on Ubuntu 20.04 on WSL2 via Terragrunt; 1.3.9 works fine.

BondLi1 commented 1 year ago

I get a similar problem in 1.4.4. I use the provider_installation and plugin_cache_dir configuration. When I tried to upgrade from 1.3.7 to 1.4.4, I got an intermittent error like the screenshot below.

(screenshot omitted)

I use Terragrunt 0.44.5 with Terraform 1.4.4; I am not sure whether the problem comes from Terragrunt or Terraform. There is no problem with Terragrunt 0.42.7 and Terraform 1.3.7.

UncleGedd commented 1 year ago

Similar problem running on EKS. Terraform version 1.4.6, Terragrunt version 0.45.6.

(screenshot omitted)

lorengordon commented 1 year ago

We also just ran into this problem trying to upgrade to tf 1.4. In our case, we operate a monorepo with many tf configs, and the build system operates across all of them. To reduce network calls, we stage the provider bundle on the build system before running terraform. We did that by generating the bundle out of band, zipping, and rehosting it. Then when the build system runs the job, we download the bundle and extract it to the cache directory. We then run init -upgrade=false -plugin-dir=${TF_PLUGIN_CACHE_DIR} to ensure tf doesn't try to connect out.

I explain this mostly to highlight that this isn't a multi-user shared setup, just one where terraform init should not be fetching providers from the Internet. However, with tf 1.4, this setup fails:

Error while installing hashicorp/aws v4.67.0: cannot install existing
provider directory
/home/build-user/.terraform.d/plugin-cache/registry.terraform.io/hashicorp/aws/4.67.0/linux_amd64
to itself
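The staging flow described above could be sketched roughly like this (the bundle URL and paths are hypothetical, and the network/terraform steps are shown as comments):

```shell
# Stage a pre-built provider bundle into the plugin cache, then init
# offline against it (illustrative; no real bundle URL in the report).
cache="$HOME/.terraform.d/plugin-cache"
mkdir -p "$cache"
# curl -fsSL https://example.internal/providers.zip -o /tmp/providers.zip
# unzip -o /tmp/providers.zip -d "$cache"
# terraform init -upgrade=false -plugin-dir="$cache"
test -d "$cache" && echo staged   # prints: staged
```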
lorengordon commented 1 year ago

> (quoted my previous comment above in full)

Fwiw, we were able to fix things in our usage by unsetting the TF_PLUGIN_CACHE_DIR environment variable, and instead exporting TF_CLI_ARGS_init...

export TF_CLI_ARGS_init="-no-color -upgrade=false -plugin-dir=${HOME}/.terraform.d/plugin-cache"
mkusmiy commented 1 year ago

Setting the TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE env var for my terraform init resolved it for me when working with TF v1.4.6 and v1.5.1 locally: https://github.com/hashicorp/terraform/pull/32494

BondLi1 commented 1 year ago

> (quoted my previous comment above in full)

Weirdly, today I tried to upgrade again, this time to Terraform 1.4.6. I did not set the TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE env var; I just made sure all the provider versions I need were already cached in the local shared cache directory before running Terraform, and the problem was gone. So I don't know whether 1.4.6 fixed this problem, or whether pre-caching all the needed provider versions fixed it.

BondLi1 commented 1 year ago

> (quoted my previous comments above)

I tested: if there is nothing in the local shared cache directory and I concurrently run multiple Terraform instances (4 instances) against the same directory, the problem is still there.

I found that while multiple Terraform instances are running, the AWS provider cache gets very big. Normally the AWS provider cache should be 346M, but I see it at 819M while multiple instances are running.

So I guess the cause is that, because the AWS provider cache is very big, multiple concurrent Terraform instances download the AWS provider and write it to the same local cache file at the same time. The cache file is therefore being modified by other instances, and can't match any of the checksums recorded in the dependency lock file when another instance reads it.

So, I think making sure all the provider versions I need are already cached in the local shared cache directory before running Terraform can bypass this issue.
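Given the race described above, one mitigation (beyond pre-warming the cache) would be to serialize the runs yourself with an advisory lock; a rough sketch using flock from util-linux (lock path and command are illustrative):

```shell
# Serialize access to the shared plugin cache so that concurrent runs
# do not write the same provider file at the same time.
lock=/tmp/tf-plugin-cache.lock
flock "$lock" sh -c 'echo "terraform init would run here"'
```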

(screenshots omitted)