Closed eeaton closed 2 weeks ago
@apeabody @daniel-cit can you take a look at this and check if my understanding is correct? (And if I'm off-base, any idea how else we might fix this?)
Hi @eeaton!
Terraform automatically picks the latest provider versions allowed by the version constraints during the init stage. So any post 5.38.0
release CI runs should use it, unless there is a maximum version constraint present (typically we only use major version for these, such as the <6.0.0
below).
For example:
TestBootstrap 2024-07-29T17:13:11Z command.go:185: Initializing provider plugins...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Finding hashicorp/time versions matching ">= 0.5.0"...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Finding hashicorp/google-beta versions matching ">= 3.43.0, >= 3.50.0, >= 3.67.0, >= 3.77.0, >= 4.11.0, >= 4.17.0, != 4.31.0, >= 4.64.0, >= 5.7.0, >= 5.22.0, < 6.0.0"...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Finding hashicorp/random versions matching ">= 2.1.0, >= 2.2.0, >= 3.1.0, ~> 3.4"...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Finding hashicorp/null versions matching ">= 2.1.0"...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Finding hashicorp/external versions matching ">= 2.2.2"...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Finding hashicorp/google versions matching ">= 3.33.0, >= 3.43.0, >= 3.50.0, >= 3.53.0, >= 3.67.0, >= 3.77.0, >= 3.83.0, >= 4.17.0, >= 4.25.0, >= 4.28.0, != 4.31.0, >= 4.64.0, >= 5.7.0, >= 5.22.0, < 6.0.0"...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Installing hashicorp/null v3.2.2...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Installed hashicorp/null v3.2.2 (signed by HashiCorp)
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Installing hashicorp/external v2.3.3...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Installed hashicorp/external v2.3.3 (signed by HashiCorp)
TestBootstrap 2024-07-29T17:13:12Z command.go:185: - Installing hashicorp/google v5.38.0...
TestBootstrap 2024-07-29T17:13:13Z command.go:185: - Installed hashicorp/google v5.38.0 (signed by HashiCorp)
TestBootstrap 2024-07-29T17:13:13Z command.go:185: - Installing hashicorp/time v0.12.0...
TestBootstrap 2024-07-29T17:13:13Z command.go:185: - Installed hashicorp/time v0.12.0 (signed by HashiCorp)
TestBootstrap 2024-07-29T17:13:14Z command.go:185: - Installing hashicorp/google-beta v5.38.0...
TestBootstrap 2024-07-29T17:13:15Z command.go:185: - Installed hashicorp/google-beta v5.38.0 (signed by HashiCorp)
TestBootstrap 2024-07-29T17:13:15Z command.go:185: - Installing hashicorp/random v3.6.2...
TestBootstrap 2024-07-29T17:13:16Z command.go:185: - Installed hashicorp/random v3.6.2 (signed by HashiCorp)
https://github.com/GoogleCloudPlatform/terraform-google-cloud-functions/pull/135 only updates examples, so I doubt that it will make an impact.
If you have a recent CI test failure, we can determine the TPG version used for the failing stage.
Looks like this error occurs with TPG v5.39: https://github.com/terraform-google-modules/terraform-example-foundation/pull/1293#issuecomment-2258890806
Lets try to remove the LOG_EXECUTION_ID
that was added to see if it can pass the permadiff #1307
the version of the provider used in the build is one that includes fix that added in 27 days ago https://github.com/hashicorp/terraform-provider-google/blob/main/google/services/cloudfunctions2/resource_cloudfunctions2_function.go#L36C2-L48C14
it is in both 5.38 and 5,39
Lets try to remove the
LOG_EXECUTION_ID
that was added to see if it can pass the permadiff #1307the version of the provider used in the build is one that includes fix that added in 27 days ago https://github.com/hashicorp/terraform-provider-google/blob/main/google/services/cloudfunctions2/resource_cloudfunctions2_function.go#L36C2-L48C14
it is in both 5.38 and 5,39
Failed even after removing the LOG_EXECUTION_ID
Error: Received unexpected error:
FatalError{Underlying: error while running command: exit status 1;
Error: Provider produced inconsistent final plan
When expanding the plan for
module.cai_monitoring.module.cloud_function.google_cloudfunctions2_function.function
to include new values learned so far during apply, provider
"registry.terraform.io/hashicorp/google" produced an invalid new value for
.service_config[0].environment_variables: was null, but now
cty.MapVal(map[string]cty.Value{
"ROLES":cty.StringVal("roles/owner,roles/editor,roles/resourcemanager.organizationAdmin,roles/compute.networkAdmin,roles/compute.orgFirewallPolicyAdmin"),
"SOURCE_ID":cty.StringVal("organizations/REDACTED/sources/2027208248522035993")
}).
This is a bug in the provider, which should be reported in the provider's own
issue tracker.}
I will take a look at the provider and try to see what is the issue with the DiffSuppressFunc
@eeaton @apeabody
Foundation code originally had two entries in the runtime env variables, ROLES
and SOURCE_ID
service_config = {
service_account_email = google_service_account.cloudfunction.email
runtime_env_variables = {
ROLES = join(",", var.roles_to_monitor)
SOURCE_ID = google_scc_source.cai_monitoring.id
}
}
Using the code from the latest version of the provider with the permadiff fix we have
func environmentVariablesDiffSuppress(k, old, new string, d *schema.ResourceData) bool {
if k == "service_config.0.environment_variables.LOG_EXECUTION_ID" && new == "" {
return true
}
// Let diff be determined by environment_variables (above)
if strings.HasPrefix(k, "service_config.0.environment_variables.%") {
return true
}
// For other keys, don't suppress diff.
return false
}
This part if k == "service_config.0.environment_variables.LOG_EXECUTION_ID" && new == ""
is skipping comparison for the key LOG_EXECUTION_ID
on the map
This part if strings.HasPrefix(k, "service_config.0.environment_variables.%")
is ignoring diferences in the size of the old and new environment variables map
when we run Terraform plan we get
# module.cai_monitoring.module.cloud_function.google_cloudfunctions2_function.function will be created
+ resource "google_cloudfunctions2_function" "function" {
...
+ service_config {
+ all_traffic_on_latest_revision = true
+ available_cpu = "1"
+ available_memory = "256M"
+ gcf_uri = (known after apply)
+ ingress_settings = "ALLOW_ALL"
+ max_instance_count = 100
+ max_instance_request_concurrency = (known after apply)
+ min_instance_count = 1
+ service = (known after apply)
+ service_account_email = (known after apply)
+ timeout_seconds = 60
+ uri = (known after apply)
}
}
and the service_config
map is missing the evironment_variables
map
if we remove the fix we get:
+ resource "google_cloudfunctions2_function" "function" {
...
+ service_config {
+ all_traffic_on_latest_revision = true
+ available_cpu = "1"
+ available_memory = "256M"
+ environment_variables = (known after apply) <=============
+ gcf_uri = (known after apply)
+ ingress_settings = "ALLOW_ALL"
+ max_instance_count = 100
+ max_instance_request_concurrency = (known after apply)
+ min_instance_count = 1
+ service = (known after apply)
+ service_account_email = (known after apply)
+ timeout_seconds = 60
+ uri = (known after apply)
}
}
And the environment_variables
is back at the plan for the configuration. It has (known after apply)
because the values of the map depend on resources that have not been created yet
Adding a log output in the code we can see the state of the variables
[DEBUG] provider.terraform-provider-google: 2024/07/30 23:56:19 [DEBUG] DiffSuppress k:"service_config.0.environment_variables.%" old:"" new:""
service_config.0.environment_variables.%
has the length of the Map and old and new values are empty because they do not exist yet.
somehow this code is saying that on the plan case there is no diff between not having a map and the new map.
changing the condition
from
if strings.HasPrefix(k, "service_config.0.environment_variables.%")
to
if strings.HasPrefix(k, "service_config.0.environment_variables.%") && new != ""
seems to fix the issue, and plan
, apply
, apply with changes
, and destroy
work as expected.
The same strategy of ignoring the size is used in other places to deal with labels that have key value pair added by the API when returning results.
I would like to pinpoint why this code snippet is failing in this case but seems to work in other cases.
there is an open issue in the provider for this https://github.com/hashicorp/terraform-provider-google/issues/18747
Thanks for the investigation and detailed writeup on the provider issue, Daniel. I've also escalated this on the internal bug with your assessment.
As a quick fix to unblock this repo, I'm trying PR #1311 to pin the provider version used in the cai_monitoring module to the last good version
PR opened in magic-modules https://github.com/GoogleCloudPlatform/magic-modules/pull/11333
TL;DR
CI tests have 100% failure rate since July 13th.
Tests are consistently broken by a perma-diff in Cloud Funcitons, with an error like the following:
Expected behavior
No response
Observed behavior
No response
Terraform Configuration
Terraform Version
Additional information
I'm having a hard time identifying what we need to update or modify to resolve this. It looks like the root issue was in the provider, here's what I've found so far:
There was a provider fix 27 days ago to fix a permadiff:
There is also a PR in terraform-google-cloud-function (which we reference) to use latest source has not been approved, from 27 days ago: https://github.com/GoogleCloudPlatform/terraform-google-cloud-functions/pull/135
From the related dates, I expect that the PR 135 needs to be merged into the CFT module, then we can update with this repo with the CFT module. However, I'm not certain because I can't find where the provider fix is pulled into the CFT module.