terraform-google-modules / terraform-example-foundation

Shows how the CFT modules can be composed to build a secure cloud foundation
https://cloud.google.com/architecture/security-foundations
Apache License 2.0
1.21k stars 705 forks source link

CI broken by perma-diff in Cloud Functions #1306

Closed eeaton closed 2 weeks ago

eeaton commented 1 month ago

TL;DR

CI tests have 100% failure rate since July 13th.

Tests are consistently broken by a perma-diff in Cloud Funcitons, with an error like the following:

Step #7 - "converge-org": TestOrg 2024-07-29T17:51:04Z retry.go:99: Returning due to fatal error: FatalError{Underlying: error while running command: exit status 1; 
Step #7 - "converge-org": Error: Provider produced inconsistent final plan
Step #7 - "converge-org": 
Step #7 - "converge-org": When expanding the plan for
Step #7 - "converge-org": module.cai_monitoring.module.cloud_function.google_cloudfunctions2_function.function
Step #7 - "converge-org": to include new values learned so far during apply, provider
Step #7 - "converge-org": "registry.terraform.io/hashicorp/google" produced an invalid new value for
Step #7 - "converge-org": .service_config[0].environment_variables: was null, but now
Step #7 - "converge-org": cty.MapVal(map[string]cty.Value{"LOG_EXECUTION_ID":cty.StringVal("true"),
Step #7 - "converge-org": "ROLES":cty.StringVal("roles/owner,roles/editor,roles/resourcemanager.organizationAdmin,roles/compute.networkAdmin,roles/compute.orgFirewallPolicyAdmin"),
Step #7 - "converge-org": "SOURCE_ID":cty.StringVal("organizations/943740911108/sources/17068193223143848275")}).
Step #7 - "converge-org": 
Step #7 - "converge-org": This is a bug in the provider, which should be reported in the provider's own
Step #7 - "converge-org": issue tracker.}
Step #7 - "converge-org":     apply.go:34: 
Step #7 - "converge-org":           Error Trace:    /builder/home/go/pkg/mod/github.com/gruntwork-io/terratest@v0.46.16/modules/terraform/apply.go:34
Step #7 - "converge-org":                                       /builder/home/go/pkg/mod/github.com/!google!cloud!platform/cloud-foundation-toolkit/infra/blueprint-test@v0.16.0/pkg/tft/terraform.go:571
Step #7 - "converge-org":                                       /workspace/test/integration/org/org_test.go:101
Step #7 - "converge-org":                                       /builder/home/go/pkg/mod/github.com/!google!cloud!platform/cloud-foundation-toolkit/infra/blueprint-test@v0.16.0/pkg/tft/terraform.go:630
Step #7 - "converge-org":                                       /builder/home/go/pkg/mod/github.com/!google!cloud!platform/cloud-foundation-toolkit/infra/blueprint-test@v0.16.0/pkg/tft/terraform.go:669
Step #7 - "converge-org":                                       /builder/home/go/pkg/mod/github.com/!google!cloud!platform/cloud-foundation-toolkit/infra/blueprint-test@v0.16.0/pkg/utils/stages.go:31
Step #7 - "converge-org":                                       /builder/home/go/pkg/mod/github.com/!google!cloud!platform/cloud-foundation-toolkit/infra/blueprint-test@v0.16.0/pkg/tft/terraform.go:669

Expected behavior

No response

Observed behavior

No response

Terraform Configuration

the automated Cloud Build suite for this repo

Terraform Version

the automated Cloud Build suite for this repo

Additional information

I'm having a hard time identifying what we need to update or modify to resolve this. It looks like the root issue was in the provider, here's what I've found so far:

There was a provider fix 27 days ago to fix a permadiff:

There is also a PR in terraform-google-cloud-function (which we reference) to use latest source has not been approved, from 27 days ago: https://github.com/GoogleCloudPlatform/terraform-google-cloud-functions/pull/135

From the related dates, I expect that the PR 135 needs to be merged into the CFT module, then we can update with this repo with the CFT module. However, I'm not certain because I can't find where the provider fix is pulled into the CFT module.

eeaton commented 1 month ago

@apeabody @daniel-cit can you take a look at this and check if my understanding is correct? (And if I'm off-base, any idea how else we might fix this?)

apeabody commented 1 month ago

Hi @eeaton!

Terraform automatically picks the latest provider versions allowed by the version constraints during the init stage. So any post 5.38.0 release CI runs should use it, unless there is a maximum version constraint present (typically we only use major version for these, such as the <6.0.0 below).

For example:

TestBootstrap 2024-07-29T17:13:11Z command.go:185: Initializing provider plugins...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Finding hashicorp/time versions matching ">= 0.5.0"...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Finding hashicorp/google-beta versions matching ">= 3.43.0, >= 3.50.0, >= 3.67.0, >= 3.77.0, >= 4.11.0, >= 4.17.0, != 4.31.0, >= 4.64.0, >= 5.7.0, >= 5.22.0, < 6.0.0"...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Finding hashicorp/random versions matching ">= 2.1.0, >= 2.2.0, >= 3.1.0, ~> 3.4"...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Finding hashicorp/null versions matching ">= 2.1.0"...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Finding hashicorp/external versions matching ">= 2.2.2"...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Finding hashicorp/google versions matching ">= 3.33.0, >= 3.43.0, >= 3.50.0, >= 3.53.0, >= 3.67.0, >= 3.77.0, >= 3.83.0, >= 4.17.0, >= 4.25.0, >= 4.28.0, != 4.31.0, >= 4.64.0, >= 5.7.0, >= 5.22.0, < 6.0.0"...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Installing hashicorp/null v3.2.2...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Installed hashicorp/null v3.2.2 (signed by HashiCorp)
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Installing hashicorp/external v2.3.3...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Installed hashicorp/external v2.3.3 (signed by HashiCorp)
TestBootstrap 2024-07-29T17:13:12Z command.go:185: - Installing hashicorp/google v5.38.0...
TestBootstrap 2024-07-29T17:13:13Z command.go:185: - Installed hashicorp/google v5.38.0 (signed by HashiCorp)
TestBootstrap 2024-07-29T17:13:13Z command.go:185: - Installing hashicorp/time v0.12.0...
TestBootstrap 2024-07-29T17:13:13Z command.go:185: - Installed hashicorp/time v0.12.0 (signed by HashiCorp)
TestBootstrap 2024-07-29T17:13:14Z command.go:185: - Installing hashicorp/google-beta v5.38.0...
TestBootstrap 2024-07-29T17:13:15Z command.go:185: - Installed hashicorp/google-beta v5.38.0 (signed by HashiCorp)
TestBootstrap 2024-07-29T17:13:15Z command.go:185: - Installing hashicorp/random v3.6.2...
TestBootstrap 2024-07-29T17:13:16Z command.go:185: - Installed hashicorp/random v3.6.2 (signed by HashiCorp)

https://github.com/GoogleCloudPlatform/terraform-google-cloud-functions/pull/135 only updates examples, so I doubt that it will make an impact.

If you have a recent CI test failure, we can determine the TPG version used for the failing stage.

apeabody commented 1 month ago

Looks like this error occurs with TPG v5.39: https://github.com/terraform-google-modules/terraform-example-foundation/pull/1293#issuecomment-2258890806

daniel-cit commented 1 month ago

Lets try to remove the LOG_EXECUTION_ID that was added to see if it can pass the permadiff #1307

the version of the provider used in the build is one that includes fix that added in 27 days ago https://github.com/hashicorp/terraform-provider-google/blob/main/google/services/cloudfunctions2/resource_cloudfunctions2_function.go#L36C2-L48C14

it is in both 5.38 and 5,39

daniel-cit commented 1 month ago

Lets try to remove the LOG_EXECUTION_ID that was added to see if it can pass the permadiff #1307

the version of the provider used in the build is one that includes fix that added in 27 days ago https://github.com/hashicorp/terraform-provider-google/blob/main/google/services/cloudfunctions2/resource_cloudfunctions2_function.go#L36C2-L48C14

it is in both 5.38 and 5,39

Failed even after removing the LOG_EXECUTION_ID

Error:          Received unexpected error:
                FatalError{Underlying: error while running command: exit status 1; 
                Error: Provider produced inconsistent final plan

                When expanding the plan for
                module.cai_monitoring.module.cloud_function.google_cloudfunctions2_function.function
                to include new values learned so far during apply, provider
                "registry.terraform.io/hashicorp/google" produced an invalid new value for
                .service_config[0].environment_variables: was null, but now
                cty.MapVal(map[string]cty.Value{
                   "ROLES":cty.StringVal("roles/owner,roles/editor,roles/resourcemanager.organizationAdmin,roles/compute.networkAdmin,roles/compute.orgFirewallPolicyAdmin"),
                   "SOURCE_ID":cty.StringVal("organizations/REDACTED/sources/2027208248522035993")
                }).

                This is a bug in the provider, which should be reported in the provider's own
                issue tracker.}

I will take a look at the provider and try to see what is the issue with the DiffSuppressFunc

daniel-cit commented 1 month ago

@eeaton @apeabody

Foundation code originally had two entries in the runtime env variables, ROLES and SOURCE_ID

  service_config = {
    service_account_email = google_service_account.cloudfunction.email
    runtime_env_variables = {
      ROLES     = join(",", var.roles_to_monitor)
      SOURCE_ID = google_scc_source.cai_monitoring.id
    }
  }

Using the code from the latest version of the provider with the permadiff fix we have

func environmentVariablesDiffSuppress(k, old, new string, d *schema.ResourceData) bool {
    if k == "service_config.0.environment_variables.LOG_EXECUTION_ID" && new == "" {
        return true
    }

    // Let diff be determined by environment_variables (above)
    if strings.HasPrefix(k, "service_config.0.environment_variables.%") {
        return true
    }

    // For other keys, don't suppress diff.
    return false
}

This part if k == "service_config.0.environment_variables.LOG_EXECUTION_ID" && new == "" is skipping comparison for the key LOG_EXECUTION_ID on the map

This part if strings.HasPrefix(k, "service_config.0.environment_variables.%") is ignoring diferences in the size of the old and new environment variables map

when we run Terraform plan we get

# module.cai_monitoring.module.cloud_function.google_cloudfunctions2_function.function will be created
  + resource "google_cloudfunctions2_function" "function" {
...

      + service_config {
          + all_traffic_on_latest_revision   = true
          + available_cpu                    = "1"
          + available_memory                 = "256M"
          + gcf_uri                          = (known after apply)
          + ingress_settings                 = "ALLOW_ALL"
          + max_instance_count               = 100
          + max_instance_request_concurrency = (known after apply)
          + min_instance_count               = 1
          + service                          = (known after apply)
          + service_account_email            = (known after apply)
          + timeout_seconds                  = 60
          + uri                              = (known after apply)
        }
    }

and the service_config map is missing the evironment_variablesmap

if we remove the fix we get:

  + resource "google_cloudfunctions2_function" "function" {
...

      + service_config {
          + all_traffic_on_latest_revision   = true
          + available_cpu                    = "1"
          + available_memory                 = "256M"
          + environment_variables            = (known after apply) <=============
          + gcf_uri                          = (known after apply)
          + ingress_settings                 = "ALLOW_ALL"
          + max_instance_count               = 100
          + max_instance_request_concurrency = (known after apply)
          + min_instance_count               = 1
          + service                          = (known after apply)
          + service_account_email            = (known after apply)
          + timeout_seconds                  = 60
          + uri                              = (known after apply)
        }
    }

And the environment_variables is back at the plan for the configuration. It has (known after apply) because the values of the map depend on resources that have not been created yet

Adding a log output in the code we can see the state of the variables

[DEBUG] provider.terraform-provider-google: 2024/07/30 23:56:19 [DEBUG] DiffSuppress  k:"service_config.0.environment_variables.%" old:"" new:""

service_config.0.environment_variables.% has the length of the Map and old and new values are empty because they do not exist yet.

somehow this code is saying that on the plan case there is no diff between not having a map and the new map.

changing the condition

from

if strings.HasPrefix(k, "service_config.0.environment_variables.%")

to

if strings.HasPrefix(k, "service_config.0.environment_variables.%") && new != ""

seems to fix the issue, and plan, apply, apply with changes, and destroy work as expected.

The same strategy of ignoring the size is used in other places to deal with labels that have key value pair added by the API when returning results.

I would like to pinpoint why this code snippet is failing in this case but seems to work in other cases.

daniel-cit commented 1 month ago

there is an open issue in the provider for this https://github.com/hashicorp/terraform-provider-google/issues/18747

eeaton commented 1 month ago

Thanks for the investigation and detailed writeup on the provider issue, Daniel. I've also escalated this on the internal bug with your assessment.

As a quick fix to unblock this repo, I'm trying PR #1311 to pin the provider version used in the cai_monitoring module to the last good version

daniel-cit commented 1 month ago

PR opened in magic-modules https://github.com/GoogleCloudPlatform/magic-modules/pull/11333