terraform-google-modules / terraform-example-foundation

Shows how the CFT modules can be composed to build a secure cloud foundation
https://cloud.google.com/architecture/security-foundations
Apache License 2.0
1.18k stars 701 forks source link

CI improvement: Remove explicit steps that are already addressed by org policies #1243

Open eeaton opened 1 month ago

eeaton commented 1 month ago

TL;DR

Investigating the cause of flaky CI errors, I'm seeing a high rate of the following issues that are set in the project factory module but can be better addressed through organization policies:

  1. 403 error when attempting to delete the default VPC, with an error that the Compute Engine API has not yet been enabled. At that point the enablement request has started, but may a long time to complete. This then triggers retry logic that leads to a different error, 409 project already exists when attempting to create the project again. (There's some inconsistent state between GCP and terraform).

However, it is not necessary to delete a default VPC if it is blocked by org policy. Provider docs state it is recommended to use the organisational policy constraint instead of setting auto_create_network to false, as is done in the project factory.

The default behavior of the project factory is a bit nonintuitive. Because the GCP platform creates a default network by default, the project factory module overrides this with auto_create_network = false. This behavior enables the Compute API, queries it for the auto-created network, and then attempts to delete the default VPC. However, it can introduce issues with eventual consistency. Conversely, when auto_create_network = true, the project factory does not attempt to query the Compute API. If the org policy to prevent the default network is enforced, and auto_created_network = true, we get the desired (if non-intuitive) behavior to not create a default VPC and not try to immediately query Compute API at project creation.

  1. 409 error about default service account does not exist occasionally appears on brand new projects. I suspect this is a similar issue, where the ability to reference the service account is eventually consistent, and there is a gap between GCP state and terraform state.

Note that the provider docs also state this tf resource is a best-effort basis, as no API formally describes the default service account resource and it is only intended for use cases that can't use the org policy.

The foundation blueprint already sets these org policies, so I expect we can remove some of these flaky errors about eventual consistency by setting the org policies first and avoiding these steps.

Terraform Resources

Projects that explicitly try to deprivilege the service account. After the org policy is enforced, this is no longer necessary. However, the org policy is created in stage 1-org and is eventually consistent, and some projects are also created in 1-org, so it's tricky to guarantee that the policy is actually enforced before projects are created.

terraform-google-project-factory module by default has auto_create_network set to false. In comparison, the google_project resource from Google provider defaults this to true. This means the project factory always attempts to enable the Compute Engine API, create the default network, then immediately delete it. This step is not necessary if the org policy is already in place.

Detailed design

The goal of removing the default VPC and deprivileging the default service account is already addressed by Org policies compute.skipDefaultNetworkCreation" and iam.automaticIamGrantsForDefaultServiceAccounts in 1-org step. After these policies are enforced, there is no need to explicitly delete the default VPC or disable the default service account; conversely, attempting to do these actions contributes to flaky failures when trying to reference APIs or resources whose state is eventually consistent.

Fixes:

Additional information

Sample error logs for #1

Step #7 - "converge-org": TestOrg 2024-05-21T15:19:05Z command.go:185: module.interconnect.module.project-factory.google_project_default_service_accounts.default_service_accounts[0]: Creating... Step #7 - "converge-org": TestOrg 2024-05-21T15:19:05Z command.go:185: module.base_network_hub[0].module.project-factory.google_project_default_service_accounts.default_service_accounts[0]: Creation complete after 0s [id=projects/tyj-net-hub-base-0pux] Step #7 - "converge-org": TestOrg 2024-05-21T15:19:05Z command.go:185: module.org_billing_logs.module.budget.data.google_project.project[0]: Reading... Step #7 - "converge-org": TestOrg 2024-05-21T15:19:05Z command.go:185: module.org_billing_logs.module.budget.data.google_project.project[0]: Read complete after 1s [id=projects/tyj-c-billing-logs-uglg] Step #7 - "converge-org": TestOrg 2024-05-21T15:19:05Z command.go:185: google_project_iam_member.billing_bq_viewer: Creating... Step #7 - "converge-org": TestOrg 2024-05-21T15:19:05Z command.go:185: module.interconnect.module.project-factory.google_project_default_service_accounts.default_service_accounts[0]: Creation complete after 1s [id=projects/tyj-net-interconnect-xy8h] [...]

[...] Step #7 - "converge-org": Error: Received unexpected error: Step #7 - "converge-org": FatalError{Underlying: error while running command: exit status 1; Step #7 - "converge-org": Error: error creating project tyj-net-dns-oo3v (tyj-net-dns): googleapi: Error 409: Requested entity already exists, alreadyExists. If you received a 403 error, make sure you have the roles/resourcemanager.projectCreator permission Step #7 - "converge-org":
Step #7 - "converge-org": with module.dns_hub.module.project-factory.google_project.main, Step #7 - "converge-org": on .terraform/modules/dns_hub/modules/core_project_factory/main.tf line 73, in resource "google_project" "main": Step #7 - "converge-org": 73: resource "google_project" "main" { Step #7 - "converge-org":
Step #7 - "converge-org":
Step #7 - "converge-org": Error: Error creating service account: googleapi: Error 409: Service account project-service-account already exists within project projects/tyj-

Step #7 - "converge-org": Error: Error deleting default network in project eyx-net-dns-rtqu: googleapi: Error 403: Compute Engine API has not been used in project eyx-net-dns-rtqu before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/compute.googleapis.com/overview?project=eyx-net-dns-rtqu then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry. [...] Step #7 - "converge-org": Error: error creating project eyx-net-dns-rtqu (eyx-net-dns): googleapi: Error 409: Requested entity already exists, alreadyExists. If you received a 403 error, make sure you have the roles/resourcemanager.projectCreator permission

Sample error logs for 2:

Step #7 - "converge-org": TestOrg 2024-05-20T06:08:01Z command.go:185: on .terraform/modules/base_restricted_environment_network.base_shared_vpc_host_project/modules/core_project_factory/main.tf line 145, in resource "google_service_account" "default_service_account": Step #7 - "converge-org": TestOrg 2024-05-20T06:08:01Z command.go:185: 145: resource "google_service_account" "default_service_account" { Step #7 - "converge-org": TestOrg 2024-05-20T06:08:01Z command.go:185: Step #7 - "converge-org": TestOrg 2024-05-20T06:08:02Z retry.go:99: Returning due to fatal error: FatalError{Underlying: error while running command: exit status 1; Step #7 - "converge-org": Error: Error creating service account: googleapi: Error 409: Service account project-service-account already exists within project projects/mkz-n-shared-base-xx88. Step #7 - "converge-org": Details: Step #7 - "converge-org": [ Step #7 - "converge-org": { Step #7 - "converge-org": "@type": "type.googleapis.com/google.rpc.ResourceInfo", Step #7 - "converge-org": "resourceName": "projects/mkz-n-shared-base-xx88/serviceAccounts/project-service-account@mkz-n-shared-base-xx88.iam.gserviceaccount.com" Step #7 - "converge-org": } Step #7 - "converge-org": ] Step #7 - "converge-org": , alreadyExists Step #7 - "converge-org": Step #7 - "converge-org": with module.base_restricted_environment_network["nonproduction"].module.base_shared_vpc_host_project.module.project-factory.google_service_account.default_service_account[0], Step #7 - "converge-org": on .terraform/modules/base_restricted_environment_network.base_shared_vpc_host_project/modules/core_project_factory/main.tf line 145, in resource "google_service_account" "default_service_account": Step #7 - "converge-org": 145: resource "google_service_account" "default_service_account" { Step #7 - "converge-org": } Step #7 - "converge-org": Test: TestOrg Step #7 - "converge-org": 2024/05/20 06:08:02 RUN_STAGE env var set to apply Step #7 - "converge-org": 2024/05/20 06:08:02 Skipping stage teardown Step #7 - "converge-org": --- FAIL: TestOrg (1015.64s) Step #7 - "converge-org": FAIL Step #7 - "converge-org": FAIL github.com/terraform-google-modules/terraform-example-foundation/test/integration/org 1015.695s Step #7 - "converge-org": FAIL

eeaton commented 1 month ago

Hi @apeabody , looks like you made a recent change on the core-project-factory module https://github.com/terraform-google-modules/terraform-google-project-factory/commit/cfd7f3f15e0866fe09cc5ec4a2f8e94398c773d9 that might obviate the fix I'm working on.

I started working on a new PR to override the default behavior of the core-project-factory (replace default_service_account = "disable" with default_service_account = "keep"), so that it would not attempt to create the unsupported resource, but is the upstream change on the provider another way to fix this? Or is it an unrelated issue?

apeabody commented 1 month ago

Hi @eeaton! - Likely. We often see 409 error about default service account does exist as there was an earlier Error: Provider produced inconsistent result after apply during the creation of the project factory service account, which then fails during the subsequent terraform retry as it does exist. If your example is this situation (and likely it is), the upstream change should resolve.

Note: Here is the PR for the updated version: https://github.com/terraform-google-modules/terraform-example-foundation/pull/1221

eeaton commented 1 month ago

Good news, thanks. I'm seeing quite a few of those 409 errors on terraform retry, so I'll prioritize getting 1221 merged and see if that helps reduce the errors.