Open eeaton opened 1 month ago
Hi @apeabody, it looks like you made a recent change to the core-project-factory module https://github.com/terraform-google-modules/terraform-google-project-factory/commit/cfd7f3f15e0866fe09cc5ec4a2f8e94398c773d9 that might obviate the fix I'm working on.
I started working on a new PR to override the default behavior of the core-project-factory (replace default_service_account = "disable" with default_service_account = "keep"), so that it would not attempt to create the unsupported resource, but is the upstream change on the provider another way to fix this? Or is it an unrelated issue?
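A rough sketch of what that override could look like in a module call. The module label, project name, and the two variables are placeholders for illustration only; default_service_account is the one input under discussion.

```hcl
# Hypothetical inputs, declared only to keep the sketch self-contained.
variable "org_id" {
  type = string
}

variable "billing_account" {
  type = string
}

module "example_project" {
  source = "terraform-google-modules/project-factory/google"

  name            = "example-project" # placeholder name
  org_id          = var.org_id
  billing_account = var.billing_account

  # Leave the default service account alone instead of the module
  # default ("disable"), so no extra resource is managed for it.
  default_service_account = "keep"
}
```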
Hi @eeaton! - Likely. We often see a 409 error saying the default service account already exists when there was an earlier "Error: Provider produced inconsistent result after apply" during the creation of the project factory service account. The subsequent terraform retry then fails because the account does exist. If your example is this situation (and likely it is), the upstream change should resolve it.
Note: Here is the PR for the updated version: https://github.com/terraform-google-modules/terraform-example-foundation/pull/1221
Good news, thanks. I'm seeing quite a few of those 409 errors on terraform retry, so I'll prioritize getting 1221 merged and see if that helps reduce the errors.
TL;DR
Investigating the cause of flaky CI errors, I'm seeing a high rate of the following issues that are set in the project factory module but can be better addressed through organization policies:
However, it is not necessary to delete a default VPC if its creation is blocked by org policy. The provider docs recommend using the organization policy constraint instead of setting auto_create_network to false, as is done in the project factory.
The default behavior of the project factory is a bit nonintuitive. Because GCP creates a default network by default, the project factory module overrides this with auto_create_network = false. This setting enables the Compute API, queries it for the auto-created network, and then attempts to delete the default VPC, which can introduce issues with eventual consistency. Conversely, when auto_create_network = true, the project factory does not attempt to query the Compute API. If the org policy to prevent the default network is enforced and auto_create_network = true, we get the desired (if non-intuitive) behavior: no default VPC is created, and nothing immediately queries the Compute API at project creation.
Note that the provider docs also state this tf resource works on a best-effort basis, as no API formally describes the default service account resource, and it is only intended for use cases that can't use the org policy.
The foundation blueprint already sets these org policies, so I expect we can remove some of these flaky errors about eventual consistency by setting the org policies first and avoiding these steps.
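As a minimal sketch of the combination described above, assuming the org-level google_organization_policy resource is used for both constraints (the foundation blueprint may define them differently), with placeholder resource labels, project name, and variables:

```hcl
# Hypothetical inputs, declared only to keep the sketch self-contained.
variable "org_id" {
  type = string
}

variable "billing_account" {
  type = string
}

# Prevent the default network from being created in new projects.
resource "google_organization_policy" "skip_default_network" {
  org_id     = var.org_id
  constraint = "compute.skipDefaultNetworkCreation"

  boolean_policy {
    enforced = true
  }
}

# Stop automatic IAM grants to default service accounts.
resource "google_organization_policy" "no_default_sa_grants" {
  org_id     = var.org_id
  constraint = "iam.automaticIamGrantsForDefaultServiceAccounts"

  boolean_policy {
    enforced = true
  }
}

module "example_project" {
  source = "terraform-google-modules/project-factory/google"

  name            = "example-project" # placeholder name
  org_id          = var.org_id
  billing_account = var.billing_account

  # With the org policy enforced, no default VPC is created, and the
  # module does not query the Compute API to delete one.
  auto_create_network     = true
  default_service_account = "keep"
}
```

This only gives the intended result once the policies are actually in effect. If they are not yet enforced, auto_create_network = true would leave a default network in place.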
Terraform Resources
Projects that explicitly try to deprivilege the service account. After the org policy is enforced, this is no longer necessary. However, the org policy is created in stage 1-org and is eventually consistent, and some projects are also created in 1-org, so it's tricky to guarantee that the policy is actually enforced before projects are created.
terraform-google-project-factory module by default has auto_create_network set to false. In comparison, the google_project resource from Google provider defaults this to true. This means the project factory always attempts to enable the Compute Engine API, create the default network, then immediately delete it. This step is not necessary if the org policy is already in place.
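One possible way to reduce the ordering risk in 1-org, sketched under assumptions: make project creation depend on the org policy plus an explicit delay, using the time_sleep resource from the hashicorp/time provider. The 60s duration is an arbitrary placeholder, not a documented propagation time, and the labels and variables are hypothetical.

```hcl
# Hypothetical inputs, declared only to keep the sketch self-contained.
variable "org_id" {
  type = string
}

variable "billing_account" {
  type = string
}

resource "google_organization_policy" "skip_default_network" {
  org_id     = var.org_id
  constraint = "compute.skipDefaultNetworkCreation"

  boolean_policy {
    enforced = true
  }
}

# Wait after the policy is created, since enforcement is eventually
# consistent. The duration is a guess and offers no hard guarantee.
resource "time_sleep" "wait_for_org_policy" {
  create_duration = "60s"

  depends_on = [google_organization_policy.skip_default_network]
}

module "example_project" {
  source = "terraform-google-modules/project-factory/google"

  name            = "example-project" # placeholder name
  org_id          = var.org_id
  billing_account = var.billing_account

  auto_create_network = true

  # Create the project only after the policy and the delay.
  depends_on = [time_sleep.wait_for_org_policy]
}
```

A fixed delay narrows the race but does not remove it, so this is a mitigation rather than a guarantee that the policy is enforced before the project is created.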
Detailed design
The goal of removing the default VPC and deprivileging the default service account is already addressed by the org policies compute.skipDefaultNetworkCreation and iam.automaticIamGrantsForDefaultServiceAccounts in the 1-org step. After these policies are enforced, there is no need to explicitly delete the default VPC or disable the default service account; conversely, attempting to do these actions contributes to flaky failures when trying to reference APIs or resources whose state is eventually consistent.
Fixes:
Additional information
Sample error logs for #1
[...] Step #7 - "converge-org": Error: Received unexpected error:
Step #7 - "converge-org": FatalError{Underlying: error while running command: exit status 1;
Step #7 - "converge-org": Error: error creating project tyj-net-dns-oo3v (tyj-net-dns): googleapi: Error 409: Requested entity already exists, alreadyExists. If you received a 403 error, make sure you have the roles/resourcemanager.projectCreator permission
Step #7 - "converge-org":
Step #7 - "converge-org": with module.dns_hub.module.project-factory.google_project.main,
Step #7 - "converge-org": on .terraform/modules/dns_hub/modules/core_project_factory/main.tf line 73, in resource "google_project" "main":
Step #7 - "converge-org": 73: resource "google_project" "main" {
Step #7 - "converge-org":
Step #7 - "converge-org":
Step #7 - "converge-org": Error: Error creating service account: googleapi: Error 409: Service account project-service-account already exists within project projects/tyj-
Sample error logs for 2: