terraform-google-modules / terraform-example-foundation

Shows how the CFT modules can be composed to build a secure cloud foundation
https://cloud.google.com/architecture/security-foundations
Apache License 2.0

One or more users named in the policy do not belong to a permitted customer (step 0-bootstrap) #1272

Closed lpezet closed 1 week ago

lpezet commented 2 weeks ago

TL;DR

I'm going through https://github.com/terraform-google-modules/terraform-example-foundation/blob/master/0-bootstrap/README-GitHub.md. When running either step 21 or step 31 (if letting the pipeline create the groups), the following error can occur, and did for me. Values are obfuscated, using example.com and a fake org ID:

Error: Error applying IAM policy for organization "1234567890": Error setting IAM policy for organization "1234567890": googleapi: Error 400: One or more users named in the policy do not belong to a permitted customer.
Details:
[
  {
    "@type": "type.googleapis.com/google.rpc.PreconditionFailure",
    "violations": [
      {
        "description": "User gcp-organization-admins@example.com is not in permitted organization.",
        "subject": "orgpolicy:organizations/1234567890?configvalue=gcp-organization-admins%example.com",
        "type": "constraints/iam.allowedPolicyMemberDomains"
      }
    ]
  }
]
, failedPrecondition

  with module.seed_bootstrap.google_organization_iam_member.org_admin_serviceusage_consumer[0],
  on .terraform/modules/seed_bootstrap/main.tf line 252, in resource "google_organization_iam_member" "org_admin_serviceusage_consumer":
 252: resource "google_organization_iam_member" "org_admin_serviceusage_consumer" {

Error: Error applying IAM policy for storage bucket "b/***": Error setting IAM policy for storage bucket "b/***": googleapi: Error 400: Group gcp-organization-admins@example.com does not exist., invalid

  with module.seed_bootstrap.google_storage_bucket_iam_member.orgadmins_state_iam[0],
  on .terraform/modules/seed_bootstrap/main.tf line 276, in resource "google_storage_bucket_iam_member" "orgadmins_state_iam":
 276: resource "google_storage_bucket_iam_member" "orgadmins_state_iam" {

Expected behavior

Running terraform apply only once.

Observed behavior

Going through https://github.com/terraform-google-modules/terraform-example-foundation/blob/master/0-bootstrap/README-GitHub.md, I had this issue at step 23 ("Run terraform apply"). I re-ran it and it went fine. I then encountered issue #1206 and, after applying the fix from https://github.com/terraform-google-modules/terraform-example-foundation/issues/1206#issuecomment-2082315445, step 31 ("The Pull request will trigger...") gave the same error.

Terraform Configuration

# terraform.tfvars
org_id = "1234567890" 
billing_account = "000000-000000-000000"
groups = {
  create_required_groups = true # Change to true to create the required_groups
  create_optional_groups = false # Change to true to create the optional_groups
  billing_project = "project-1234" # Fill to create required or optional groups
  required_groups = {
    group_org_admins     = "gcp-organization-admins@example.com"
    group_billing_admins = "gcp-billing-admins@example.com"
    billing_data_users   = "gcp-billing-data@example.com"
    audit_data_users     = "gcp-audit-data@example.com"
  }
}

default_region     = "us-central1"
default_region_2   = "us-west1"
default_region_gcs = "US"
default_region_kms = "us"

/* ----------------------------------------
    Specific to github_bootstrap
   ---------------------------------------- */
gh_repos = {
  owner        = "someone",
  bootstrap    = "example-bootstrap",
  organization = "example-org",
  environments = "example-envs",
  networks     = "example-nets",
  projects     = "example-projs",
}

Terraform Version

Terraform v1.8.3
on linux_amd64
+ provider registry.terraform.io/hashicorp/google v5.34.0
+ provider registry.terraform.io/hashicorp/google-beta v5.34.0
+ provider registry.terraform.io/hashicorp/null v3.2.2
+ provider registry.terraform.io/hashicorp/random v3.6.2
+ provider registry.terraform.io/hashicorp/time v0.11.2
+ provider registry.terraform.io/integrations/github v5.34.0

Additional information

I believe the fix (I'll propose one) is for module.seed_bootstrap to depend on module.required_group, so that the groups are created before terraform-google-modules/bootstrap/google (module.seed_bootstrap) applies its google_organization_iam_member resources.
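A minimal sketch of that idea, not the actual diff of any PR. The module names and the module source come from the error output and the description above. The file location and the elided arguments are assumptions.

```hcl
# Hypothetical excerpt of 0-bootstrap/main.tf; existing arguments are elided.
module "seed_bootstrap" {
  source = "terraform-google-modules/bootstrap/google"

  # ... existing arguments unchanged ...

  # Apply nothing in this module (for example
  # google_organization_iam_member.org_admin_serviceusage_consumer)
  # until every required group has been created.
  depends_on = [module.required_group]
}
```

Without an explicit dependency, Terraform is free to create the groups and the IAM bindings in parallel, because the group email addresses are plain strings from terraform.tfvars rather than references to the group resources.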

lpezet commented 2 weeks ago

Submitted PR #1273 to fix this issue.

eeaton commented 1 week ago

#1273 has been merged, but from reading the details of #1206 I'm not certain whether this solves the issue. (From #1206, it looks like there are inconsistent permissions depending on whether groups are created manually in the admin console or by the automation using service accounts.)

@lpezet can you please confirm whether you're still seeing the issue after this change?

lpezet commented 1 week ago

@eeaton The behavior I mentioned in #1273 was happening before and after implementing the fix from #1206. I'll re-run it as soon as I get the chance (been busy) but if I can confirm my fix does address the issue, I'd love to find a way to add that in the tests (is it possible to "delay"/slow down group creation before the seed project configuration?).

lpezet commented 1 week ago

@eeaton It's proving difficult to destroy everything 0-bootstrap created. I only provided the minimum (org_id, billing_account, groups object, default_region* and gh_repos information in terraform.tfvars) and I now realize I should have looked at bucket_tfstate_kms_force_destroy and bucket_force_destroy variables as well to make it possible to redo this whole process again and again (something I wanted to do from the beginning). Now running into issues like:

│ Error: error loading state: Failed to open state file at gs://bkt-prj-b-seed-tfstate-XXXX/terraform/bootstrap/state/default.tfstate: googleapi: got HTTP response code 403 with body: <?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Permission denied on Cloud KMS key. Please ensure that your Cloud Storage service account has been authorized to use this key.</Message></Error>

If you have any tips on what to specify/do at the beginning to be able to go through 0-bootstrap and then destroy everything cleanly to repeat, please let me know so I can use that next time.
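For reference, the two variables named above would be set alongside the other values in terraform.tfvars, roughly as follows. This is a sketch that assumes both are boolean inputs of 0-bootstrap defaulting to false. Whether they are sufficient for a clean teardown is not established in this thread.

```hcl
# terraform.tfvars (additions intended for throwaway test deployments only)
# Assumption: both variables are booleans that default to false.
bucket_force_destroy             = true
bucket_tfstate_kms_force_destroy = true
```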

eeaton commented 1 week ago

> I'd love to find a way to add that in the tests (is it possible to "delay"/slow down group creation before the seed project configuration?).

In general yes, and we do have a number of sleep timers and retry logic where resources aren't available to reference on GCP immediately after terraform apply commands. However, I subsequently added some details to #1206 that identifies the root cause as a permissions issue, so I don't think adding more sleep timers would make a difference here.

> │ Error: error loading state: Failed to open state file at gs://bkt-prj-b-seed-tfstate-XXXX/terraform/bootstrap/state/default.tfstate: googleapi: got HTTP response code 403 with body: <?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Permission denied on Cloud KMS key. Please ensure that your Cloud Storage service account has been authorized to use this key.</Message></Error>

From the error, you might have cryptoshredded yourself (deleting the encryption key makes resources completely inaccessible).

A few things to try:

bucket_tfstate_kms_force_destroy and bucket_force_destroy ... any tips

lpezet commented 1 week ago

> In general yes, and we do have a number of sleep timers and retry logic where resources aren't available to reference on GCP immediately after terraform apply commands. However, I subsequently added some details to #1206 that identifies the root cause as a permissions issue, so I don't think adding more sleep timers would make a difference here.

I meant it as a way to confirm this is an issue: add sleep timer(s) in the test (when creating the required groups) to see whether the seed_bootstrap module breaks with the error I experienced, thereby replicating my situation. This is a race condition in the end, isn't it? Then test with my fix to see whether it addresses the issue or not. That's what I meant. Sorry for the confusion.
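One way such a reproduction could be sketched, using the time_sleep resource from the hashicorp/time provider already listed above. This is untested. The resource name, the duration, and the group module source shown here are assumptions for illustration, and all existing arguments are elided.

```hcl
# Hypothetical test-only change: hold back group creation for a while.
resource "time_sleep" "delay_group_creation" {
  create_duration = "120s" # arbitrary value chosen for illustration
}

module "required_group" {
  source = "terraform-google-modules/group/google" # assumed module source

  # ... existing arguments unchanged ...

  # Groups are only created after the artificial delay has elapsed.
  depends_on = [time_sleep.delay_group_creation]
}
```

If module.seed_bootstrap has no dependency on module.required_group, its IAM bindings would be expected to be applied while the delay is still running and fail because the group does not exist yet. With the dependency in place, they should wait for the groups.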

I did cryptoshred myself, didn't I? lol Thanks for the tips.

eeaton commented 1 week ago

From the discussion in #1206, I don't think this is a race condition. It looks like different permissions are applied when the service account creates the groups (the service account automatically gets OWNER permission on the Cloud Identity resources) versus when a user manually creates the groups in the Cloud Identity admin console (the service account doesn't have any permissions for Cloud Identity, which manages permissions outside of Google Cloud IAM policies). I've made it a backlog item to improve the overall guidance to steer people away from this edge case in a future release.

I'll close this issue for now, but feel free to re-open if you disagree.

lpezet commented 1 week ago

@eeaton My bad. I was referring to this issue, #1272, and NOT #1206 all this time. When you said:

> #1273 has been merged, but from reading the details of #1206 I'm not certain whether this solves the issue. (...)
>
> @lpezet can you please confirm whether you're still seeing the issue after this change?

I thought you were asking whether fix #1273 addressed this issue (#1272), based on what was said in #1206. I can confirm fix #1273 worked for me. I would have liked to contribute a way to test it effectively, but I don't fully understand how the tests work and couldn't find anything relevant at first sight in test/integration/bootstrap/bootstrap_test.go.

eeaton commented 1 week ago

Got it, thanks for clarifying.

If you're interested, here's a codelab introducing the test framework used by this repo and others based off of CFT: https://codelabs.developers.google.com/cft-onboarding

For this particular repo, though, I think running all the tests locally is an unreasonable burden for contributors trying to make a small fix. (Even assuming everything goes smoothly, it takes multiple hours to deploy all the infra and run tests and tear down again)

When you raise a PR, all the tests run on the backend before it can be approved to merge. My practical rule of thumb for this enormous repo is to run the minimum locally: make docker_test_lint and make docker_generate_docs to catch obvious issues, then leave the detailed tests to the CI workflow triggered on a PR.