hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0

Adding `google_compute_resource_policy` to existing instance fails with wrong service account #17260

Open rhoriguchi opened 7 months ago

rhoriguchi commented 7 months ago

Community Note

Terraform Version

Terraform v1.7.2-dev
on darwin_arm64

Your version of Terraform is out of date! The latest version
is 1.7.3. You can update by downloading from https://www.terraform.io/downloads.html

Affected Resource(s)

Terraform Configuration

...
    "google_compute_instance": {
      "vm_name": {
        ...
        "resource_policies": [
          "${google_compute_resource_policy.vm_name-scheduling-policy.id}"
        ],
        ...
      }
    },
    "google_compute_resource_policy": {
      "vm_name-scheduling-policy": {
        "//": {
          "metadata": {
            "path": "project/vm_name-service/vm_name-scheduling-policy",
            "uniqueId": "vm_name-scheduling-policy"
          }
        },
        "instance_schedule_policy": {
          "time_zone": "Europe/Zurich",
          "vm_start_schedule": {
            "schedule": "0 7 * * MON-FRI"
          },
          "vm_stop_schedule": {
            "schedule": "0 19 * * *"
          }
        },
        "name": "vm_name-scheduling-policy",
        "region": "europe-west6"
      }
    },
...

Debug Output

Error: Error adding resource policies: googleapi: Error 412: Compute Engine System service account service-XXXXXXXXX@compute-system.iam.gserviceaccount.com needs to have [compute.instances.start,compute.instances.stop] permissions applied in order to perform this operation., conditionNotMet

  with google_compute_instance.vm_name (vm_name-service/vm_name),
  on cdk.tf.json line 314, in resource.google_compute_instance.vm_name (vm_name-service/vm_name):
 314:       }

Expected Behavior

The resource policy is added to the existing compute instance.

Actual Behavior

Adding the resource policy to the existing compute instance fails. The issue is that the operation uses the project's default compute service account (service-XXXXXXXXX@compute-system.iam.gserviceaccount.com), even though I execute the plan locally with a custom service account. Everything else is executed with the custom service account (not on a compute instance). Why is the default compute service account used? I'm aware that adding a policy recreates the instance.

Steps to reproduce

  1. terraform apply

Important Factoids

No response

References

https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_resource_policy

b/328763035

edwardmedia commented 7 months ago

@rhoriguchi can you share the complete config with before and after the update as well as your debug log?

The error complains about permissions. Did you check whether that account has the mentioned permissions? Reading the error, it seems this service account is used to stop and restart the instance when certain config changes are applied.

googleapi: Error 412: Compute Engine System service account service-XXXXXXXXX@compute-system.iam.gserviceaccount.com needs to have [compute.instances.start,compute.instances.stop] permissions applied in order to perform this operation., conditionNotMet

The error may not be limited to the change mentioned in the subject. You may see the same error for other changes that require a machine reboot.

rhoriguchi commented 7 months ago

Sure thing @edwardmedia, I've created a test deployment. I've tried several combinations of it in different projects. Same behavior when trying to attach it to an existing VM. The service account used for the deployment has Compute Instance Admin (v1) as mentioned in the docs. The service account mentioned in the error is not the one used for the deployment, but the compute engine default service account.

Log File

TypeScript

```ts
import { ComputeInstance } from '@cdktf/provider-google/lib/compute-instance';
import { ComputeResourcePolicy } from '@cdktf/provider-google/lib/compute-resource-policy';
import { Construct } from 'constructs';
import { GoogleProvider } from '@cdktf/provider-google/lib/provider';
import { TerraformStack } from 'cdktf';

export class TestDeployment extends TerraformStack {
  constructor(scope: Construct) {
    super(scope, 'test-deployment');

    new GoogleProvider(this, 'google', {
      credentials: process.env.GCP_SERVICE_ACCOUNT_CREDENTIALS,
      project: 'SOME-PROJECT',
    });

    const resourcePolicy = new ComputeResourcePolicy(this, 'test-policy', {
      name: 'test-policy',
      region: 'europe-west6',
      instanceSchedulePolicy: {
        timeZone: 'Europe/Zurich',
        vmStartSchedule: {
          schedule: '0 7 * * MON-FRI',
        },
        vmStopSchedule: {
          schedule: '0 19 * * *',
        },
      },
    });

    new ComputeInstance(this, 'test-vm', {
      name: 'test-vm',
      machineType: 'n2-standard-4',
      zone: 'europe-west6-a',
      bootDisk: {
        initializeParams: {
          image: 'debian-cloud/debian-11',
        },
      },
      networkInterface: [
        {
          network: 'default',
        },
      ],
      resourcePolicies: [resourcePolicy.id],
    });
  }
}
```

Terraform (synthesized cdk.tf.json)

```json
{
  "//": {
    "metadata": {
      "backend": "local",
      "stackName": "test-deployment",
      "version": "0.20.2"
    },
    "outputs": {}
  },
  "provider": {
    "google": [
      {
        "credentials": "XXXXXXXXXXXXX",
        "project": "SOME-PROJECT"
      }
    ]
  },
  "resource": {
    "google_compute_instance": {
      "test-vm": {
        "//": {
          "metadata": {
            "path": "test-deployment/test-vm",
            "uniqueId": "test-vm"
          }
        },
        "boot_disk": {
          "initialize_params": {
            "image": "debian-cloud/debian-11"
          }
        },
        "machine_type": "n2-standard-4",
        "name": "test-vm",
        "network_interface": [
          {
            "network": "default"
          }
        ],
        "resource_policies": [
          "${google_compute_resource_policy.test-policy.id}"
        ],
        "zone": "europe-west6-a"
      }
    },
    "google_compute_resource_policy": {
      "test-policy": {
        "//": {
          "metadata": {
            "path": "test-deployment/test-policy",
            "uniqueId": "test-policy"
          }
        },
        "instance_schedule_policy": {
          "time_zone": "Europe/Zurich",
          "vm_start_schedule": {
            "schedule": "0 7 * * MON-FRI"
          },
          "vm_stop_schedule": {
            "schedule": "0 19 * * *"
          }
        },
        "name": "test-policy",
        "region": "europe-west6"
      }
    }
  },
  "terraform": {
    "backend": {
      "local": {
        "path": "/PATH/terraform.test-deployment.tfstate"
      }
    },
    "required_providers": {
      "google": {
        "source": "google",
        "version": "5.13.0"
      }
    }
  }
}
```

edwardmedia commented 7 months ago

@rhoriguchi how many GCP projects are involved in your deployment? The account below is the Compute Engine Service Agent, which was created when you enabled the Compute Engine API on project 224845064652. Do the target resources reside in the same project?

service-224845064652@compute-system.iam.gserviceaccount.com

If not, what is the relationship among the projects? When working across different projects, you need to set up the proper IAM bindings between them.

rhoriguchi commented 6 months ago

We are using a service principal from another project that has Editor permissions on the project we are deploying to. Everything can be deployed with no issues.

However when adding a resource policy to an instance it tries to use the Compute Engine default service account (so the service account GCP creates by default) to restart the instance instead of the service principal we are using to deploy the resources.

So the only way to fix the issue currently would be granting this service account VM restart permissions, which we do not want for normal operations outside of the deployment. Why isn't the service principal used for the Terraform deployment also used when adding a resource policy to an instance?
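
For readers weighing that workaround, here is a minimal sketch of what granting only the two permissions named in the error might look like in HCL. It assumes a custom role is acceptable in the target project; the resource labels, `role_id`, and `title` are hypothetical, and the service agent address follows the `service-PROJECT_NUMBER@compute-system.iam.gserviceaccount.com` form quoted in the error.

```hcl
# Sketch only: resource labels, role_id, and title are hypothetical names.
data "google_project" "target" {
  project_id = "SOME-PROJECT"
}

# Custom role limited to the two permissions listed in the 412 error.
resource "google_project_iam_custom_role" "instance_scheduler" {
  project = data.google_project.target.project_id
  role_id = "instanceScheduler"
  title   = "Instance Scheduler"
  permissions = [
    "compute.instances.start",
    "compute.instances.stop",
  ]
}

# Bind the role to the account named in the error message, whose address
# is built from the project number.
resource "google_project_iam_member" "service_agent_scheduler" {
  project = data.google_project.target.project_id
  role    = google_project_iam_custom_role.instance_scheduler.name
  member  = "serviceAccount:service-${data.google_project.target.number}@compute-system.iam.gserviceaccount.com"
}
```

If the instance lives in the same configuration, adding `depends_on = [google_project_iam_member.service_agent_scheduler]` to it would likely be needed so the binding exists before the policy is attached.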

melinath commented 6 months ago

I would guess that what's happening here is that, under the hood, the Compute API is trying to use its default service account for the project (regardless of who the authenticated user is). You could likely confirm this by using gcloud to make the same POST request that Terraform is failing on; it should fail in the same way.

And if it doesn't, that would give more information about what is causing the failure.

ggtisc commented 6 months ago

@rhoriguchi As I understand it, you want to add a resource policy to an existing compute instance without using the 'compute engine default service account'.

You should check that your ADC configuration is correct according to https://registry.terraform.io/providers/hashicorp/google/latest/docs/guides/provider_reference (Authentication section) to ensure that you are using the correct service account and not the default one; if no account is declared there, the default is used.

Or specify it directly in the provider block like this:

provider "google" {
  credentials = file("/path/to/your/keyfile.json")
  project     = "your-project-id"
  region      = "your-region"
  zone        = "your-zone"
  service_account_email = "your-service-account@your-project.iam.gserviceaccount.com"
}

rhoriguchi commented 6 months ago

> I would guess that what's happening here is that, under the hood, the Compute API is trying to use its default service account for the project (regardless of who the authenticated user is). You could likely confirm this by using gcloud to make the same POST request that Terraform is failing on; it should fail in the same way.
>
> And if it doesn't, that would give more information about what is causing the failure.

I've tried to reproduce it and I'm getting exactly the same response using the API.

> curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token --impersonate-service-account=XXXXXXXXX@SOME-PROJECT.iam.gserviceaccount.com)" \
    -H 'Content-Type: application/json; charset=utf-8' \
    -d '{  "canIpForward": false,  "deletionProtection": false,  "disks": [   {    "autoDelete": true,    "boot": true,    "initializeParams": {     "sourceImage": "projects/debian-cloud/global/images/family/debian-11"    },    "mode": "READ_WRITE"   }  ],  "machineType": "projects/SOME-PROJECT/zones/europe-west6-a/machineTypes/n2-standard-4",  "metadata": {},  "name": "test-vm",  "networkInterfaces": [   {    "network": "projects/SOME-PROJECT/global/networks/default"   }  ],  "params": {},  "resourcePolicies": [   "projects/SOME-PROJECT/regions/europe-west6/resourcePolicies/test-policy"  ],  "scheduling": {   "automaticRestart": true  },  "tags": {} }' \
    'https://compute.googleapis.com/compute/v1/projects/SOME-PROJECT/zones/europe-west6-a/instances?alt=json&prettyPrint=false'

WARNING: This command is using service account impersonation. All API calls will be executed as [XXXXXXXXX@SOME-PROJECT.iam.gserviceaccount.com].
{
  "error": {
    "code": 412,
    "message": "Compute Engine System service account service-YYYYYYYYY@compute-system.iam.gserviceaccount.com needs to have [compute.instances.start,compute.instances.stop] permissions applied in order to perform this operation.",
    "errors": [
      {
        "message": "Compute Engine System service account service-YYYYYYYYY@compute-system.iam.gserviceaccount.com needs to have [compute.instances.start,compute.instances.stop] permissions applied in order to perform this operation.",
        "domain": "global",
        "reason": "conditionNotMet",
        "location": "If-Match",
        "locationType": "header"
      }
    ]
  }
}

EDIT: Update anonymization of service account name to make it clearer

melinath commented 6 months ago

@rhoriguchi it looks like in that example, you're impersonating the compute service account - the behavior I was speculating about was whether, if you are impersonating service-account@SOME-PROJECT.iam.gserviceaccount.com (like when you were using Terraform), you still get an error message about the compute service account. Could you try that & report the results?

rhoriguchi commented 6 months ago

@melinath sorry about that. While anonymizing the output, I didn't keep the two accounts distinct. Please take a look at my previous comment, which I've updated: https://github.com/hashicorp/terraform-provider-google/issues/17260#issuecomment-1978571105

melinath commented 6 months ago

Thanks! In that case, this doesn't seem to be a bug in the Terraform provider, just a thing about how the API works.

roaks3 commented 6 months ago

Considering this a feature request for the service team to review. It seems that the provider is working as expected, but the configured service account is not used, which can cause unexpected behavior when working across multiple projects.

timwsuqld commented 6 months ago

After spending 30 minutes with this same issue of permissions, it became clear to me that the google_compute_default_service_account is not the SA that we actually need, but the "Compute Engine Service Agent" which has the form of service-PROJECT_NUMBER@compute-system.iam.gserviceaccount.com

Ideally we need a data.google_compute_engine_service_agent data source to get the right service account. Because its name sounds so much like the "Compute Default Service Account", this is likely to cause confusion. (Thanks Google).
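
Until such a data source exists, the distinction can be sketched in HCL as follows. This is only an illustration: the labels `target` and `default` and the local and output names are hypothetical, and the service agent address is built statically from the project number using the form given above.

```hcl
# Sketch only: data source labels, local names, and output names are hypothetical.
data "google_project" "target" {
  project_id = "SOME-PROJECT"
}

# The existing data source returns the Compute Engine default service account,
# which is NOT the account named in the 412 error.
data "google_compute_default_service_account" "default" {
  project = data.google_project.target.project_id
}

locals {
  # Compute Engine Service Agent, assembled by hand from the project number.
  compute_service_agent_email = "service-${data.google_project.target.number}@compute-system.iam.gserviceaccount.com"
}

output "compute_default_service_account" {
  value = data.google_compute_default_service_account.default.email
}

output "compute_service_agent" {
  value = local.compute_service_agent_email
}
```

Comparing the two outputs after an apply makes it easy to see which account an error message refers to.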

hervedevos commented 5 months ago

> After spending 30 minutes with this same issue of permissions, it became clear to me that the google_compute_default_service_account is not the SA that we actually need, but the "Compute Engine Service Agent" which has the form of service-PROJECT_NUMBER@compute-system.iam.gserviceaccount.com
>
> Ideally we need a data.google_compute_engine_service_agent source to get the right service account, especially as it sounds so much like the "Compute Default Service Account", this is likely to cause confusion. (Thanks Google).

I encountered the same problem and came to the same conclusion. Agreed that this is quite unclear.

pspot2 commented 3 months ago

> After spending 30 minutes with this same issue of permissions, it became clear to me that the google_compute_default_service_account is not the SA that we actually need, but the "Compute Engine Service Agent" which has the form of service-PROJECT_NUMBER@compute-system.iam.gserviceaccount.com
>
> Ideally we need a data.google_compute_engine_service_agent source to get the right service account, especially as it sounds so much like the "Compute Default Service Account", this is likely to cause confusion. (Thanks Google).

I also ran into this issue while trying to launch a GCE instance by resizing the respective MIG. The MIG uses an instance template that applies a KMS CMEK from a different project for encrypting the boot disk.

The instance launch fails with a KMS permission-related error message, which is completely misleading: while Cloud Logging says that the principal <project_number>@cloudservices.gserviceaccount.com is missing KMS key permissions, granting those permissions to that principal changes nothing. Also granting those permissions to the current "Compute Default Service Account" (e.g. obtained by the current data source) changes nothing. The problem is resolved by granting permissions to service-PROJECT_NUMBER@compute-system.iam.gserviceaccount.com.
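
A hedged sketch of that last step in HCL is shown below. The comment above does not name the role that was granted, so `roles/cloudkms.cryptoKeyEncrypterDecrypter` is an assumption here, as are the variable name and resource labels.

```hcl
# Sketch only: the variable name, resource labels, and the chosen role are
# assumptions; the comment above does not say which role was granted.
variable "kms_crypto_key_id" {
  type        = string
  description = "Full ID of the CMEK key that lives in the other project."
}

# Project that hosts the MIG and its instances.
data "google_project" "compute" {
  project_id = "SOME-PROJECT"
}

# Grant key usage to the Compute Engine Service Agent of the compute project,
# not to the default service account or the cloudservices account.
resource "google_kms_crypto_key_iam_member" "compute_agent_cmek" {
  crypto_key_id = var.kms_crypto_key_id
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "serviceAccount:service-${data.google_project.compute.number}@compute-system.iam.gserviceaccount.com"
}
```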

All 3 types of default service accounts are described here: click. In a nutshell, there is:

Now, while it is perfectly possible to statically construct the service agent string, a shortcut (something like data.google_compute_default_service_agent) would, of course, be much nicer.

Additional info: click