pulumi / pulumi-gcp

A Google Cloud Platform (GCP) Pulumi resource package, providing multi-language access to GCP
Apache License 2.0

GKE cluster gets created with default service account even though I specified a different one #2124

Open solomonshorser opened 1 month ago

solomonshorser commented 1 month ago

Describe what happened

I tried to specify a non-default service account for a GKE cluster, but when the cluster creation process finished, I noticed that the service account was "default".

Sample program

this.cluster = new gcp.container.Cluster("my_cluster", {
            enableAutopilot: true,
            deletionProtection: false,
            minMasterVersion: config.jobRunner.clusterVersion,
            location: config.region,
            project: config.commonProject.id,
            name: config.jobRunner.cluster.name,
            network: config.airflowVpcNetworkId,
            subnetwork: config.airflowSubnetId,
            privateClusterConfig: {
                enablePrivateEndpoint: false,
                enablePrivateNodes: true,
                masterIpv4CidrBlock: config.jobRunner.masterIpv4CidrBlock,
            },
            ipAllocationPolicy: {
                clusterSecondaryRangeName: config.jobRunner.clusterSecondaryRangeName,
                servicesSecondaryRangeName: config.jobRunner.servicesSecondaryRangeName,
            },
            nodeConfig: {
                serviceAccount: config.jobRunner.serviceAccount,
            },
            releaseChannel: {
                channel: 'STABLE',
            },
        }, {
            dependsOn: dependencies,
            ignoreChanges: ['verticalPodAutoscaling'] 
        });

Log output

When I run pulumi preview --diff, I see the following in the preview diff, even though I haven't modified the service account value:

    +-gcp:container/cluster:Cluster: (replace)
        [id=...]
        [urn=...]
        [provider=...::pulumi:providers:gcp::default_7_29_0::7b8406f6-4f36-49f1-b75d-51285f41ca15]
      ~ nodeConfig: {
          - reservationAffinity: {
              - consumeReservationType: "NO_RESERVATION"
              - key                   : ""
              - values                : []
            }
          - reservationAffinity: {
              - consumeReservationType: "NO_RESERVATION"
              - key                   : ""
              - values                : []
            }
          ~ serviceAccount     : "default" => "my-service-account@gcp-project.iam.gserviceaccount.com"
        }

Affected Resource(s)

No response

Output of pulumi about

CLI
Version      3.120.0
Go Version   go1.22.4
Go Compiler  gc

Plugins
KIND      NAME           VERSION
resource  command        0.9.2
resource  gcp            7.29.0
resource  gcp            5.26.0
resource  google-native  0.32.0
resource  google-native  0.26.1
resource  kubernetes     4.13.1
resource  kubernetes     3.30.2
resource  kubernetes     3.30.2
resource  kubernetes     3.30.2
language  nodejs         unknown
resource  random         4.2.0

Host
OS       darwin
Version  13.6.7
Arch     x86_64

This project is written in nodejs: executable='/usr/local/bin/node' version='v20.3.0'

Dependencies:
NAME                    VERSION
@pulumi/command         0.9.2
@pulumi/kubernetes      4.13.1
@pulumi/pulumi          3.95.0
@pulumi/random          4.2.0
@types/node             10.17.60
@pulumi/gcp             7.29.0
@pulumi/google-native   0.32.0

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

guineveresaenger commented 4 weeks ago

Hi @solomonshorser - thank you for filing this issue.

Is it possible your configured service account does not have the correct permissions to create a cluster? This might be related to account issues discussed in https://github.com/hashicorp/terraform-provider-google/issues/17252. Is this a new pulumi program, or are you attempting to update an existing stack?

Would it be possible for you to send us code for a minimal complete reproduction of this bug? It would help us address this issue much faster.

As well, if you could supply us with the output of PULUMI_DEBUG_GRPC=logs.jsonl pulumi up, that would be great.

solomonshorser commented 3 weeks ago

Hi @guineveresaenger, can you please clarify which service account needs to have permission to create a cluster? The account that runs pulumi has permission, because the cluster does get created, but it gets created with the default service account despite my trying to specify a different one.

OR... is it the one I specify here?

            nodeConfig: {
                serviceAccount: config.jobRunner.serviceAccount,
            },

Because the GCP default service account doesn't have permission to create clusters, I'm not sure why this one would need it. Also, we used non-default service accounts for GCP Composer and had no issues there with this sort of permission.

I can change permissions, but in this particular project I don't have IAM access, so I will need to submit a request to do this and justify giving this service account permission to create clusters. Is there a document that you know of which I could reference in such a request?

It's possible that this is related to the terraform issue you linked...

The stack is an older stack that I tried to update: changing the cluster's service account requires the old cluster to be deleted and a new one created as a replacement.

I'll try to get a simple example program with debug output to you soon.

guineveresaenger commented 3 weeks ago

Hi @solomonshorser - I was talking about the config.jobRunner.serviceAccount you're specifying. It sounds like if that service account (my-service-account@gcp-project.iam.gserviceaccount.com in your example) doesn't have the right permissions, there may be some sort of buggy fallback behavior where you still get the cluster, but it's configured with the default service account.

It does seem like you're running into ~the same~ a similar problem as https://github.com/hashicorp/terraform-provider-google/issues/17252. An example program would be great - thank you!

solomonshorser commented 3 weeks ago

@guineveresaenger That behaviour sounds very buggy! It does not have cluster-creation permission, and it should not have that permission. I might try granting it temporarily, just to see if the process works, though that solution won't be allowed in non-sandbox projects.

When I get a chance, I'll write an example program, but honestly, I don't think it will have much more than the snippet above (except hard-coded values instead of values from config).

solomonshorser commented 3 weeks ago

I modified my config to look like this:

new gcp.container.Cluster(name,
  {
    enableAutopilot: true,
    // ...
    nodeConfig: {
      serviceAccount: gkeNodeServiceAccount.email
    },
    clusterAutoscaling: {
      autoProvisioningDefaults: {
        serviceAccount: gkeNodeServiceAccount.email
      }
    }
  }
);

based on a suggestion here: https://github.com/pulumi/pulumi/discussions/15902#discussioncomment-9976698. So far, it looks like it might have worked! I will report back later to confirm.

VenelinMartinov commented 2 weeks ago

Thanks for reporting back here @solomonshorser!

For anyone else who hits this: it looks like the serviceAccount parameter on the cluster resource needs the service account email, not the ID, to work properly.

If not specified correctly the provider falls back to the default service account.
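
To illustrate the distinction, here is a hedged sketch (not code from this thread; the account name and variable names are placeholders): the email output of a gcp.serviceaccount.Account is what the cluster's serviceAccount fields expect, not the account's fully qualified resource ID.

import * as gcp from "@pulumi/gcp";

// Hypothetical node service account; the name "gke-node-sa" is a placeholder.
const gkeNodeServiceAccount = new gcp.serviceaccount.Account("gke-node-sa", {
    accountId: "gke-node-sa",
    displayName: "GKE node service account",
});

// Use the email output (e.g. gke-node-sa@my-project.iam.gserviceaccount.com)
// when configuring the cluster, not gkeNodeServiceAccount.id, which is the
// fully qualified projects/.../serviceAccounts/... resource ID.
export const nodeServiceAccountEmail = gkeNodeServiceAccount.email;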

solomonshorser commented 2 weeks ago

OK, I ran it again today in a better environment for this test, and the above code snippet seems to work. The difference wasn't that I used the email address instead of the ID (I was already doing that long ago). The difference was including:

clusterAutoscaling: {
      autoProvisioningDefaults: {
        serviceAccount: gkeNodeServiceAccount.email
      }
    }

Previously, we did not include any special config under clusterAutoscaling.

It seems strange to specify the service account twice. I'm not 100% sure whether I still need to specify the service account under nodeConfig now that it's under clusterAutoscaling, but I don't have time to test that permutation right now.

VenelinMartinov commented 2 weeks ago

OK, this looks like an upstream issue with the GCP API. @solomonshorser has found the correct workaround suggested by the Google team.

TF Issue: https://github.com/hashicorp/terraform-provider-google/issues/9505

GCP: https://issuetracker.google.com/issues/219237911?pli=1#comment3

It seems like the upstream issues were closed incorrectly, but people are still reporting problems with this.

The issue is specific to Autopilot clusters.

For anyone affected, the workaround, as @solomonshorser correctly found (thank you!), is to specify a service account in the autoProvisioningDefaults for the cluster:

clusterAutoscaling: {
  autoProvisioningDefaults: {
    serviceAccount: gkeNodeServiceAccount.email
  }
}
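
For completeness, a minimal, self-contained sketch of the workaround in context (the project, location, and resource names are placeholder assumptions, not values from the original program):

import * as gcp from "@pulumi/gcp";

// Placeholder node service account; any existing account's email works the same way.
const gkeNodeServiceAccount = new gcp.serviceaccount.Account("gke-node-sa", {
    accountId: "gke-node-sa",
});

const cluster = new gcp.container.Cluster("my-cluster", {
    enableAutopilot: true,
    location: "us-central1",          // placeholder region
    deletionProtection: false,
    nodeConfig: {
        serviceAccount: gkeNodeServiceAccount.email,
    },
    // The workaround: Autopilot creates nodes via node auto-provisioning, so the
    // service account must also be set in autoProvisioningDefaults.
    clusterAutoscaling: {
        autoProvisioningDefaults: {
            serviceAccount: gkeNodeServiceAccount.email,
        },
    },
});

As noted in the comments above, it is not confirmed whether the nodeConfig.serviceAccount entry is still required once autoProvisioningDefaults is set; the sketch keeps both to mirror the reporter's working configuration.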

solomonshorser commented 2 weeks ago

Thanks, but real credit goes to @hayesgm, who posted the suggestion here: https://github.com/pulumi/pulumi/discussions/15902#discussioncomment-9976698 ;)