kubecost / cluster-turndown

Automated turndown of Kubernetes clusters on specific schedules.
Apache License 2.0

Turndown fails on GKE due to empty zone string #52

Open michaelmdresser opened 2 years ago

michaelmdresser commented 2 years ago

Observed problem

Turndown fails to run on a GKE cluster. Logs from the user's environment:

I0706 21:00:43.033004       1 main.go:118] Running Kubecost Turndown on: REDACTED
I0706 21:00:43.059698       1 validator.go:41] Validating Provider...
I0706 21:00:43.061743       1 gkemetadata.go:92] [Error] metadata: GCE metadata "instance/attributes/kube-env" not defined
I0706 21:00:43.063220       1 gkemetadata.go:92] [Error] metadata: GCE metadata "instance/attributes/kube-env" not defined
I0706 21:00:43.063445       1 namedlogger.go:24] [GKEClusterProvider] Loading node pools for: [ProjectID: REDACTED, Zone: , ClusterID: REDACTED]
I0706 21:00:43.192046       1 validator.go:27] [Error]: Failed to load node groups: rpc error: code = InvalidArgument desc = Location "" does not exist.

Source of the error in code

This "Loading node pools" message, followed by the error, comes from here in the GKE provider: https://github.com/kubecost/cluster-turndown/blob/c74e3bbb004805d5821cc0381225f7b55009a58a/pkg/cluster/provider/gkeclusterprovider.go#L169-L183

The request being executed uses a path generator that fills in the empty zone string, causing the error: https://github.com/kubecost/cluster-turndown/blob/c74e3bbb004805d5821cc0381225f7b55009a58a/pkg/cluster/provider/gkeclusterprovider.go#L496-L500

We're using md.client.InstanceAttributeValue("kube-env") to get the GCP zone/location: https://github.com/kubecost/cluster-turndown/blob/c74e3bbb004805d5821cc0381225f7b55009a58a/pkg/cluster/provider/gkemetadata.go#L84-L94
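
To make the failure chain concrete, here is a minimal Go sketch of the flow described above. The helper names and the kube-env parsing are simplifications of mine, not the repo's actual identifiers, and it assumes the zone is stored under a ZONE key in kube-env:

```go
// Minimal sketch only: zoneFromKubeEnv and clusterPath are hypothetical names,
// and the kube-env parsing here is a simplification.
package main

import (
	"fmt"
	"log"
	"strings"

	"cloud.google.com/go/compute/metadata"
)

// zoneFromKubeEnv mirrors the failing lookup: read the kube-env instance
// attribute and pull a ZONE entry out of it. With metadata concealment
// enabled the attribute is not exposed, so this returns "" and an error.
func zoneFromKubeEnv() (string, error) {
	raw, err := metadata.InstanceAttributeValue("kube-env")
	if err != nil {
		// e.g. metadata: GCE metadata "instance/attributes/kube-env" not defined
		return "", err
	}
	for _, line := range strings.Split(raw, "\n") {
		if strings.HasPrefix(line, "ZONE:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "ZONE:")), nil
		}
	}
	return "", fmt.Errorf("ZONE not found in kube-env")
}

// clusterPath builds the GKE API resource path used when listing node pools.
// An empty zone yields ".../locations//clusters/...", which the API rejects
// with: Location "" does not exist.
func clusterPath(project, zone, cluster string) string {
	return fmt.Sprintf("projects/%s/locations/%s/clusters/%s", project, zone, cluster)
}

func main() {
	zone, err := zoneFromKubeEnv()
	if err != nil {
		log.Printf("[Error] %s", err)
	}
	// With zone == "", this logs a path the GKE API rejects.
	log.Printf("Loading node pools for: %s", clusterPath("my-project", zone, "my-cluster"))
}
```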

Possible cause

This may not be caused by the absence of kube-env metadata, but rather by a lack of access to it. GKE offers "metadata concealment", which specifically calls out kube-env as data to be hidden. kube-env is also mentioned in GKE's NodeMetadata config "SECURE" setting.

Possible solution

The reporting user has suggested a different attribute to use: cluster-location

curl -L -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/cluster-location
europe-west2

If this is a stable attribute provided by GKE-provisioned VMs, this probably works. We could also investigate using v1/instance/zone as an alternative; it seems to be officially guaranteed on all GCP VMs. Other stable sources of node(pool) location information may be preferable; I just haven't dug deep enough to find them yet.
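
For reference, a short Go sketch of what reading those two alternatives could look like via the cloud.google.com/go/compute/metadata package; treat this as exploratory, not a proposed patch:

```go
package main

import (
	"fmt"
	"log"

	"cloud.google.com/go/compute/metadata"
)

func main() {
	// User-suggested attribute; on the reporting user's node this returned
	// "europe-west2" (the cluster location, which may be a region).
	loc, err := metadata.InstanceAttributeValue("cluster-location")
	if err != nil {
		log.Printf("cluster-location not available: %v", err)
	}
	fmt.Println("cluster-location:", loc)

	// Alternative: the VM's own zone, documented for all GCE instances,
	// but per-node (e.g. "europe-west2-a") rather than cluster-wide.
	zone, err := metadata.Zone()
	if err != nil {
		log.Printf("instance zone not available: %v", err)
	}
	fmt.Println("instance zone:", zone)
}
```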

Other considerations

It is currently unclear whether this affects all GKE environments or only those with a certain version, region, or configuration (e.g. metadata concealment). Any fix here should be tested on earlier GKE versions to ensure compatibility.

Adam-Stack-PM commented 2 years ago

@michaelmdresser We need a priority status on this. Can it wait till v1.98 or does it need to go into v1.97?

mbolt35 commented 2 years ago

@michaelmdresser One thing I wanted to point out here: it's important to get this right for multi-zone clusters (we need the zone for the "master"). I didn't know about cluster-location, but that seems like it could be adequate. I'm pretty sure that v1/instance/zone just gives you the local zone (relative to the node/pod).

mbolt35 commented 2 years ago

I think the concealment of kube-env is because of this exploit, so we'll definitely want to find a way around this.

mbolt35 commented 2 years ago

And yes, it looks like cluster-location is likely what we want: https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity

mbolt35 commented 2 years ago

I think this is one of those things where, if concealment is enabled, we can expect the new version of workload-identity. So maybe we can fall back to the "new" approach if the old approach doesn't work.
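
A rough Go sketch of what that fallback could look like; the kube-env parsing is stubbed out, the helper names are hypothetical, and only the control flow is the point:

```go
package main

import (
	"fmt"
	"log"

	"cloud.google.com/go/compute/metadata"
)

// locationFromKubeEnv stands in for the existing kube-env-based lookup; the
// real parsing is omitted because only the fallback control flow matters here.
func locationFromKubeEnv() (string, error) {
	raw, err := metadata.InstanceAttributeValue("kube-env")
	if err != nil {
		return "", err // concealed or absent
	}
	_ = raw // real code would parse the zone/location out of kube-env here
	return "", fmt.Errorf("kube-env parsing omitted in this sketch")
}

// clusterLocation tries the legacy kube-env route first and, when kube-env is
// concealed or absent, falls back to the cluster-location attribute.
func clusterLocation() (string, error) {
	if loc, err := locationFromKubeEnv(); err == nil && loc != "" {
		return loc, nil
	}
	return metadata.InstanceAttributeValue("cluster-location")
}

func main() {
	loc, err := clusterLocation()
	if err != nil {
		log.Fatalf("could not determine cluster location: %v", err)
	}
	log.Printf("cluster location: %s", loc)
}
```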

michaelmdresser commented 2 years ago

Thanks for the extra digging Bolt! Should be super helpful for a fix.

> We need a priority status on this. Can it wait till v1.98 or does it need to go into v1.97?

@Adam-Stack-PM I haven't had time to investigate the impact, so I can't give it a priority. This may affect most GKE clusters (high priority, probably 1.97) or very few (on the fence about 1.97 vs. 1.98).

Adam-Stack-PM commented 2 years ago

@michaelmdresser, Thanks for the context here. I am labeling it P1 for now with a requirement to understand the impact before releasing v1.97.

remithomasn7 commented 2 months ago

I am facing the exact same issue. I noticed that the identified fix has been removed from 1.97. Was there a reason to remove this from the scope of 1.97? Was any impact identified?

Thanks a lot for your support on this.