Open · michaelmdresser opened 2 years ago
@michaelmdresser One thing I wanted to point out here: it's important to get this right for multi-zone clusters (we need the zone for the "master"). I didn't know about `cluster-location`, but that seems like it could be adequate. I'm pretty sure that `v1/instance/zone` just gives you the local zone (relative to the node/pod).

I think the concealment of `kube-env` is because of this exploit, so we'll definitely want to find a way around this.

And yes, it looks like `cluster-location` is likely what we want: https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity

I think this is one of those things where, if concealment is enabled, we can expect the new version of workload-identity. So maybe we can fall back to the "new" approach if the old approach doesn't work.
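For illustration, a minimal sketch of that fallback order, assuming the `cloud.google.com/go/compute/metadata` package that `gkemetadata.go` appears to wrap; `zoneFromKubeEnv` and its `ZONE:` key are hypothetical stand-ins for the existing parsing, not the actual implementation:

```go
package main

import (
	"fmt"
	"strings"

	"cloud.google.com/go/compute/metadata"
)

// clusterLocation tries the legacy kube-env attribute first, then falls back
// to the cluster-location attribute when kube-env is unavailable (for
// example, hidden by GKE metadata concealment).
func clusterLocation() (string, error) {
	if kubeEnv, err := metadata.InstanceAttributeValue("kube-env"); err == nil {
		return zoneFromKubeEnv(kubeEnv)
	}
	// Fallback: cluster-location holds the cluster's zone or region.
	loc, err := metadata.InstanceAttributeValue("cluster-location")
	if err != nil {
		return "", fmt.Errorf("neither kube-env nor cluster-location is available: %w", err)
	}
	return loc, nil
}

// zoneFromKubeEnv is a hypothetical stand-in for the existing parsing in
// gkemetadata.go; the real key inside the kube-env blob may differ.
func zoneFromKubeEnv(kubeEnv string) (string, error) {
	for _, line := range strings.Split(kubeEnv, "\n") {
		if strings.HasPrefix(line, "ZONE: ") {
			return strings.TrimPrefix(line, "ZONE: "), nil
		}
	}
	return "", fmt.Errorf("no ZONE entry found in kube-env")
}

func main() {
	loc, err := clusterLocation()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("cluster location:", loc)
}
```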
Thanks for the extra digging, Bolt! Should be super helpful for a fix.
@michaelmdresser We need a priority status on this. Can it wait till v1.98, or does it need to go into v1.97?
@Adam-Stack-PM I haven't had time to investigate the impact, so I can't give it a priority yet. This may affect most GKE clusters (high priority, probably 1.97) or very few GKE clusters (on the fence about 1.97 vs. 1.98).
@michaelmdresser, thanks for the context here. I am labeling it P1 for now, with a requirement to understand the impact before releasing v1.97.
I am facing the exact same issue. I noticed that the identified fix has been removed from 1.97. Was there a reason to remove this from the scope of 1.97? Was any impact identified?
Thanks a lot for your support on this.
Observed problem
Turndown fails to run on a GKE cluster with the following config info:
Logs from user environment:
Source of the error in code
This "Loading node pools" message, followed by the error comes from here in the GKE provider. https://github.com/kubecost/cluster-turndown/blob/c74e3bbb004805d5821cc0381225f7b55009a58a/pkg/cluster/provider/gkeclusterprovider.go#L169-L183
The request being executed uses a path generator that fills in the empty zone string, causing the error: https://github.com/kubecost/cluster-turndown/blob/c74e3bbb004805d5821cc0381225f7b55009a58a/pkg/cluster/provider/gkeclusterprovider.go#L496-L500
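To make the failure concrete, here is a minimal illustration; the `fmt.Sprintf`-style path builder and the project/cluster names are assumptions modeled on the linked lines, not a copy of them:

```go
package main

import "fmt"

func main() {
	project, zone, cluster := "my-project", "", "my-cluster"

	// With an empty zone, the locations segment of the request path is
	// empty, and the GKE API rejects the malformed resource name.
	path := fmt.Sprintf("projects/%s/locations/%s/clusters/%s", project, zone, cluster)
	fmt.Println(path) // projects/my-project/locations//clusters/my-cluster
}
```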
We're using `md.client.InstanceAttributeValue("kube-env")` to get the GCP zone/location: https://github.com/kubecost/cluster-turndown/blob/c74e3bbb004805d5821cc0381225f7b55009a58a/pkg/cluster/provider/gkemetadata.go#L84-L94
Possible cause
This may not be caused by the absence of `kube-env` metadata, but rather by a lack of access to it. GKE offers "metadata concealment", which specifically calls out `kube-env` as data to be hidden. `kube-env` is also mentioned in GKE's NodeMetadata config "SECURE" setting.
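One way to check this on an affected node: attribute reads are plain HTTP GETs against the local metadata server, so something like the following (roughly what the client library does under the hood) shows whether `kube-env` is reachable; the exact error status returned under concealment is an assumption:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	req, err := http.NewRequest(http.MethodGet,
		"http://metadata.google.internal/computeMetadata/v1/instance/attributes/kube-env", nil)
	if err != nil {
		panic(err)
	}
	// The metadata server rejects requests that lack this header.
	req.Header.Set("Metadata-Flavor", "Google")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)

	// Normally this prints 200 OK and a large kube-env blob; with metadata
	// concealment enabled, the concealment proxy blocks this path instead.
	fmt.Println(resp.Status, "-", len(body), "bytes")
}
```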
Possible solution
The reporting user has suggested a different attribute value to use: `cluster-location`. If this is a stable attribute provided by GKE-provisioned VMs, this probably works. We could also investigate using `v1/instance/zone` as an alternative; it seems to be officially guaranteed on all GCP VMs. Other stable sources of node(pool) location information may be preferable, I just haven't dug deep enough to find them yet. A sketch of both lookups follows.
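A sketch of both candidate lookups, again assuming the `cloud.google.com/go/compute/metadata` package; note that `cluster-location` appears to be a region (not a zone) for regional clusters, while `v1/instance/zone` is always the local node's zone, so neither is automatically the master's zone:

```go
package main

import (
	"fmt"

	"cloud.google.com/go/compute/metadata"
)

func main() {
	// cluster-location: the GKE cluster's location; a zone for zonal
	// clusters and, apparently, a region for regional clusters.
	if loc, err := metadata.InstanceAttributeValue("cluster-location"); err == nil {
		fmt.Println("cluster-location:", loc)
	} else {
		fmt.Println("cluster-location unavailable:", err)
	}

	// v1/instance/zone: available on every GCE VM, but it is the zone of
	// this particular node, not necessarily the master's zone.
	if zone, err := metadata.Zone(); err == nil {
		fmt.Println("instance zone:", zone)
	} else {
		fmt.Println("instance zone unavailable:", err)
	}
}
```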
Other considerations
It is currently unclear whether this affects all GKE environments or only those with a certain version, region, or configuration (e.g. metadata concealment). Any fixes here should be tested on earlier GKE versions to ensure compatibility.