kubernetes-sigs / gcp-compute-persistent-disk-csi-driver

The Google Compute Engine Persistent Disk (GCE PD) Container Storage Interface (CSI) Storage Plugin.
Apache License 2.0

Misleading error when attaching a persistent disk from a different project #1314

Open vit-zikmund opened 1 year ago

vit-zikmund commented 1 year ago

Hi there, just got out of one nice rabbit hole and thought I wouldn't keep the journey to myself :)

TL;DR

Now I know that attaching a disk from one project as a PersistentVolume in a GKE cluster in another project is not allowed, but the driver gave me a really hard time on the way to realizing that. That experience is what I want to improve for any other unfortunate dead-end journeymen.

The rabbit hole

Trust me, it was a well-meant design decision. We decided to keep some persistent disks in their own project (let's call it _DISKPROJECT), outside the scope of an ephemeral project hosting a GKE cluster (_GKEPROJECT). Nothing tripped a warning that this isn't really supposed to work; all the documentation mentions is that one can't attach disks from different zones, and that was no problem. We set it up per the instructions in the GKE docs. All seemed fine until we bound the PV/PVC to a workload, at which point things started sliding the wrong direction without us knowing it yet (the logs are from the workload's namespace events).

Warning  FailedAttachVolume  12s (x8 over 2m24s)  attachdetach-controller  AttachVolume.Attach failed for volume "_PERSISTENT_VOLUMENAME" : rpc error: code = Internal desc = Failed to getDisk: googleapi: Error 403: Required 'compute.disks.get' permission for 'projects/_DISKPROJECT/zones/ZONE/disks/_DISKNAME', forbidden

Treading that path for the first time, it wasn't particularly easy to figure out that the permissions should be granted to the "hidden" engine robot account, but we cracked that one.
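For reference, granting such a permission in the disk's project looks roughly like this (a sketch; the service-agent naming scheme and the role shown are assumptions, and you may well want something narrower than compute.admin):

```shell
# Grant _GKEPROJECT's GKE service agent ("engine robot") access to resources
# in _DISKPROJECT. The service-agent member name below is an assumption.
gcloud projects add-iam-policy-binding _DISKPROJECT \
  --member "serviceAccount:service-_GKEPROJECT_NUMBER@container-engine-robot.iam.gserviceaccount.com" \
  --role "roles/compute.admin"
```

Spoiler: as the rest of this issue shows, no amount of IAM bindings makes the cross-project attach work.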

Curious and cautious about which permissions would be required in total, we naturally hit a couple of new ones right after we had added the previous ones, notably:

37s         Warning   FailedAttachVolume     pod/POD   AttachVolume.Attach failed for volume "_PERSISTENT_VOLUMENAME" : rpc error: code = Internal desc = Failed to Attach: failed cloud service attach disk call: googleapi: Error 403: Required 'compute.instances.attachDisk' permission for 'projects/_DISKPROJECT/zones/ZONE/instances/_GKENODE' More details: Reason: forbidden, Message: Required 'compute.instances.attachDisk' permission for 'projects/_DISKPROJECT/zones/ZONE/instances/_GKENODE' Reason: forbidden, Message: Required 'compute.disks.use' permission for 'projects/_DISKPROJECT/zones/ZONE/disks/_DISKNAME'

Something's already off in this one: it mentions the _GKENODE in the _DISKPROJECT, while the node actually lives in its own _GKEPROJECT. But we didn't notice, so out of a little desperation we granted the _DISKPROJECT's compute.admin role to the _GKEPROJECT's robot account, at which moment the "403 Forbidden" errors disappeared, but the strangest (though, in hindsight, more obvious) thing happened. The error became a "404 Not Found", because the GCP API couldn't find the node; it doesn't exist in the _DISKPROJECT, only in the _GKEPROJECT:

14s Warning FailedAttachVolume pod/POD AttachVolume.Attach failed for volume "_PERSISTENT_VOLUMENAME" : rpc error: code = Internal desc = Failed to Attach: failed cloud service attach disk call: googleapi: Error 404: The resource 'projects/_DISKPROJECT/zones/ZONE/instances/_GKENODE' was not found, notFound

That was a dead end, and at this point I grew suspicious that the driver (v1.8.7 in our case, but master seems to have it too) "incorrectly" derives the node's project from the disk's. I then found the code that extracts the project from the disk's volumeHandle and uses it in the call to AttachDisk, which eventually leads to the GCP API call that produces the error. Ha!
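To illustrate the assumption (a sketch; the volumeHandle layout here is inferred from the error messages above, not copied from the driver's code):

```shell
# The volumeHandle encodes the *disk's* project as its second path segment:
VOLUME_HANDLE="projects/_DISKPROJECT/zones/ZONE/disks/_DISKNAME"

# The driver reuses that same field when it builds the *instance* path,
# even though the node actually lives in _GKEPROJECT:
PROJECT="$(echo "$VOLUME_HANDLE" | cut -d/ -f2)"
echo "projects/$PROJECT/zones/ZONE/instances/_GKENODE"
# → projects/_DISKPROJECT/zones/ZONE/instances/_GKENODE
```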

Before filing a bug for this "obviously" incorrect assumption, I tried to bypass the driver to see whether I'd have more luck calling the API's attachDisk directly (redacted):

curl --request POST \
  --header "Authorization: Bearer $(gcloud auth print-access-token)" \
  --header "Content-Type: application/json" \
  "https://compute.googleapis.com/compute/v1/projects/$GKE_PROJECT/zones/$ZONE/instances/$GKE_NODE/attachDisk" \
  --data "{\"deviceName\": \"$DISK_NAME\", \"source\": \"projects/$DISK_PROJECT/zones/$ZONE/disks/$DISK_NAME\"}"

Finally, it gave me the error I had been missing the whole time:

Invalid value for field 'resource.source': 'projects/_DISKPROJECT/zones/ZONE/disks/_DISKNAME'. Disk must be in the same project as the instance or instance template.

Whoops, OK then!

Next steps

Depending on how strongly the driver wants to enforce that rule itself rather than delegate it to the API, there are two plausible ways to fix this.

  1. The "same project" rule is a solid one, so the driver should warn about such misconfigurations right away. Since the driver already derives the project from the disk's handle, the assumption is baked in; yet nothing warns the user before the GCP API calls start (which sends one down the hopefully very unnecessary permission path to a dead end). An obvious and easy fix would be to add the check to validateControllerPublishVolumeRequest.
  2. The driver should delegate the responsibility to the API (no, please don't! ;)). However nasty, that can still be a legitimate way to let things flow. In that case, the node's project should enter AttachDisk as another argument (instanceProject or whatever) so it gets passed correctly to the API and the user sees the final error.

Either way, the driver currently hides the single most important API error behind its silent assumption, which is hopefully worth tackling.

Thanks for reading this novel up to here :tada: :smile:

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
