kubernetes / cloud-provider-openstack

Apache License 2.0

[cinder-csi-plugin] Detaching a volume on controller has too much of a delay #2661

Open CallMeFoxie opened 1 month ago

CallMeFoxie commented 1 month ago

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug /kind feature

What happened: Whenever we have maintenance in OpenStack, we do a graceful shutdown of a node, which triggers draining of its pods. However, the controller plugin only tries to detach the Cinder volume after the instance has already entered the reboot, at which point the detach is rejected:

I0917 15:21:34.010628       1 csi_handler.go:234] Error processing "csi-045061d9a5678eed8e0d57a59e01420bb5e13899386cc6581378d5558301b39e": failed to detach: rpc error: code = Internal desc = ControllerUnpublishVolume Detach Volume failed with error failed to detach volume cbb2daa9-6cf7-4bd9-aa0b-1902a1b30498 from compute e7eda096-1267-45a1-9164-f7a7d85c7f4e : Expected HTTP response code [202 204] when accessing [DELETE https://api.ouropenst.ack:8774/v2.1/servers/e7eda096-1267-45a1-9164-f7a7d85c7f4e/os-volume_attachments/cbb2daa9-6cf7-4bd9-aa0b-1902a1b30498], but got 409 instead: {"conflictingRequest": {"code": 409, "message": "Cannot 'detach_volume' instance e7eda096-1267-45a1-9164-f7a7d85c7f4e while it is in task_state reboot_started"}}

This means that while the server is in maintenance, we cannot re-attach our volumes to another node:

I0917 15:21:34.429151       1 csi_handler.go:251] Attaching "csi-ae19a426cd7dbc8de74de71d711bc63ae0e71e6c28a0238785e84338ad3beb6c"
I0917 15:21:34.599180       1 csi_handler.go:234] Error processing "csi-ae19a426cd7dbc8de74de71d711bc63ae0e71e6c28a0238785e84338ad3beb6c": failed to attach: rpc error: code = Internal desc = [ControllerPublishVolume] Attach Volume failed with error failed to attach cbb2daa9-6cf7-4bd9-aa0b-1902a1b30498 volume to b3d51121-6c84-4670-a4a7-e83729d2004a compute: Expected HTTP response code [200] when accessing [POST https://api.ouropenst.ack:8774/v2.1/servers/b3d51121-6c84-4670-a4a7-e83729d2004a/os-volume_attachments], but got 400 instead: {"badRequest": {"code": 400, "message": "Invalid volume: volume cbb2daa9-6cf7-4bd9-aa0b-1902a1b30498 is already attached to instances: e7eda096-1267-45a1-9164-f7a7d85c7f4e"}}
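The two failures above can be told apart from the response bodies alone. The following is a minimal sketch, assuming the error bodies always have the JSON shape shown in these logs. `classify_nova_error` and the labels it returns are hypothetical names used for illustration and are not part of cinder-csi-plugin.

```python
import json


def _message(payload, key):
    """Return payload[key]["message"] as a string, or "" if the shape differs."""
    inner = payload.get(key) if isinstance(payload, dict) else None
    return str(inner.get("message", "")) if isinstance(inner, dict) else ""


def classify_nova_error(status_code, body):
    """Map an error response to a coarse label (hypothetical helper).

    Assumes the body format seen in the logs of this report.
    """
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        return "unknown"

    # 409 on detach: the instance is busy with a task such as a reboot.
    if status_code == 409 and "task_state" in _message(payload, "conflictingRequest"):
        return "instance_busy"
    # 400 on attach: the volume is still attached to the previous instance.
    if status_code == 400 and "already attached" in _message(payload, "badRequest"):
        return "still_attached"
    return "unknown"


detach_body = (
    '{"conflictingRequest": {"code": 409, "message": "Cannot \'detach_volume\' '
    'instance e7eda096-1267-45a1-9164-f7a7d85c7f4e while it is in task_state '
    'reboot_started"}}'
)
attach_body = (
    '{"badRequest": {"code": 400, "message": "Invalid volume: volume '
    'cbb2daa9-6cf7-4bd9-aa0b-1902a1b30498 is already attached to instances: '
    'e7eda096-1267-45a1-9164-f7a7d85c7f4e"}}'
)

print(classify_nova_error(409, detach_body))
print(classify_nova_error(400, attach_body))
```

The second error is a consequence of the first: the attach to the new node keeps failing for as long as the detach from the old node is refused.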

What you expected to happen: The volume is unmounted and detached before the node goes into reboot.

How to reproduce it: Do a graceful shutdown of a node in a cluster.

Anything else we need to know?: Don't think so.

Environment:

CallMeFoxie commented 1 month ago

Actually, it might be a problem with the whole idea of ACPI shutdown. OpenStack seems to set the reboot_started task state right away when a user presses ctrl-alt-del / shutdown / whatever button in Horizon (or via the API), so by the time the pods get drained off the node (and the volumes unmounted), OpenStack no longer accepts any volume detachments.
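The ordering problem can be illustrated with a toy model. This is only a sketch under the assumption stated above, that the task state is set as soon as the reboot is requested. `ToyInstance` and `drain_then_detach` are made-up names and do not correspond to OpenStack or plugin code.

```python
class ToyInstance:
    """Toy stand-in for a compute instance; not OpenStack code."""

    def __init__(self):
        self.task_state = None      # None means no task in progress
        self.attached = {"vol-1"}

    def request_reboot(self):
        # Assumption: the flag is set immediately on the shutdown request.
        self.task_state = "reboot_started"

    def detach(self, volume_id):
        # Mirrors the 409 in the log: detach is refused while a task runs.
        if self.task_state is not None:
            raise RuntimeError(
                f"Cannot 'detach_volume' while it is in task_state {self.task_state}"
            )
        self.attached.discard(volume_id)


def drain_then_detach(instance, volume_id):
    """Model the pod drain finishing, followed by the detach call."""
    try:
        instance.detach(volume_id)
        return "detached"
    except RuntimeError:
        return "conflict"


# Order seen in this report: reboot requested first, drain finishes later.
late = ToyInstance()
late.request_reboot()
late_result = drain_then_detach(late, "vol-1")

# Order the report expects: detach completes before the reboot starts.
early = ToyInstance()
early_result = drain_then_detach(early, "vol-1")
early.request_reboot()

print(late_result, sorted(late.attached))
print(early_result, sorted(early.attached))
```

In the first ordering the volume stays attached to the rebooting instance, which matches the "already attached" error on the subsequent attach.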