kubernetes-sigs / aws-ebs-csi-driver

CSI driver for Amazon EBS https://aws.amazon.com/ebs/
Apache License 2.0

Windows node - VolumeAttachment stuck #2100

Open weretheone opened 3 months ago

weretheone commented 3 months ago

/kind bug

What happened? In a mixed cluster, a VolumeAttachment on a Windows node gets stuck for hours with a detach error:

Detach Error: Message: rpc error: code = DeadlineExceeded desc = context deadline exceeded Time: 2024-07-29T06:49:42Z

We have never had this issue on Linux-based nodes, but on Windows nodes we hit it after a certain run time. It usually happens after 2-3 days of uptime, even though the daily workload (and the volume of attach/detach operations) is identical every day.

The volume state is "In-use" in the AWS console, and this can be found in the logs:

ebs-plugin I0729 07:26:42.701157 1 controller.go:463] "ControllerUnpublishVolume: detaching" volumeID="vol-05adad3e3cf7854ab" nodeID="i-0ec4abe6f62602eaa"
ebs-plugin I0729 07:26:44.476134 1 cloud.go:1057] "Waiting for volume state" volumeID="vol-05adad3e3cf7854ab" actual="busy" desired="detached"
ebs-plugin I0729 07:26:46.544284 1 cloud.go:1057] "Waiting for volume state" volumeID="vol-05adad3e3cf7854ab" actual="busy" desired="detached"
ebs-plugin I0729 07:26:49.419443 1 cloud.go:1057] "Waiting for volume state" volumeID="vol-05adad3e3cf7854ab" actual="busy" desired="detached"
ebs-plugin I0729 07:26:53.726079 1 cloud.go:1057] "Waiting for volume state" volumeID="vol-05adad3e3cf7854ab" actual="busy" desired="detached"
ebs-plugin I0729 07:27:00.630341 1 cloud.go:1057] "Waiting for volume state" volumeID="vol-05adad3e3cf7854ab" actual="busy" desired="detached"
ebs-plugin I0729 07:27:12.233899 1 cloud.go:1057] "Waiting for volume state" volumeID="vol-05adad3e3cf7854ab" actual="busy" desired="detached"
ebs-plugin I0729 07:27:32.238823 1 cloud.go:1057] "Waiting for volume state" volumeID="vol-05adad3e3cf7854ab" actual="busy" desired="detached"
ebs-plugin E0729 07:27:42.701238 1 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Could not detach volume \"vol-05adad3e3cf7854ab\" from node \"i-0ec4abe6f62602eaa\": context canceled"
csi-attacher I0729 07:27:42.709052 1 csi_handler.go:234] Error processing "csi-37d0378ea70f8f66ea0687a4b86524b8e89182f80d29903bfcb9056f83393306": failed to detach: rpc error: code = DeadlineExceeded desc = context deadline exceeded

ebs-plugin E0728 19:59:48.121284 7608 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Failed to find device path /dev/xvdaa. disk number for device path \"/dev/xvdaa\" volume id \"vol-05adad3e3cf7854ab\" not found"
ebs-plugin E0728 20:01:52.276644 7608 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Failed to find device path /dev/xvdaa. disk number for device path \"/dev/xvdaa\" volume id \"vol-05adad3e3cf7854ab\" not found"
ebs-plugin E0728 20:03:56.517699 7608 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Failed to find device path /dev/xvdaa. disk number for device path \"/dev/xvdaa\" volume id \"vol-05adad3e3cf7854ab\" not found"
ebs-plugin E0728 20:06:00.523852 7608 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Failed to find device path /dev/xvdaa. disk number for device path \"/dev/xvdaa\" volume id \"vol-05adad3e3cf7854ab\" not found"

This issue has been with us for years, and we were only able to mitigate it with a daily restart of the Windows machines, which somewhat solved the problem, but it would be nice to resolve the root cause.

What you expected to happen? Volumes get detached correctly.

How to reproduce it (as minimally and precisely as possible)? Configure and start the driver with the option enableWindows: true on a mixed cluster with a Windows node. Define a StorageClass and PVC for your workload. We use Argo Workflows to schedule jobs that use the defined PVCs and attach volumes to pods. The pods execute activities and detach the volumes upon completion. After a few days of correct execution, communication between the controller (which runs on a Linux machine) and the ebs-csi-node-windows pod (which runs on a Windows machine) starts to fail, which later results in volumes getting stuck.
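For reference, a minimal manifest pair for this kind of setup might look like the sketch below. The names and the gp3/50Gi parameters are illustrative assumptions, not taken from the reporter's cluster; only the ebs.csi.aws.com provisioner and the ntfs filesystem type are confirmed by the logs later in this thread.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-windows          # illustrative name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3                  # assumed volume type
  csi.storage.k8s.io/fstype: ntfs
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workflow-data        # illustrative name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ebs-windows
  resources:
    requests:
      storage: 50Gi          # assumed size
```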

Anything else we need to know?: We tried various debug methods and read through a ton of issues related to gRPC problems, but we were not able to pinpoint the root cause. Debugging is extremely hard because after a node restart it can go for days without a problem, so any debug options you can suggest would be really appreciated.

Environment

AndrewSirenko commented 3 months ago

This issue has been with us for years, and we were only able to mitigate it with a daily restart of the Windows machines, which somewhat solved the problem, but it would be nice to resolve the root cause.

Thank you for filing this issue; this daily restart really should not be needed. I will try to reproduce the issue this week with a similar cron job.

Debugging is extremely hard because after a node restart it can go for days without a problem, so any debug options you can suggest would be really appreciated.

What kind of debug option would have been helpful for you?

/priority important-soon

AndrewSirenko commented 3 months ago

Also, would you consider filing a customer support request with AWS about this? Going through AWS support would make it easier for the EBS team to collect more details on why the volume is not detaching.

Especially given the comment "The volume state is In-use in the AWS console", which suggests that the driver's EC2 DetachVolume call succeeds asynchronously but the volume never actually detaches at the AWS level, which could be why we see the failed to detach: rpc error: code = DeadlineExceeded desc = context deadline exceeded error.

Thank you

AndrewSirenko commented 3 months ago

Also, the following AWS FAQ item might help in understanding the root cause: Resolve an EBS volume stuck in the detaching state | AWS re:Post

The log message "Waiting for volume state" volumeID="vol-05adad3e3cf7854ab" actual="busy" desired="detached" indicates that the DetachVolume operation is delayed by this busy volume state, which necessitates either unmounting the volume or rebooting the instance.

This is strange, because the CSI driver node pod should have already succeeded in unmounting the volume by the time the CSI controller pod calls EC2 DetachVolume. I will continue to investigate the issue.

Thank you.

weretheone commented 3 months ago

@AndrewSirenko thanks a lot for jumping right into this topic. Hope you will get lucky reproducing the problem.

AndrewSirenko commented 3 months ago

For debug options, I think anything which shows what calls were made to AWS, and at what time, would be useful.

I believe you can already see what calls are made to AWS with controller.sdkDebugLog=true in our Helm chart or the EKS add-on (i.e. add --set controller.sdkDebugLog=true to your helm upgrade command). Be warned that this prints the entire HTTP request to the logs, which may include sensitive info (see the redacted example below).

You can also get more detailed CSI driver node pod logs with node.logLevel=7, which would help us see whether anything odd happened during the volume's unmount in the NodeUnpublishVolume/NodeUnstageVolume RPC calls that must succeed before ControllerUnpublishVolume.
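Both flags can be applied in one upgrade. A sketch, assuming the common defaults (release name aws-ebs-csi-driver in kube-system; adjust to match your install):

```shell
# --reuse-values keeps your existing chart values and only overlays the
# two debug settings. Remember to revert them once logs are collected.
helm upgrade aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --reuse-values \
  --set controller.sdkDebugLog=true \
  --set node.logLevel=7
# Caution: sdkDebugLog prints full HTTP requests, which may include
# sensitive headers; scrub logs before sharing them.
```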

We do not have any support plan at the moment with AWS

Understood. If/when we reproduce the issue then I can file a support ticket myself.

In the meantime, while we try to reproduce on our end, if you could send any high-verbosity (node.logLevel=7) node pod logs from the relevant node (or even kubelet logs), it would help us see whether an issue with the CSI driver's unmount operation might have caused the EC2 DetachVolume API call to get stuck in busy. Thank you!


SDK 2024/07/30 13:24:44 DEBUG Request
POST / HTTP/1.1
... REDACTED headers
X-Amz-Security-Token: REDACTED

Action=DetachVolume&InstanceId=i-04ac&Version=2016-11-15&VolumeId=vol-0e
SDK 2024/07/30 13:24:44 DEBUG Response
HTTP/1.1 200 OK
Content-Length: 412
Cache-Control: no-cache, no-store
Content-Type: text/xml;charset=UTF-8
Date: Tue, 30 Jul 2024 13:24:44 GMT
Server: AmazonEC2
Strict-Transport-Security: max-age=31536000; includeSubDomains

<?xml version="1.0" encoding="UTF-8"?>
<DetachVolumeResponse xmlns="http://ec2.amazonaws.com/doc/2016-11-15/">
    <requestId>f178c6f1-</requestId>
    <volumeId>vol-</volumeId>
    <instanceId>i-</instanceId>
    <device>/dev/xvdaa</device>
    <status>detaching</status>
    <attachTime>2024-07-30T01:17:34.000Z</attachTime>
</DetachVolumeResponse>
weretheone commented 3 months ago

Thanks for the details, I was not aware that you can set separate log levels for the controller and the node. I set up logLevel=7 on our dev environment, where the issue happens rarely (I guess because we do much less attach/detach activity compared to production).

I have to observe how much log this produces, and what exactly is pushed to the output, before I set it up on production, where the issue reproduces much faster. I'll let you know when I succeed.

weretheone commented 3 months ago

The volume is not yet stuck on the Windows node, but here is what you can already see in the node pod's log. FYI, these are still from production, where the log level has not yet been raised.

liveness-probe W0730 08:48:49.736097 6692 connection.go:234] Still connecting to unix:/csi/csi.sock
liveness-probe W0730 08:48:59.622531 6692 connection.go:234] Still connecting to unix:/csi/csi.sock
ebs-plugin I0730 08:49:13.590787 4200 ec2.go:40] "Retrieving EC2 instance identity metadata" regionFromSession=""
ebs-plugin I0730 08:49:18.452961 4200 metadata.go:48] "ec2 metadata is available"
ebs-plugin I0730 08:49:18.469549 4200 driver.go:68] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.30.0"
ebs-plugin I0730 08:49:18.469549 4200 node.go:97] "regionFromSession Node service" region=""
node-driver-registrar I0730 08:47:15.681951 14716 main.go:135] Version: v2.10.1
node-driver-registrar I0730 08:47:19.551009 14716 main.go:136] Running node-driver-registrar in mode=
node-driver-registrar I0730 08:47:19.551009 14716 main.go:157] Attempting to open a gRPC connection with: "unix:/csi/csi.sock"
node-driver-registrar I0730 08:47:21.226395 14716 main.go:164] Calling CSI driver to discover driver name
node-driver-registrar I0730 08:47:22.328214 14716 main.go:173] CSI driver name: "ebs.csi.aws.com"
node-driver-registrar I0730 08:47:22.363838 14716 node_register.go:55] Starting Registration Server at: /registration/ebs.csi.aws.com-reg.sock
node-driver-registrar I0730 08:47:23.068482 14716 node_register.go:64] Registration Server started at: /registration/ebs.csi.aws.com-reg.sock
node-driver-registrar I0730 08:47:23.084280 14716 node_register.go:88] Skipping HTTP server because endpoint is set to: ""
node-driver-registrar I0730 08:47:23.961337 14716 main.go:90] Received GetInfo call: &InfoRequest{}
node-driver-registrar I0730 08:47:24.347512 14716 main.go:101] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
liveness-probe W0730 08:49:09.708494 6692 connection.go:234] Still connecting to unix:/csi/csi.sock
liveness-probe W0730 08:49:19.599337 6692 connection.go:234] Still connecting to unix:/csi/csi.sock
liveness-probe I0730 08:49:20.045863 6692 main.go:133] "Calling CSI driver to discover driver name"
liveness-probe I0730 08:49:23.012409 6692 main.go:141] "CSI driver name" driver="ebs.csi.aws.com"
liveness-probe I0730 08:49:23.012409 6692 main.go:170] "ServeMux listening" address="0.0.0.0:9808"
liveness-probe E0730 08:49:29.990645 6692 main.go:67] "Failed to establish connection to CSI driver" err="context deadline exceeded"
ebs-plugin I0730 08:49:21.171044 4200 node.go:940] "CSINode Allocatable value is set" nodeName="ip----.us-west-2.compute.internal" count=27
ebs-plugin E0730 08:52:16.938386 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
ebs-plugin E0730 08:52:18.761474 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
ebs-plugin E0730 08:52:21.621097 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
ebs-plugin E0730 08:52:27.253570 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
ebs-plugin E0730 08:52:37.018713 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
ebs-plugin E0730 08:52:53.618534 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
ebs-plugin E0730 08:53:26.430423 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
ebs-plugin E0730 08:54:31.468045 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
ebs-plugin E0730 08:56:34.111954 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
ebs-plugin E0730 09:12:17.726274 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-01fff9c86396c9d34\" is already in progress"
ebs-plugin E0730 09:12:20.109665 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-01fff9c86396c9d34\" is already in progress"
ebs-plugin E0730 09:12:23.292616 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-01fff9c86396c9d34\" is already in progress"
ebs-plugin E0730 09:14:16.412201 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-01fff9c86396c9d34\" is already in progress"
ebs-plugin E0730 09:14:24.441689 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-01fff9c86396c9d34\" is already in progress"
ebs-plugin E0730 09:32:18.601399 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-05adad3e3cf7854ab\" is already in progress"
ebs-plugin E0730 09:32:20.280498 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-05adad3e3cf7854ab\" is already in progress"
ebs-plugin E0730 09:32:22.946024 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-05adad3e3cf7854ab\" is already in progress"
ebs-plugin E0730 09:32:27.456541 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-05adad3e3cf7854ab\" is already in progress"

The controller log shows normal activity if I search for the volume ID:

ebs-plugin I0731 07:30:09.377907 1 controller.go:422] "ControllerPublishVolume: attached" volumeID="vol-05adad3e3cf7854ab" nodeID="i-0fd306e921dd61994" devicePath="/dev/xvdaa"
ebs-plugin I0731 07:31:13.067288 1 controller.go:463] "ControllerUnpublishVolume: detaching" volumeID="vol-05adad3e3cf7854ab" nodeID="i-XXXXXXXXXXXXXXX"
ebs-plugin I0731 07:31:14.871563 1 cloud.go:1057] "Waiting for volume state" volumeID="vol-05adad3e3cf7854ab" actual="detaching" desired="detached"
ebs-plugin I0731 07:31:16.944004 1 cloud.go:1057] "Waiting for volume state" volumeID="vol-05adad3e3cf7854ab" actual="detaching" desired="detached"
ebs-plugin I0731 07:31:19.949348 1 controller.go:471] "ControllerUnpublishVolume: detached" volumeID="vol-05adad3e3cf7854ab" nodeID="i-XXXXXXXXXXXXXXX"

Still, these gRPC errors look problematic. Do you have any idea how I can debug further to see which operations are in progress and contending?

AndrewSirenko commented 3 months ago

This error

ebs-plugin E0730 09:32:27.456541 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume="vol-05adad3e3cf7854ab" is already in progress"

typically means that a node driver remote procedure call (e.g. NodeStageVolume or NodeUnpublishVolume, i.e. formatting, mounting, or unmounting the volume) took so long that the kubelet gave up listening for the original request. This can happen, for example, if an unmount takes more than 10 minutes for some reason. In the background the node driver keeps performing the underlying system operation, but in the meantime it returns operation ... is already in progress to any new RPCs so that the driver doesn't run two simultaneous unmounts at once (which would be unsafe).
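The "already in progress" guard described above is essentially a mutex-protected set of in-flight volume IDs, keyed per volume. A minimal sketch of the pattern (illustrative; the names mirror the inflight.go log lines in this thread but are not the exact upstream API):

```go
package main

import (
	"fmt"
	"sync"
)

// InFlight tracks which volumes currently have an operation running, so the
// driver never performs two operations on the same volume at once.
type InFlight struct {
	mu     sync.Mutex
	active map[string]struct{}
}

func NewInFlight() *InFlight {
	return &InFlight{active: make(map[string]struct{})}
}

// Insert returns false when an operation for key is already running; the
// driver surfaces that as the gRPC Aborted "already in progress" error.
func (i *InFlight) Insert(key string) bool {
	i.mu.Lock()
	defer i.mu.Unlock()
	if _, ok := i.active[key]; ok {
		return false
	}
	i.active[key] = struct{}{}
	return true
}

// Delete releases the lock, matching the "volume operation finished" log line.
func (i *InFlight) Delete(key string) {
	i.mu.Lock()
	defer i.mu.Unlock()
	delete(i.active, key)
}

func main() {
	inflight := NewInFlight()
	fmt.Println(inflight.Insert("vol-05adad3e3cf7854ab")) // lock acquired
	fmt.Println(inflight.Insert("vol-05adad3e3cf7854ab")) // rejected: already in progress
	inflight.Delete("vol-05adad3e3cf7854ab")
	fmt.Println(inflight.Insert("vol-05adad3e3cf7854ab")) // acquired again after release
}
```

This is why a single slow unmount produces a long stream of Aborted errors: every kubelet retry hits the held lock until the original operation finishes and Delete runs.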

Increased node debug logs might show you which operation was slow.

We can take an action item to make this error message clearer by at least noting which gRPC call it was. But I don't think that is the concern for now.

Thanks for attaching these logs! Still no luck on my Windows EKS cluster, but perhaps the weekend will bring the issue up locally.

weretheone commented 3 months ago

Hello @AndrewSirenko

the issue did not come up during the weekend. Meanwhile, debug mode was switched on, and as far as I can see the operations, both attach and detach, go well. This was a successful execution of operations on the volume I had the issue with last time:

ebs-plugin I0805 08:30:19.194594   12216 node.go:116] "NodeStageVolume: called" args={"volume_id":"vol-05adad3e3cf7854ab","publish_context":{"devicePath":"/dev/xvdaa"},"staging_target_path":"\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ntfs"}},"access_mode":{"mode":1}},"volume_context":{"storage.kubernetes.io/csiProvisionerIdentity":"1721215862357-6437-ebs.csi.aws.com"}}
ebs-plugin I0805 08:30:22.471163   12216 node.go:218] "NodeStageVolume: find device path" devicePath="/dev/xvdaa" source="1"
ebs-plugin I0805 08:30:23.047870   12216 node.go:246] "NodeStageVolume: checking if volume is already staged" device="" source="1" target="\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount"
ebs-plugin I0805 08:30:23.047870   12216 node.go:253] "NodeStageVolume: staging volume" source="1" volumeID="vol-05adad3e3cf7854ab" target="\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount" fstype="ntfs"
ebs-plugin I0805 08:30:28.931169   12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:30:39.033916   12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:30:46.809994   12216 node.go:297] "NodeStageVolume: successfully staged volume" source="1" volumeID="vol-05adad3e3cf7854ab" target="\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount" fstype="ntfs"
ebs-plugin I0805 08:30:46.809994   12216 node.go:195] "NodeStageVolume: volume operation finished" volumeID="vol-05adad3e3cf7854ab"
ebs-plugin I0805 08:30:46.811431   12216 inflight.go:74] "Node Service: volume operation finished" key="vol-05adad3e3cf7854ab"
ebs-plugin I0805 08:30:47.247914   12216 node.go:562] "NodeGetCapabilities: called" args={}
ebs-plugin I0805 08:30:47.315271   12216 node.go:562] "NodeGetCapabilities: called" args={}
ebs-plugin I0805 08:30:47.658103   12216 node.go:562] "NodeGetCapabilities: called" args={}
ebs-plugin I0805 08:30:47.819075   12216 node.go:416] "NodePublishVolume: called" args={"volume_id":"vol-05adad3e3cf7854ab","publish_context":{"devicePath":"/dev/xvdaa"},"staging_target_path":"\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount","target_path":"c:\\var\\lib\\kubelet\\pods\\7f1f887a-808f-424f-bde8-a834fca9ea0b\\volumes\\kubernetes.io~csi\\pvc-67998e30-a835-4824-a0e3-349f2067c062\\mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ntfs"}},"access_mode":{"mode":1}},"volume_context":{"storage.kubernetes.io/csiProvisionerIdentity":"1721215862357-6437-ebs.csi.aws.com"}}
ebs-plugin I0805 08:30:47.821039   12216 mount_windows.go:75] "NodePublishVolume: removing dir" target="c:\\var\\lib\\kubelet\\pods\\7f1f887a-808f-424f-bde8-a834fca9ea0b\\volumes\\kubernetes.io~csi\\pvc-67998e30-a835-4824-a0e3-349f2067c062\\mount"
ebs-plugin I0805 08:30:47.823230   12216 node.go:713] "[Debug] NodePublishVolume: Target path does not exist" target="c:\\var\\lib\\kubelet\\pods\\7f1f887a-808f-424f-bde8-a834fca9ea0b\\volumes\\kubernetes.io~csi\\pvc-67998e30-a835-4824-a0e3-349f2067c062\\mount"
ebs-plugin I0805 08:30:47.823230   12216 node.go:757] "NodePublishVolume: mounting" source="\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount" target="c:\\var\\lib\\kubelet\\pods\\7f1f887a-808f-424f-bde8-a834fca9ea0b\\volumes\\kubernetes.io~csi\\pvc-67998e30-a835-4824-a0e3-349f2067c062\\mount" mountOptions=["bind"] fsType="ntfs"
ebs-plugin I0805 08:30:47.824150   12216 node.go:445] "NodePublishVolume: volume operation finished" volumeId="vol-05adad3e3cf7854ab"
ebs-plugin I0805 08:30:47.824150   12216 inflight.go:74] "Node Service: volume operation finished" key="vol-05adad3e3cf7854ab"
ebs-plugin I0805 08:30:49.040433   12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:30:54.865231   12216 node.go:562] "NodeGetCapabilities: called" args={}
ebs-plugin I0805 08:30:54.870603   12216 node.go:499] "NodeGetVolumeStats: called" args={"volume_id":"vol-05adad3e3cf7854ab","volume_path":"c:\\var\\lib\\kubelet\\pods\\7f1f887a-808f-424f-bde8-a834fca9ea0b\\volumes\\kubernetes.io~csi\\pvc-67998e30-a835-4824-a0e3-349f2067c062\\mount"}
ebs-plugin I0805 08:30:59.060832   12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:31:08.943760   12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:31:18.877191   12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:31:29.036613   12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:31:39.023734   12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:31:45.674123   12216 node.go:469] "NodeUnpublishVolume: called" args={"volume_id":"vol-05adad3e3cf7854ab","target_path":"c:\\var\\lib\\kubelet\\pods\\7f1f887a-808f-424f-bde8-a834fca9ea0b\\volumes\\kubernetes.io~csi\\pvc-67998e30-a835-4824-a0e3-349f2067c062\\mount"}
ebs-plugin I0805 08:31:45.674123   12216 node.go:489] "NodeUnpublishVolume: unmounting" target="c:\\var\\lib\\kubelet\\pods\\7f1f887a-808f-424f-bde8-a834fca9ea0b\\volumes\\kubernetes.io~csi\\pvc-67998e30-a835-4824-a0e3-349f2067c062\\mount"
ebs-plugin I0805 08:31:45.675213   12216 node.go:485] "NodeUnPublishVolume: volume operation finished" volumeId="vol-05adad3e3cf7854ab"
ebs-plugin I0805 08:31:45.675213   12216 inflight.go:74] "Node Service: volume operation finished" key="vol-05adad3e3cf7854ab"
ebs-plugin I0805 08:31:46.033755   12216 node.go:562] "NodeGetCapabilities: called" args={}
ebs-plugin I0805 08:31:46.035954   12216 node.go:302] "NodeUnstageVolume: called" args={"volume_id":"vol-05adad3e3cf7854ab","staging_target_path":"\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount"}
ebs-plugin I0805 08:31:48.932973   12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:31:49.451303   12216 safe_mounter_windows.go:234] "GetDeviceNameFromMount called" diskNumber=1 volumeID="\\\\?\\Volume{6b1309bd-a16a-41a3-8664-c20d40c32b9a}\\" mountPath="\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount"
ebs-plugin I0805 08:31:49.451303   12216 node.go:342] "NodeUnstageVolume: unmounting" target="\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount"
ebs-plugin I0805 08:31:58.891739   12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:32:02.016531   12216 safe_mounter_windows.go:138] "Successfully unmounted volume" diskNumber=1 volumeId="\\\\?\\Volume{6b1309bd-a16a-41a3-8664-c20d40c32b9a}\\" target="\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount"
ebs-plugin I0805 08:32:02.016531   12216 node.go:347] "NodeUnStageVolume: successfully unstaged volume" volumeID="vol-05adad3e3cf7854ab" target="\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount"
ebs-plugin I0805 08:32:02.016531   12216 node.go:317] "NodeUnStageVolume: volume operation finished" volumeID="vol-05adad3e3cf7854ab"
ebs-plugin I0805 08:32:02.016531   12216 inflight.go:74] "Node Service: volume operation finished" key="vol-05adad3e3cf7854ab"

Sorry but as it looks I still have to wait.

weretheone commented 3 months ago

This morning the issue occurred again, and I've seen two volumes, vol-0a959b8cf5f76a3eb and vol-05adad3e3cf7854ab, stuck for hours.

For debugging purposes I searched for the string "vol-05adad3e3cf7854ab" which I mentioned earlier in the ticket. This way I was able to pinpoint the exact time the issue happened. In the attached picture you can see that normal execution happened every hour yesterday until 3:30, when something happened and the volume ID was logged continuously in the node's log for 210 minutes. During this time the pod was active; then, since we have an Argo activeDeadlineSeconds set to 210 minutes, the pod was terminated by Argo. After that, the controller's log shows an increased number of entries for the volume_id.

[screenshot]

I checked the two timepoints to extract the logs around them. In the csi-attacher, it seems the last successful attachment was at 03:30, and detach was never called until the pod was cancelled.

[screenshot]

Other than this, I only found messages I was already familiar with. What messages or strings should I look for?

AndrewSirenko commented 3 months ago

Ah thank you for the reproduction case.

What messages or strings should I look for?

Do you have any logs from the stuck hours on the driver node pod (kube-system, ebs-csi-node or ebs-csi-node-windows) near a NodeUnpublishVolume or NodeUnstageVolume message?

The node DaemonSet pod would give us a hint as to why the volume might not be detaching properly, maybe some error message in the unmount that would give us a clue.


From the logs I see above, I can only gather that the controller pod cannot detach the volume, which we already know. I assume this is because of the same busy issue from before.

[screenshot]

From the screenshot above, it looks like there is a lot of activity in the node pod before the activity in the controller pod (compared to the usual quick spike of node pod logs followed by a quick spike of controller logs, as seen on the left side of the graph). Could you copy-paste some of those logs?

weretheone commented 3 months ago

Looking into the logs, the last successful execution with "small spikes" was at 02:32, when both NodeUnpublishVolume and NodeUnstageVolume were called. After that I see no occurrence of this message anymore in the logs: [screenshot]

So I searched for the NodeStageVolume message, which is called first and then immediately followed by this gRPC error and two info messages saying the volume operation finished:

E0806 01:30:18.301730 12216 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Failed to find device path /dev/xvdab. disk number for device path \"/dev/xvdab\" volume id \"vol-05adad3e3cf7854ab
I0806 01:30:18.301730 12216 node.go:195] "NodeStageVolume: volume operation finished" volumeID="vol-05adad3e3cf7854ab"
I0806 01:30:18.301730 12216 inflight.go:74] "Node Service: volume operation finished" key="vol-05adad3e3cf7854ab"

[screenshot] The rest of the messages from the node are continuous cycles of NodeStageVolume -> GRPC error -> "volume operation finished". These messages continue until the pod is terminated on its deadline.

Meanwhile, this can be seen on the controller: [screenshot]

So, as I understand it, the "stage" call goes through, but because of some communication issue from the Windows node it keeps being called, and the success is never registered despite the node reporting: "Node Service: volume operation finished" key="vol-05adad3e3cf7854ab"

The controller keeps the volume in the batch describe until the deadline cancellation of the pod, when it tries to detach the volume. [screenshot]

Here we see an actual volume state of "detaching" for a minute; then it becomes "busy" and stays like that forever.

AndrewSirenko commented 3 months ago

Hey @weretheone, thank you for reproducing and for your patience. This does not look like intended CSI Driver behavior, but I believe we are getting closer to a solution.

Your screenshots and timeline point to an issue where the NodeStageVolume RPC never succeeds at the kubelet + EBS CSI driver node pod level, even though the actual format + mount succeeds. This leads to NodeUnstageVolume never being called, which means the volume never gets unmounted, which would explain the busy volume state when we try to detach it (the EBS docs state that a volume MUST be unmounted before it can be detached).

I've taken an action item to take a closer look at our NodeStageVolume Windows code path. If the format + mount succeeded, there is no reason the gRPC error should keep occurring.


Can you help answer a few followups, given that I have yet to reproduce this on my Windows cluster?

  1. Does this timeline look right to you?

02:30: The last time `NodeUnpublishVolume`/`NodeUnstageVolume` were called

03:30: `NodeStageVolume` was called and failed with `"GRPC error" err="rpc error: code = Internal desc = Failed to find device path /dev/xvdab. disk number for device path "/dev/xvdab" volume id "vol-05adad3e3cf7854ab`

03:30 - ??: `NodeStageVolume` keeps being retried, each attempt failing with the gRPC error but logging "volume operation finished"

07:05: Pod cancellation -> Detach

07:10 until node reboot: ControllerUnpublish failures due to state `busy`

  2. Did the pod ever start running? Or was it always stuck in ContainerCreating? (Asking because the pod should never start running if the NodeStageVolume RPC never succeeded.)

  3. Can you dump the driver node and controller pod logs into a pastebin for us to look at? It's a bit hard to piece together these screenshots, and ControllerUnpublishVolume should never have been called if NodeUnstageVolume/NodeUnpublishVolume were never called.

  4. Would you have access to kubelet-level logs for the node? Or any instance logs that would show the actual Windows mount/format syscalls? If so, can you attach them?

  5. Next time you catch a node stuck like this, would you be able to manually schedule another EBS-backed stateful workload on that cluster before the node is rebooted? I want to isolate this issue to either "NodeStageVolume doesn't work for the whole node" or "something is wrong with this particular volume / devicePath".

  6. Next time you catch a node stuck like this, what happens if you manually unmount the volume by SSH-ing into the node? If you manually unmount it, the next detach volume call may succeed, which would isolate the problem to NodeStageVolume not properly reporting a success.

  7. Can you share the AMI you are using, if it is not private, so I can better mirror your workflow to reproduce? I assume I can also simulate your workflow by deploying an hourly stateful cron job that writes to and reads from an EBS-backed mounted filesystem.

weretheone commented 3 months ago

Hi @AndrewSirenko,

for sure, I will try to answer your follow up questions.

  1. The timeline is correct.
  2. The pod itself never starts running.
  3. For the node and controller logs, I have tried to collect the data, but the number of messages is just too high. For the period between 2:30 and 7:30, there are more than a thousand messages containing "vol-05adad3e3cf7854ab". I downloaded them as CSV and sanitized them, but the result is still too big for a pastebin entry. Should we use e-mail or another method for this?
  4. With the current infrastructure setup I do not have access.
  5. The Windows node currently handles 5 separate workflows, with an EBS volume behind each. We have never experienced all of them failing together. It usually starts with one volume stuck, and then after a few hours the next one follows. For example, right now I have one that failed 28 hours ago and one that failed two hours ago.
  6. With the current setup I can't SSH into the node.
  7. ami-070cbfb3777087dd5 (Windows_Server-2022-English-Full-EKS_Optimized-1.29-2024.07.10). Yes, the workload works just like that: every hour it attaches the required volume, runs the pod, and detaches the volume.

AndrewSirenko commented 3 months ago

These followup answers are very helpful.

This looks more like a "Kubernetes never realized that the volume was mounted" problem than a "volume cannot detach" problem, which narrows down our search. The root cause of busy seems to be that the node driver pod never unmounts the volume (NodeUnstageVolume) because NodeStageVolume was never reported as successful to the kubelet.

Regarding email, mine is "andrewsirenko" at gmail.

Thank you for the AMI, I will use it to try to find a repro.

I'm not sure I can give you an ETA on a fix, but we will look into this this month, especially now that we have a hint as to where the problem lies.

Thank you so much!

weretheone commented 3 months ago

@AndrewSirenko sorry, I was away for personal reasons; I just sent over the extracted and sanitized logs for you.

weretheone commented 1 week ago

Hello @AndrewSirenko, I forgot about this issue because of other responsibilities, but it's still affecting our Windows operation. Did you have time to look into the topic?

AndrewSirenko commented 1 week ago

Hi @weretheone, without a local reproduction of the issue on our end we are a bit stuck.

We would like to help, but will need more info from you. Would you be able to do one of the following three options:

  1. Give us exact reproduction steps and manifests so we can reproduce the issue locally?
  2. Sign up for premium support and go through official AWS support channels (to get the appropriate permissions to go through anonymized cluster logs)?
  3. Schedule a live debugging call next time the issue shows up on the cluster?
weretheone commented 5 days ago

Hi @AndrewSirenko

sure, options 2 and 3 are both fine for me. I wasn't able to reproduce the issue locally.

Let's have a call. Then, if needed, we can discuss which AWS support plan I would need to cover these types of support activities. I sent my logs from my business e-mail; you can reply there with the call details.