Open weretheone opened 3 months ago
This issue has been with us for years, and we were only able to mitigate it with a daily restart of the Windows machines. That somewhat contains the problem, but it would be nice to resolve the root cause.
Thank you for filing this issue, this daily restart really should not be needed. I will try to reproduce this issue this week with a similar cron job.
Debugging is extremely hard because after a node restart it can go for days without a problem, so if you can suggest any debug options, that would be really helpful.
What kind of debug option would have been helpful for you?
/priority important-soon
Also would you consider filing a customer support request with AWS about this? Going through AWS support would make it easier for EBS to collect more details as to why the volume is not detaching.
Especially given the comment "The volume state is "In-use" on AWS UI", which makes it seem that the driver's EC2 DetachVolume call succeeds asynchronously, but the volume does not detach at the AWS level, which could be why we see the failed to detach: rpc error: code = DeadlineExceeded desc = context deadline exceeded error.
Thank you
Also, the following AWS FAQ item might be helpful for understanding the root cause: Resolve an EBS volume stuck in the detaching state | AWS re:Post
The log message: "Waiting for volume state" volumeID="vol-05adad3e3cf7854ab" actual="busy" desired="detached"
indicates that the DetachVolume operation is delayed because the volume is in the "busy" state, which necessitates either an unmount of the volume or a reboot of the instance.
This is strange because the CSI driver node pod should have already succeeded in unmounting the volume by the time the CSI controller pod calls EC2 DetachVolume. I will continue to investigate the issue.
Thank you.
@AndrewSirenko thanks a lot for jumping right into this topic. Hope you will get lucky reproducing the problem.
So our volume is stuck in the "busy" state, and the options we have are indeed unmount, reboot, or force detach. For debug options, I think anything which shows what calls were made to AWS, and at what time, would be useful.
I believe you can already see what calls are made to AWS with controller.sdkDebugLog=true in our Helm chart or the EKS add-on (i.e., add --set controller.sdkDebugLog=true to your helm upgrade command). Be warned that this will print the entire HTTP request to the logs, which may include sensitive info (see the redacted example below).
You can also get more detailed CSI driver node pod logs with node.logLevel=7, which would help us see if anything weird happened during the volume's unmount in the NodeUnpublishVolume and NodeUnstageVolume RPC calls that must succeed before ControllerUnpublishVolume.
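For reference, both options can be enabled in a single helm upgrade. This is only a sketch: the release name, chart reference, and namespace below are assumptions that depend on how the driver was installed.

```shell
# Sketch: release name "aws-ebs-csi-driver", the upstream chart repo, and the
# kube-system namespace are assumptions; adjust them to match your install.
helm upgrade aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --reuse-values \
  --set controller.sdkDebugLog=true \
  --set node.logLevel=7
```

--reuse-values keeps any values already set on the release so only the two debug options change.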
We do not have any support plan at the moment with AWS
Understood. If/when we reproduce the issue then I can file a support ticket myself.
In the meantime while we try to reproduce on our end, if you could send any high verbosity (node.logLevel=7) node pod logs from the relevant node (or even kubelet logs), it would help us see whether there was an issue with the CSI Driver unmount operation that might have caused the EC2 DetachVolume API call to get stuck in busy. Thank you!
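One hedged way to collect those logs (the label selectors and container names below are assumptions; confirm them with kubectl get pods -n kube-system --show-labels against your cluster):

```shell
# Sketch: dump the node and controller ebs-plugin container logs to files.
# Label selectors and container names are assumptions; verify before running.
kubectl logs -n kube-system -l app=ebs-csi-node -c ebs-plugin \
  --timestamps --tail=-1 > node-pod.log
kubectl logs -n kube-system -l app=ebs-csi-controller -c ebs-plugin \
  --timestamps --tail=-1 > controller-pod.log
```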
SDK 2024/07/30 13:24:44 DEBUG Request
POST / HTTP/1.1
... REDACTED headers
X-Amz-Security-Token: REDACTED
Action=DetachVolume&InstanceId=i-04ac&Version=2016-11-15&VolumeId=vol-0e
SDK 2024/07/30 13:24:44 DEBUG Response
HTTP/1.1 200 OK
Content-Length: 412
Cache-Control: no-cache, no-store
Content-Type: text/xml;charset=UTF-8
Date: Tue, 30 Jul 2024 13:24:44 GMT
Server: AmazonEC2
Strict-Transport-Security: max-age=31536000; includeSubDomains
<?xml version="1.0" encoding="UTF-8"?>
<DetachVolumeResponse xmlns="http://ec2.amazonaws.com/doc/2016-11-15/">
<requestId>f178c6f1-</requestId>
<volumeId>vol-</volumeId>
<instanceId>i-</instanceId>
<device>/dev/xvdaa</device>
<status>detaching</status>
<attachTime>2024-07-30T01:17:34.000Z</attachTime>
</DetachVolumeResponse>
Thanks for the details, I was not aware that you can set separate log levels for the controller and the node. I set logLevel=7 on our dev environment, where the issue happens rarely (I guess because we do much less attach-detach activity compared to production). I have to observe this and see how much log it produces, and what exactly gets pushed into the output, before I set this up on production, where the issue reproduces much faster. I will let you know when I succeed with it.
The volume is not yet stuck on the Windows node, but here is what you can see already in the node pod's log. FYI, these are still from production, where the log level is not yet set higher.
liveness-probe W0730 08:48:49.736097 6692 connection.go:234] Still connecting to unix:/csi/csi.sock
liveness-probe W0730 08:48:59.622531 6692 connection.go:234] Still connecting to unix:/csi/csi.sock
ebs-plugin I0730 08:49:13.590787 4200 ec2.go:40] "Retrieving EC2 instance identity metadata" regionFromSession=""
ebs-plugin I0730 08:49:18.452961 4200 metadata.go:48] "ec2 metadata is available"
ebs-plugin I0730 08:49:18.469549 4200 driver.go:68] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.30.0"
ebs-plugin I0730 08:49:18.469549 4200 node.go:97] "regionFromSession Node service" region=""
node-driver-registrar I0730 08:47:15.681951 14716 main.go:135] Version: v2.10.1
node-driver-registrar I0730 08:47:19.551009 14716 main.go:136] Running node-driver-registrar in mode=
node-driver-registrar I0730 08:47:19.551009 14716 main.go:157] Attempting to open a gRPC connection with: "unix:/csi/csi.sock"
node-driver-registrar I0730 08:47:21.226395 14716 main.go:164] Calling CSI driver to discover driver name
node-driver-registrar I0730 08:47:22.328214 14716 main.go:173] CSI driver name: "ebs.csi.aws.com"
node-driver-registrar I0730 08:47:22.363838 14716 node_register.go:55] Starting Registration Server at: /registration/ebs.csi.aws.com-reg.sock
node-driver-registrar I0730 08:47:23.068482 14716 node_register.go:64] Registration Server started at: /registration/ebs.csi.aws.com-reg.sock
node-driver-registrar I0730 08:47:23.084280 14716 node_register.go:88] Skipping HTTP server because endpoint is set to: ""
node-driver-registrar I0730 08:47:23.961337 14716 main.go:90] Received GetInfo call: &InfoRequest{}
node-driver-registrar I0730 08:47:24.347512 14716 main.go:101] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
liveness-probe W0730 08:49:09.708494 6692 connection.go:234] Still connecting to unix:/csi/csi.sock
liveness-probe W0730 08:49:19.599337 6692 connection.go:234] Still connecting to unix:/csi/csi.sock
liveness-probe I0730 08:49:20.045863 6692 main.go:133] "Calling CSI driver to discover driver name"
liveness-probe I0730 08:49:23.012409 6692 main.go:141] "CSI driver name" driver="ebs.csi.aws.com"
liveness-probe I0730 08:49:23.012409 6692 main.go:170] "ServeMux listening" address="0.0.0.0:9808"
liveness-probe E0730 08:49:29.990645 6692 main.go:67] "Failed to establish connection to CSI driver" err="context deadline exceeded"
ebs-plugin I0730 08:49:21.171044 4200 node.go:940] "CSINode Allocatable value is set" nodeName="ip----.us-west-2.compute.internal" count=27
ebs-plugin E0730 08:52:16.938386 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
ebs-plugin E0730 08:52:18.761474 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
ebs-plugin E0730 08:52:21.621097 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
ebs-plugin E0730 08:52:27.253570 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
ebs-plugin E0730 08:52:37.018713 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
ebs-plugin E0730 08:52:53.618534 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
ebs-plugin E0730 08:53:26.430423 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
ebs-plugin E0730 08:54:31.468045 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
ebs-plugin E0730 08:56:34.111954 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
ebs-plugin E0730 09:12:17.726274 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-01fff9c86396c9d34\" is already in progress"
ebs-plugin E0730 09:12:20.109665 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-01fff9c86396c9d34\" is already in progress"
ebs-plugin E0730 09:12:23.292616 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-01fff9c86396c9d34\" is already in progress"
ebs-plugin E0730 09:14:16.412201 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-01fff9c86396c9d34\" is already in progress"
ebs-plugin E0730 09:14:24.441689 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-01fff9c86396c9d34\" is already in progress"
ebs-plugin E0730 09:32:18.601399 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-05adad3e3cf7854ab\" is already in progress"
ebs-plugin E0730 09:32:20.280498 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-05adad3e3cf7854ab\" is already in progress"
ebs-plugin E0730 09:32:22.946024 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-05adad3e3cf7854ab\" is already in progress"
ebs-plugin E0730 09:32:27.456541 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-05adad3e3cf7854ab\" is already in progress"
The controller log shows normal activity if I search for the volume ID:
ebs-plugin I0731 07:30:09.377907 1 controller.go:422] "ControllerPublishVolume: attached" volumeID="vol-05adad3e3cf7854ab" nodeID="i-0fd306e921dd61994" devicePath="/dev/xvdaa"
ebs-plugin I0731 07:31:13.067288 1 controller.go:463] "ControllerUnpublishVolume: detaching" volumeID="vol-05adad3e3cf7854ab" nodeID="i-XXXXXXXXXXXXXXX"
ebs-plugin I0731 07:31:14.871563 1 cloud.go:1057] "Waiting for volume state" volumeID="vol-05adad3e3cf7854ab" actual="detaching" desired="detached"
ebs-plugin I0731 07:31:16.944004 1 cloud.go:1057] "Waiting for volume state" volumeID="vol-05adad3e3cf7854ab" actual="detaching" desired="detached"
ebs-plugin I0731 07:31:19.949348 1 controller.go:471] "ControllerUnpublishVolume: detached" volumeID="vol-05adad3e3cf7854ab" nodeID="i-XXXXXXXXXXXXXXX"
Still, these gRPC errors look problematic. Do you have any idea how I can debug this further to see which operations are in progress and contending?
This error:
ebs-plugin E0730 09:32:27.456541 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume="vol-05adad3e3cf7854ab" is already in progress"
typically means that a node driver remote procedure call (e.g. NodeStageVolume or NodeUnpublishVolume, i.e. formatting, mounting, or unmounting the volume) took so long that the kubelet gave up on listening to the original request. This can happen, for example, if an unmount takes more than 10 minutes for some reason. In the background the node driver keeps doing the underlying system operation, but in the meantime it returns "operation ... is already in progress" to any new RPCs so that the driver doesn't do two simultaneous unmounts at once (which would be unsafe).
Increased node debug logs might show you which operation was slow.
We can take an action item to make this error message clearer by at least noting which gRPC call it was. But I don't think this is of concern for now.
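As a quick triage step on a saved node pod log, the "already in progress" errors can be grouped by volume ID to see which volumes were stuck and how often they were retried. The sample log below is a trimmed stand-in for illustration; point the grep at your real log file.

```shell
# Create a trimmed stand-in for a real node pod log (illustration only).
cat > node-pod-sample.log <<'EOF'
E0730 08:52:16.938386 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-0a959b8cf5f76a3eb\" is already in progress"
E0730 09:12:17.726274 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-01fff9c86396c9d34\" is already in progress"
E0730 09:32:18.601399 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-05adad3e3cf7854ab\" is already in progress"
E0730 09:32:20.280498 4200 driver.go:107] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given volume=\"vol-05adad3e3cf7854ab\" is already in progress"
EOF
# Count "already in progress" errors per volume ID, most-affected first.
grep 'already in progress' node-pod-sample.log \
  | grep -o 'vol-[0-9a-f]*' \
  | sort | uniq -c | sort -rn
```

The volume with the highest count, and the first/last timestamps of its error lines, bracket the window in which the node RPC was stuck.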
Thanks for attaching these logs! Still no luck on my Windows EKS cluster, but perhaps the weekend will bring the issue up locally.
Hello @AndrewSirenko
the issue did not come up during the weekend. Meanwhile, debug mode was switched on, and as far as I can see and understand, the operations go well for both attach and detach. This was a successful execution of operations on the volume I had the issue with last time:
ebs-plugin I0805 08:30:19.194594 12216 node.go:116] "NodeStageVolume: called" args={"volume_id":"vol-05adad3e3cf7854ab","publish_context":{"devicePath":"/dev/xvdaa"},"staging_target_path":"\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ntfs"}},"access_mode":{"mode":1}},"volume_context":{"storage.kubernetes.io/csiProvisionerIdentity":"1721215862357-6437-ebs.csi.aws.com"}}
ebs-plugin I0805 08:30:22.471163 12216 node.go:218] "NodeStageVolume: find device path" devicePath="/dev/xvdaa" source="1"
ebs-plugin I0805 08:30:23.047870 12216 node.go:246] "NodeStageVolume: checking if volume is already staged" device="" source="1" target="\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount"
ebs-plugin I0805 08:30:23.047870 12216 node.go:253] "NodeStageVolume: staging volume" source="1" volumeID="vol-05adad3e3cf7854ab" target="\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount" fstype="ntfs"
ebs-plugin I0805 08:30:28.931169 12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:30:39.033916 12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:30:46.809994 12216 node.go:297] "NodeStageVolume: successfully staged volume" source="1" volumeID="vol-05adad3e3cf7854ab" target="\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount" fstype="ntfs"
ebs-plugin I0805 08:30:46.809994 12216 node.go:195] "NodeStageVolume: volume operation finished" volumeID="vol-05adad3e3cf7854ab"
ebs-plugin I0805 08:30:46.811431 12216 inflight.go:74] "Node Service: volume operation finished" key="vol-05adad3e3cf7854ab"
ebs-plugin I0805 08:30:47.247914 12216 node.go:562] "NodeGetCapabilities: called" args={}
ebs-plugin I0805 08:30:47.315271 12216 node.go:562] "NodeGetCapabilities: called" args={}
ebs-plugin I0805 08:30:47.658103 12216 node.go:562] "NodeGetCapabilities: called" args={}
ebs-plugin I0805 08:30:47.819075 12216 node.go:416] "NodePublishVolume: called" args={"volume_id":"vol-05adad3e3cf7854ab","publish_context":{"devicePath":"/dev/xvdaa"},"staging_target_path":"\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount","target_path":"c:\\var\\lib\\kubelet\\pods\\7f1f887a-808f-424f-bde8-a834fca9ea0b\\volumes\\kubernetes.io~csi\\pvc-67998e30-a835-4824-a0e3-349f2067c062\\mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ntfs"}},"access_mode":{"mode":1}},"volume_context":{"storage.kubernetes.io/csiProvisionerIdentity":"1721215862357-6437-ebs.csi.aws.com"}}
ebs-plugin I0805 08:30:47.821039 12216 mount_windows.go:75] "NodePublishVolume: removing dir" target="c:\\var\\lib\\kubelet\\pods\\7f1f887a-808f-424f-bde8-a834fca9ea0b\\volumes\\kubernetes.io~csi\\pvc-67998e30-a835-4824-a0e3-349f2067c062\\mount"
ebs-plugin I0805 08:30:47.823230 12216 node.go:713] "[Debug] NodePublishVolume: Target path does not exist" target="c:\\var\\lib\\kubelet\\pods\\7f1f887a-808f-424f-bde8-a834fca9ea0b\\volumes\\kubernetes.io~csi\\pvc-67998e30-a835-4824-a0e3-349f2067c062\\mount"
ebs-plugin I0805 08:30:47.823230 12216 node.go:757] "NodePublishVolume: mounting" source="\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount" target="c:\\var\\lib\\kubelet\\pods\\7f1f887a-808f-424f-bde8-a834fca9ea0b\\volumes\\kubernetes.io~csi\\pvc-67998e30-a835-4824-a0e3-349f2067c062\\mount" mountOptions=["bind"] fsType="ntfs"
ebs-plugin I0805 08:30:47.824150 12216 node.go:445] "NodePublishVolume: volume operation finished" volumeId="vol-05adad3e3cf7854ab"
ebs-plugin I0805 08:30:47.824150 12216 inflight.go:74] "Node Service: volume operation finished" key="vol-05adad3e3cf7854ab"
ebs-plugin I0805 08:30:49.040433 12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:30:54.865231 12216 node.go:562] "NodeGetCapabilities: called" args={}
ebs-plugin I0805 08:30:54.870603 12216 node.go:499] "NodeGetVolumeStats: called" args={"volume_id":"vol-05adad3e3cf7854ab","volume_path":"c:\\var\\lib\\kubelet\\pods\\7f1f887a-808f-424f-bde8-a834fca9ea0b\\volumes\\kubernetes.io~csi\\pvc-67998e30-a835-4824-a0e3-349f2067c062\\mount"}
ebs-plugin I0805 08:30:59.060832 12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:31:08.943760 12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:31:18.877191 12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:31:29.036613 12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:31:39.023734 12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:31:45.674123 12216 node.go:469] "NodeUnpublishVolume: called" args={"volume_id":"vol-05adad3e3cf7854ab","target_path":"c:\\var\\lib\\kubelet\\pods\\7f1f887a-808f-424f-bde8-a834fca9ea0b\\volumes\\kubernetes.io~csi\\pvc-67998e30-a835-4824-a0e3-349f2067c062\\mount"}
ebs-plugin I0805 08:31:45.674123 12216 node.go:489] "NodeUnpublishVolume: unmounting" target="c:\\var\\lib\\kubelet\\pods\\7f1f887a-808f-424f-bde8-a834fca9ea0b\\volumes\\kubernetes.io~csi\\pvc-67998e30-a835-4824-a0e3-349f2067c062\\mount"
ebs-plugin I0805 08:31:45.675213 12216 node.go:485] "NodeUnPublishVolume: volume operation finished" volumeId="vol-05adad3e3cf7854ab"
ebs-plugin I0805 08:31:45.675213 12216 inflight.go:74] "Node Service: volume operation finished" key="vol-05adad3e3cf7854ab"
ebs-plugin I0805 08:31:46.033755 12216 node.go:562] "NodeGetCapabilities: called" args={}
ebs-plugin I0805 08:31:46.035954 12216 node.go:302] "NodeUnstageVolume: called" args={"volume_id":"vol-05adad3e3cf7854ab","staging_target_path":"\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount"}
ebs-plugin I0805 08:31:48.932973 12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:31:49.451303 12216 safe_mounter_windows.go:234] "GetDeviceNameFromMount called" diskNumber=1 volumeID="\\\\?\\Volume{6b1309bd-a16a-41a3-8664-c20d40c32b9a}\\" mountPath="\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount"
ebs-plugin I0805 08:31:49.451303 12216 node.go:342] "NodeUnstageVolume: unmounting" target="\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount"
ebs-plugin I0805 08:31:58.891739 12216 identity.go:61] "Probe: called" args={}
ebs-plugin I0805 08:32:02.016531 12216 safe_mounter_windows.go:138] "Successfully unmounted volume" diskNumber=1 volumeId="\\\\?\\Volume{6b1309bd-a16a-41a3-8664-c20d40c32b9a}\\" target="\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount"
ebs-plugin I0805 08:32:02.016531 12216 node.go:347] "NodeUnStageVolume: successfully unstaged volume" volumeID="vol-05adad3e3cf7854ab" target="\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\ebs.csi.aws.com\\9b24cdbd7deacb244ee207dea54cec25927688665b822fec891aef28f7f18b64\\globalmount"
ebs-plugin I0805 08:32:02.016531 12216 node.go:317] "NodeUnStageVolume: volume operation finished" volumeID="vol-05adad3e3cf7854ab"
ebs-plugin I0805 08:32:02.016531 12216 inflight.go:74] "Node Service: volume operation finished" key="vol-05adad3e3cf7854ab"
Sorry, but it looks like I still have to wait.
Today morning the issue occurred again and I've seen two volumes vol-0a959b8cf5f76a3eb and vol-05adad3e3cf7854ab stuck for hours.
For debugging purposes I searched for the string "vol-05adad3e3cf7854ab", which I mentioned earlier in the ticket. This way I was able to pinpoint the exact time the issue happened. In the attached picture you can see that normal execution happened every hour yesterday until 3:30, when something happened and the volume ID was logged continuously in the node's log for 210 minutes. During this time the pod was active; then, as we have an Argo activeDeadlineSeconds set to 210 minutes, the pod was terminated by Argo.
After this time, the controller's log shows an increased number of entries for the volume ID.
I checked the two time points to extract the logs around them. In the csi-attacher it seems that the last successful attachment was at 03:30, and detaching was never called until the pod was cancelled.
Other than this, I only found messages I was already familiar with. What messages or strings should I look for?
Ah thank you for the reproduction case.
What messages or strings should I look for?
Do you have any logs from the stuck hours on the driver node pod (kube-system, ebs-csi-node or ebs-csi-node-windows subsystem) near a NodeUnpublishVolume or NodeUnstageVolume message?
The node daemonset pod would give us a hint as to why the volume might not be detaching properly. Maybe there is some error message in the unmount that would give us a clue.
From the logs I see above, I can only gather that the controller pod cannot detach the volume, which we already know. I assume this is because of that same busy issue from before.
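Pulling the lines around those unmount-related RPCs out of a saved node-pod log can be done with a context grep. The sample file below is a trimmed stand-in for illustration; substitute your real log.

```shell
# Create a trimmed stand-in for a real node pod log (illustration only).
cat > node-pod-sample.log <<'EOF'
I0805 08:31:45.674123 12216 node.go:469] "NodeUnpublishVolume: called" args={"volume_id":"vol-05adad3e3cf7854ab"}
I0805 08:31:45.674123 12216 node.go:489] "NodeUnpublishVolume: unmounting"
I0805 08:31:46.035954 12216 node.go:302] "NodeUnstageVolume: called" args={"volume_id":"vol-05adad3e3cf7854ab"}
I0805 08:31:49.451303 12216 node.go:342] "NodeUnstageVolume: unmounting"
EOF
# -E enables alternation; -B1/-A1 print one line of context on each side,
# which is where an unmount error would usually appear.
grep -E -B1 -A1 'NodeUnpublishVolume|NodeUnstageVolume' node-pod-sample.log
```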
From the above, it looks like there is a lot of activity in the node pod before the activity in the controller pod (compared to the usual quick spike of node pod logs followed by a quick spike of controller logs, as seen on the left side of the graph). Could you copy and paste some of those logs?
Looking into the logs, the last successful execution with "small spikes" was at 02:32, when both NodeUnpublishVolume and NodeUnstageVolume were called.
Then I see no occurrence of this message anymore in the logs:
So I searched for the NodeStageVolume message, which is called first and then immediately followed by this gRPC error and two info messages that the volume operation finished:
E0806 01:30:18.301730 12216 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Failed to find device path /dev/xvdab. disk number for device path \"/dev/xvdab\" volume id \"vol-05adad3e3cf7854ab
I0806 01:30:18.301730 12216 node.go:195] "NodeStageVolume: volume operation finished" volumeID="vol-05adad3e3cf7854ab"
I0806 01:30:18.301730 12216 inflight.go:74] "Node Service: volume operation finished" key="vol-05adad3e3cf7854ab"
Then the rest of the messages from the node are continuous cycles of NodeStageVolume -> gRPC error -> "volume operation finished". These messages continue until the pod gets terminated on the deadline.
Meanwhile this can be seen on the controller:
So as I understand it, the "stage" call goes through, but because of some communication issue from the Windows node it keeps being called, and the success is never registered, despite the node reporting: "Node Service: volume operation finished" key="vol-05adad3e3cf7854ab"
The controller is keeping the volume in the batch describe until the deadline cancellation of the pod, when it tries to detach the volume.
Here we see an actual volume state of "detaching" for a minute, then it becomes "busy" and stays like that indefinitely.
Hey @weretheone, thank you for reproducing and for your patience. This does not look like intended CSI Driver behavior, but I believe we are getting closer to a solution.
Your screenshots and timeline point to the NodeStageVolume RPC never succeeding at the kubelet + EBS CSI driver node pod level, even though the actual format + mount succeeds. This leads to NodeUnstageVolume never being called, which means the volume never gets unmounted, which would explain the busy volume state when we try to detach it (the EBS docs state that a volume MUST be unmounted before it can be detached).
I've taken an action item to take a closer look at our NodeStageVolume Windows code path. If the format + mount succeeded, there is no reason the gRPC error should keep occurring.
Can you help answer a few followups, given that I have yet to reproduce this on my Windows cluster?
02:30: The last time `NodeUnpublishVolume`/`UnstageVolume` were called
03:30 `NodeStageVolume` was called, failed with `"GRPC error" err="rpc error: code = Internal desc = Failed to find device path /dev/xvdab. disk number for device path "/dev/xvdab" volume id "vol-05adad3e3cf7854ab`
03:30 - ??: `NodeStageVolume` keeps being retried, failing with the gRPC error but logging "volume operation finished"
07:05: Pod cancellation -> Detach
07:10 until node reboot: ControllerUnpublish failures due to state `busy`
Did the pod ever start running? Or was it always stuck in ContainerCreating? (Asking because the pod should never start running if the NodeStageVolume RPC never succeeded.)
Can you dump the driver node and controller pod logs into a pastebin for us to look at? It's a bit hard to piece together these screenshots, and ControllerUnpublishVolume should never have been called if NodeUnstageVolume / NodeUnpublishVolume were never called.
Would you have access to kubelet-level logs for the node? Or any instance logs that would show the actual mount / format Windows syscalls? If so, can you attach them?
Next time you catch a node that is stuck like this, would you be able to manually schedule another EBS-backed stateful workload on that cluster before the node is rebooted? I want to isolate this issue to either "NodeStageVolume doesn't work for whole node" or "Something is wrong with this particular volume / devicePath".
Next time you catch a node that is stuck like this, what happens if you manually unmount the volume by SSHing into the node? If you manually unmount it, then the next detach volume call may succeed, and this would isolate the problem to NodeStageVolume not properly reporting success.
Can you share the AMI you are using, if it is not private, so I can better mirror your workflow when trying to reproduce? I assume I can also simulate your workflow by deploying an hourly stateful cronjob that writes to and reads from an EBS-backed mounted filesystem.
Hi @AndrewSirenko,
for sure, I will try to answer your follow up questions.
These follow-up answers are very helpful.
This looks more like a "Kubernetes never realized that the volume was mounted" problem than a "volume cannot detach" problem, which narrows down our search. The root cause of the busy state seems to be that the node driver pod never unmounts the volume (NodeUnstageVolume) because the NodeStageVolume RPC was never reported as successful to the kubelet.
Regarding email, mine is "andrewsirenko" at gmail.
Thank you for the AMI, I will use it to try to find a repro.
I'm not sure I can give you an ETA on a fix, but we will look into this this month, especially now that we have a hint as to where the problem lies.
Thank you so much!
@AndrewSirenko sorry, I was away for personal reasons. I just sent over the extracted and sanitized logs for you.
Hello @AndrewSirenko, I forgot about this issue because of other responsibilities, but it is still affecting our Windows operations. Did you have time to look into the topic?
Hi @weretheone, without a local reproduction of the issue on our end we are a bit stuck.
We would like to help, but will need more info from you. Would you be able to do one of the following three options:
Hi @AndrewSirenko
sure, for me options 2 and 3 are fine. I wasn't able to reproduce the issue locally.
Let's have a call. Then, if needed, we can discuss which AWS support plan I would need to cover these types of support activities. I sent my logs from my business e-mail; you can reply there with the call details.
/kind bug
What happened? In a mixed cluster, on a Windows node, a VolumeAttachment gets stuck for hours with a detach error:
We never had this issue on Linux-based nodes, but on Windows nodes, after a certain run time, we face the above issue. Usually it happens after 2-3 days of uptime, even though the daily workload (and the volume of attach-detach operations) is identical every day.
The volume state is "In-use" on AWS UI and this can be found in the logs:
This issue has been with us for years, and we were only able to mitigate it with a daily restart of the Windows machines. That somewhat contains the problem, but it would be nice to resolve the root cause.
What you expected to happen? Volumes get detached correctly.
How to reproduce it (as minimally and precisely as possible)? Configure and start the driver with the option
enableWindows: true
on a mixed cluster with a Windows node. Define the StorageClass and PVC for your workload. We use Argo Workflows to schedule jobs which use the defined PVCs and attach volumes to pods. The pods execute activities and, upon completion, the volumes detach. After a few days of correct execution, the communication between the controller (which runs on a Linux machine) and the ebs-csi-node-windows pod (which runs on a Windows machine) experiences problems, which later result in volumes getting stuck.
Anything else we need to know?: We tried various debug methods and read through a ton of issues connected to gRPC problems, but we were not able to pinpoint the root cause. Debugging is extremely hard because after a node restart it can go for days without a problem, so if you can suggest any debug options, that would be really helpful.
Environment