Volume still hang on Karpenter Node Consolidation/Termination

levanlongktmt commented 8 months ago

/kind bug

What happened? As discussed at https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1665, @torredil said it's fixed in v1.27 (https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1665#issuecomment-1898607940) but we still got problem with v1.28

The pod using volume pv-A running in node N1
Karpenter terminate pod and terminate node N1
K8s start new pod and trying attach volume pv-A but still need to wait 6 minutes to be release and attach to new Pod

What you expected to happen?

After old pod has been terminated, the pv-A should be released and able to attach to new pod

How to reproduce it (as minimally and precisely as possible)?

Im using https://www.elastic.co/guide/en/cloud-on-k8s/current/index.html to setup cluster with 3 nodes

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
name: dev
spec:
version: 8.12.2
volumeClaimDeletePolicy: DeleteOnScaledownAndClusterDeletion
updateStrategy:
changeBudget:
  maxSurge: 2
  maxUnavailable: 1
nodeSets:
- name: default
count: 3
volumeClaimTemplates:
- metadata:
    name: elasticsearch-data # Do not change this name unless you set up a volume mount for the data path.
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 25Gi
podTemplate:
  spec:
    nodeSelector:
      kubernetes.io/arch: arm64
      topology.kubernetes.io/zone: eu-central-1a
    containers:
    - name: elasticsearch
      env:
        - name: ES_JAVA_OPTS
          value: -Xms4g -Xmx4g
      resources:
        requests:
          memory: 5Gi
          cpu: 1
        limits:
          memory: 5Gi
          cpu: 2
config:
  node.store.allow_mmap: false

Trigger spot instance termination or just delete 1 ec2 instance
The node has been removed in k8s very quick, old pod has been Terminated and k8s start new pod
Pod stuck in 6 minutes with error Multi-Attach error for volume "pvc-xxxxx-xxxxx-xxx" Volume is already exclusively attached to one node and can't be attached to another
After 6 minutes new pod can attach volume

Here is logs of ebs-csi-controller


I0302 06:12:10.305080       1 controller.go:430] "ControllerPublishVolume: attached" volumeID="vol-02b33186429105461" nodeID="i-0715ec90e486bb8a1" devicePath="/dev/xvdaa"
<< at 06:14 the node has been terminated but no logs here >>
I0302 06:20:18.486042       1 controller.go:471] "ControllerUnpublishVolume: detaching" volumeID="vol-02b33186429105461" nodeID="i-0715ec90e486bb8a1"
I0302 06:20:18.584737       1 cloud.go:792] "DetachDisk: called on non-attached volume" volumeID="vol-02b33186429105461"
I0302 06:20:18.807752       1 controller.go:474] "ControllerUnpublishVolume: attachment not found" volumeID="vol-02b33186429105461" nodeID="i-0715ec90e486bb8a1"
I0302 06:20:19.124534       1 controller.go:421] "ControllerPublishVolume: attaching" volumeID="vol-02b33186429105461" nodeID="i-0ee2a470112401ffb"
I0302 06:20:20.635493       1 controller.go:430] "ControllerPublishVolume: attached" volumeID="vol-02b33186429105461" nodeID="i-0ee2a470112401ffb" devicePath="/dev/xvdaa"


**Anything else we need to know?**:
I setup csi driver using eks add-on
**Environment**
- Kubernetes version (use `kubectl version`):

Client Version: v1.29.1 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Server Version: v1.29.0-eks-c417bb3


- Driver version: `v1.28.0-eksbuild.1`

levanlongktmt commented 8 months ago

I updated to karpenter 0.35 and using AL2023 image, the problem still happen

levanlongktmt commented 8 months ago

some update: it seemly caused by elasticsearch statefulset, time to terminate es pod is so slow, it's possible that the driver pod be killed before es pod killed then the volume not released

levanlongktmt commented 8 months ago

update: when I set PRE_STOP_ADDITIONAL_WAIT_SECONDS to 5 seconds, new pod can attach pvc normally, so I think the long delay of pod stop is reason for volume stuck

levanlongktmt commented 8 months ago

update: when I set PRE_STOP_ADDITIONAL_WAIT_SECONDS to 5 seconds, new pod can attach pvc normally, so I think the long delay of pod stop is reason for volume stuck

Nope, in another try pv still stuck 🤦

torredil commented 8 months ago

Hey there @levanlongktmt thank you for reporting this.

It appears your spot instances are being ungracefully terminated. Take a look at the documentation for guidance on navigating this issue: 6-Minute Delays in Attaching Volumes - What steps can be taken to mitigate this issue?

If you are still running into delays but have already configured the Kubelet and enabled interruption handling in Karpenter, please let us know. Thanks!

levanlongktmt commented 8 months ago

@torredil ops... my user data missed last line

systemctl restart kubelet

let me try with it

levanlongktmt commented 8 months ago

hey @torredil I tried but still no luck :( I already added SQS to handle interrupt, so when I trigger interupt in aws console, karpenter handle and launch new instance very quickly. I also follow logs of ebs-csi-node but it not show anything related to prestop hook, here is all logs of it

I0304 15:40:38.249081       1 driver.go:83] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.28.0"
I0304 15:40:38.250134       1 node.go:93] "regionFromSession Node service" region=""
I0304 15:40:38.252657       1 metadata.go:85] "retrieving instance data from ec2 metadata"
I0304 15:40:38.257835       1 metadata.go:92] "ec2 metadata is available"
I0304 15:40:38.258488       1 metadata_ec2.go:25] "Retrieving EC2 instance identity metadata" regionFromSession=""
I0304 15:47:29.651847       1 mount_linux.go:243] Detected OS without systemd

after graceful shutdown of elasticsearch pod, the pod ebs-csi-node has been killed immediately. So the prestop has been ignored. I was thought it might because bug of AL2023, so I switch back to use AL2 but still no luck.

levanlongktmt commented 8 months ago

I will try set InhibitDelayMaxSec to 60 (45 + 15)

levanlongktmt commented 8 months ago

Still no luck, volume still be stuck in 6 minutes and prestop hook not working 😭

torredil commented 8 months ago

@levanlongktmt Thanks for the followup info. Something that came to mind here is that it might be possible for newer versions of Karpenter to be incompatible with the pre-stop lifecycle hook due to this change: https://github.com/kubernetes-sigs/karpenter/pull/508.

You'll notice that the LCH won't run if the following taint is not present on the node during a termination event:

node.kubernetes.io/unschedulable

We'll take a closer look and report back with more details or a fix. Thank you!

levanlongktmt commented 8 months ago

@torredil here is logs of karpenter and elasticsearch pod, so as I see

At 16:22:20 karpenter get interupt message
Very quickly karpenter call taint node and pod elasticsearch get shutdown signal
After 46 seconds the pod elasticsearch terminated
At 16:23:09 karpenter start delete node and the node has been deleted very quick

Logs of Karpenter

{"level":"INFO","time":"2024-03-04T16:22:20.270Z","logger":"controller.interruption","message":"initiating delete from interruption message","commit":"2c8f2a5","queue":"Karpenter-kollekt-eks","messageKind":"SpotInterruptionKind","nodeclaim":"default-2bd27","action":"CordonAndDrain","node":"i-0502b920c79f40b5a.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-03-04T16:22:20.336Z","logger":"controller.node.termination","message":"tainted node","commit":"2c8f2a5","node":"i-0502b920c79f40b5a.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-03-04T16:22:21.364Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"2c8f2a5","pods":"datalayer/dev-es-default-0","duration":"24.462475ms"}
{"level":"INFO","time":"2024-03-04T16:22:21.364Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"2c8f2a5","nodeclaims":1,"pods":1}
{"level":"INFO","time":"2024-03-04T16:22:21.379Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"2c8f2a5","nodepool":"default","nodeclaim":"default-pk44s","requests":{"cpu":"680m","memory":"5240Mi","pods":"5"},"instance-types":"a1.2xlarge, a1.4xlarge, a1.metal, a1.xlarge, c6g.12xlarge and 91 other(s)"}
{"level":"INFO","time":"2024-03-04T16:22:23.782Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"2c8f2a5","nodeclaim":"default-pk44s","provider-id":"aws:///eu-central-1a/i-09a52eea14ff0b6c9","instance-type":"r6gd.medium","zone":"eu-central-1a","capacity-type":"spot","allocatable":{"cpu":"940m","ephemeral-storage":"17Gi","memory":"7075Mi","pods":"8","vpc.amazonaws.com/pod-eni":"4"}}
{"level":"INFO","time":"2024-03-04T16:22:31.356Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"2c8f2a5","pods":"datalayer/dev-es-default-0","duration":"16.586479ms"}
{"level":"INFO","time":"2024-03-04T16:22:41.353Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"2c8f2a5","pods":"datalayer/dev-es-default-0","duration":"13.509969ms"}
{"level":"INFO","time":"2024-03-04T16:22:51.355Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"2c8f2a5","pods":"datalayer/dev-es-default-0","duration":"14.534034ms"}
{"level":"INFO","time":"2024-03-04T16:22:58.280Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"2c8f2a5","nodeclaim":"default-pk44s","provider-id":"aws:///eu-central-1a/i-09a52eea14ff0b6c9","node":"i-09a52eea14ff0b6c9.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-03-04T16:23:01.356Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"2c8f2a5","pods":"datalayer/dev-es-default-0","duration":"13.781092ms"}
{"level":"INFO","time":"2024-03-04T16:23:09.571Z","logger":"controller.node.termination","message":"deleted node","commit":"2c8f2a5","node":"i-0502b920c79f40b5a.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-03-04T16:23:09.944Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"2c8f2a5","nodeclaim":"default-2bd27","node":"i-0502b920c79f40b5a.eu-central-1.compute.internal","provider-id":"aws:///eu-central-1a/i-0502b920c79f40b5a"}
{"level":"INFO","time":"2024-03-04T16:23:10.455Z","logger":"controller.disruption","message":"pod \"datalayer/dev-es-default-0\" has a preferred Anti-Affinity which can prevent consolidation","commit":"2c8f2a5"}
{"level":"INFO","time":"2024-03-04T16:23:10.698Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"2c8f2a5","pods":"datalayer/dev-es-default-0","duration":"83.215948ms"}
{"level":"INFO","time":"2024-03-04T16:23:10.731Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"2c8f2a5","nodeclaim":"default-pk44s","provider-id":"aws:///eu-central-1a/i-09a52eea14ff0b6c9","node":"i-09a52eea14ff0b6c9.eu-central-1.compute.internal","allocatable":{"cpu":"940m","ephemeral-storage":"18233774458","hugepages-1Gi":"0","hugepages-2Mi":"0","hugepages-32Mi":"0","hugepages-64Ki":"0","memory":"7552796Ki","pods":"8"}}
{"level":"INFO","time":"2024-03-04T16:23:12.487Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"2c8f2a5","pods":"datalayer/dev-es-default-0","duration":"12.009019ms"}

Logs of elasticsearch pod

{"@timestamp": "2024-03-04T16:22:20+00:00", "message": "retrieving node ID", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-03-04T16:22:21+00:00", "message": "initiating node shutdown", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp":"2024-03-04T16:22:21.338Z", "log.level": "INFO", "message":"creating shutdown record {nodeId=[qElhr0BPQnufjz5_CdDGMg], type=[RESTART], reason=[pre-stop hook]}", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[dev-es-default-0][masterService#updateTask][T#2]","log.logger":"org.elasticsearch.xpack.shutdown.TransportPutShutdownNodeAction","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:22:21.530Z", "log.level": "INFO", "message":"Aborting health node task due to node [{dev-es-default-0}{qElhr0BPQnufjz5_CdDGMg}] shutting down.", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[dev-es-default-0][clusterApplierService#updateTask][T#1]","log.logger":"org.elasticsearch.health.node.selection.HealthNodeTaskExecutor","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:22:21.535Z", "log.level": "INFO", "message":"Starting node shutdown sequence for ML", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[dev-es-default-0][clusterApplierService#updateTask][T#1]","log.logger":"org.elasticsearch.xpack.ml.MlLifeCycleService","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp": "2024-03-04T16:22:21+00:00", "message": "waiting for node shutdown to complete", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-03-04T16:22:21+00:00", "message": "delaying termination for 44 seconds", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp":"2024-03-04T16:23:05.782Z", "log.level": "INFO", "message":"stopping ...", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.node.Node","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:05.783Z", "log.level": "INFO", "message":"shutting down watcher thread", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[file-watcher[/usr/share/elasticsearch/config/operator/settings.json]]","log.logger":"org.elasticsearch.common.file.AbstractFileWatchingService","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:05.785Z", "log.level": "INFO", "message":"watcher service stopped", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.common.file.AbstractFileWatchingService","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:05.790Z", "log.level": "INFO", "message":"[controller/83] [Main.cc@176] ML controller exiting", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"ml-cpp-log-tail-thread","log.logger":"org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:05.792Z", "log.level": "INFO", "message":"Native controller process has stopped - no new native processes can be started", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"ml-cpp-log-tail-thread","log.logger":"org.elasticsearch.xpack.ml.process.NativeController","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:05.793Z", "log.level": "INFO", "message":"stopping watch service, reason [shutdown initiated]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.xpack.watcher.WatcherService","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:05.794Z", "log.level": "INFO", "message":"watcher has stopped and shutdown", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[dev-es-default-0][generic][T#3]","log.logger":"org.elasticsearch.xpack.watcher.WatcherLifeCycleService","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"timestamp": "2024-03-04T16:23:06+00:00", "message": "readiness probe failed", "curl_rc": "7"}
{"@timestamp":"2024-03-04T16:23:06.552Z", "log.level": "INFO", "message":"stopped", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.node.Node","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:06.554Z", "log.level": "INFO", "message":"closing ...", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.node.Node","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:06.576Z", "log.level": "INFO", "message":"evicted [0] entries from cache after reloading database [/tmp/elasticsearch-3976228296045036888/geoip-databases/qElhr0BPQnufjz5_CdDGMg/GeoLite2-Country.mmdb]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.ingest.geoip.DatabaseReaderLazyLoader","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:06.577Z", "log.level": "INFO", "message":"evicted [0] entries from cache after reloading database [/tmp/elasticsearch-3976228296045036888/geoip-databases/qElhr0BPQnufjz5_CdDGMg/GeoLite2-ASN.mmdb]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.ingest.geoip.DatabaseReaderLazyLoader","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:06.578Z", "log.level": "INFO", "message":"evicted [25] entries from cache after reloading database [/tmp/elasticsearch-3976228296045036888/geoip-databases/qElhr0BPQnufjz5_CdDGMg/GeoLite2-City.mmdb]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.ingest.geoip.DatabaseReaderLazyLoader","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:06.586Z", "log.level": "INFO", "message":"closed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.node.Node","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}

levanlongktmt commented 8 months ago

@torredil do you think this quick and dirty fix will works 😆?

levanlongktmt commented 8 months ago

@torredil seemly k8s not call preStop, because if it call then atleast I will see log PreStop: executing PreStop lifecycle hook, but I didn't see it

levanlongktmt commented 8 months ago

@torredil any good news for this 😀?

ConnorJC3 commented 7 months ago

Hi, sorry about the wait - your issue is probably caused by what #1969 solves - Karpenter changed the taints they used when draining nodes and our LCH needs to be changed to account for it. That fix should be available in the next release of the EBS CSI Driver, expected to happen later this month.

levanlongktmt commented 7 months ago

Amazing, thank so much @ConnorJC3 😍 I will test again when new version released

primeroz commented 7 months ago

@ConnorJC3 I was trying to understand from which version of karpenter this was changed , i was about to upgrade my csi driver to address the fix but i guess i will wait for the next release.

I see https://github.com/kubernetes-sigs/karpenter/pull/508 was merged in October :thinking:

levanlongktmt commented 7 months ago

@primeroz here is list of karpenter versions affected

alexandermarston commented 7 months ago

v1.29.0 has been released which contains the fix.

primeroz commented 7 months ago

I just upgraded my dev cluster to 1.29.0 while using karpenter-aws 0.35.2 but still have the same multi attach problem as before

Have stateful workload running with PVC Mounted
Delete a node with kubectl delete node
- karpenter intercepts through finalizer and sets taint
effect: NoSchedule key: karpenter.sh/disruption value: disrupting
Pod is evicted - 26s Normal Evicted pod/c-pod-0 Evicted pod
new Node is created by karpenter and pod is moved to it
Multi-Attach error for volume "pvc-9b50cf31-5a24-4783-8039-a362ad2a7c0d" Volume is already exclusively attached to one node and can't be attached to another for 6+ minutes until timeout
pod starts on new node

alexandermarston commented 7 months ago

@primeroz do you have the logs from the driver, when your node was disrupted? (also, would be useful for you to increase your log level while collecting these)

primeroz commented 7 months ago

@alexandermarston there are no events whatsoever when i delete the node in the ebs csi pod running on that node

I can see the prestop is in the pod spec ... so i guess when we delete a node rather than replace it through karpenter disruption the prestop on the ebs node is never run ?

Note in my case i am deleting the node to get it replaced by karpenter.

+ ebs-csi-node-rj6qg › node-driver-registrar
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.381506       1 main.go:135] Version: v2.10.0
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.381558       1 main.go:136] Running node-driver-registrar in mode=
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.381565       1 main.go:157] Attempting to open a gRPC connection with: "/csi/csi.sock"
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.382280       1 main.go:164] Calling CSI driver to discover driver name
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.385954       1 main.go:173] CSI driver name: "ebs.csi.aws.com"
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.385996       1 node_register.go:55] Starting Registration Server at: /registration/ebs.csi.aws.com-reg.sock
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.386143       1 node_register.go:64] Registration Server started at: /registration/ebs.csi.aws.com-reg.sock
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.386255       1 node_register.go:88] Skipping HTTP server because endpoint is set to: ""
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.832022       1 main.go:90] Received GetInfo call: &InfoRequest{}
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.857979       1 main.go:101] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
ebs-csi-node-rj6qg liveness-probe I0322 14:33:44.130483       1 main.go:133] "Calling CSI driver to discover driver name"
ebs-csi-node-rj6qg ebs-plugin {"ts":1711118022623.64,"caller":"driver/driver.go:84","msg":"Driver Information","v":0,"Driver":"ebs.csi.aws.com","Version":"v1.29.0"}
ebs-csi-node-rj6qg ebs-plugin {"ts":1711118022623.6697,"caller":"driver/node.go:97","msg":"regionFromSession Node service","v":0,"region":""}
ebs-csi-node-rj6qg ebs-plugin {"ts":1711118022623.6895,"caller":"cloud/metadata.go:85","msg":"retrieving instance data from ec2 metadata","v":0}
ebs-csi-node-rj6qg liveness-probe I0322 14:33:44.133455       1 main.go:141] "CSI driver name" driver="ebs.csi.aws.com"
ebs-csi-node-rj6qg liveness-probe I0322 14:33:44.133475       1 main.go:170] "ServeMux listening" address="0.0.0.0:9808"
ebs-csi-node-rj6qg ebs-plugin {"ts":1711118022625.4724,"caller":"cloud/metadata.go:92","msg":"ec2 metadata is available","v":0}
ebs-csi-node-rj6qg ebs-plugin {"ts":1711118022626.2078,"caller":"cloud/metadata_ec2.go:25","msg":"Retrieving EC2 instance identity metadata","v":0,"regionFromSession":""}
ebs-csi-node-rj6qg ebs-plugin {"ts":1711118023644.7207,"caller":"driver/node.go:875","msg":"Unexpected failure when attempting to remove node taint(s)","err":"isAllocatableSet: driver not found on node ip-10-1-5-195.us-west-2.compute.internal"}
ebs-csi-node-rj6qg ebs-plugin {"ts":1711118024155.2048,"caller":"driver/node.go:961","msg":"CSINode Allocatable value is set","v":0,"nodeName":"ip-10-1-5-195.us-west-2.compute.internal","count":25}
ebs-csi-node-rj6qg ebs-plugin {"ts":1711118326119.883,"caller":"mount-utils@v0.29.2/mount_linux.go:243","msg":"Detected OS without systemd","v":2}

### Delete node here 

- ebs-csi-node-rj6qg › ebs-plugin
- ebs-csi-node-rj6qg › liveness-probe
+ ebs-csi-node-rj6qg › liveness-probe
- ebs-csi-node-rj6qg › node-driver-registrar
+ ebs-csi-node-rj6qg › ebs-plugin
+ ebs-csi-node-rj6qg › node-driver-registrar
- ebs-csi-node-rj6qg › ebs-plugin
- ebs-csi-node-rj6qg › node-driver-registrar
- ebs-csi-node-rj6qg › liveness-probe

ConnorJC3 commented 7 months ago

@primeroz what exactly do you mean by "delete a node"? Assuming you mean something like deleting the node via the AWS Console - that sounds like ungraceful termination, which will not run pre-stop hooks as it results in an immediate and uninterruptible shutdown.

primeroz commented 7 months ago

@ConnorJC3 I mean kubectl delete node XXX which is something karpenter supports and manages through finalizer, it goes through the same steps as a karpenter managed disruption - see https://karpenter.sh/docs/getting-started/getting-started-with-karpenter/#8-delete-karpenter-nodes-manually

Anyway i am testing now with a karpenter disruption by adding an annotation to nodepool, which will trigger a full replacement of all nodes created by that nodepool since the specs have changed. same multi-attach error happens

collecting logs on next node termination

alexandermarston commented 7 months ago

OK, I've been able to test this by manually tainting a node and then running the preStop hook with:

k exec ebs-csi-node-l7zzf -- "/bin/aws-ebs-csi-driver" "pre-stop-hook"

The preStop hook is doing everything it should, from what I understand.

Again, from my limited understanding, I imagine the issue is that your service (which is using the volume) is taking longer than the terminationGracePeriodSeconds of the EBS CSI driver to shutdown and release the volume. If your service takes longer than 30 seconds to terminate, then the EBS driver will never have a chance to do its work.

You could either try increasing the terminationGracePeriod of the EBS CSI Driver or lower the grace period of your service.

primeroz commented 7 months ago

If your service takes longer than 30 seconds to terminate, then the EBS driver will never have a chance to do its work.

You could either try increasing the terminationGracePeriod of the EBS CSI Driver or lower the grace period of your service.

This is entirely possible, i am testing indeed with a service with an higher than normal grace period

I will test this :pray:

primeroz commented 7 months ago

@alexandermarston

Increasing the terminationGracePeriod: 3700 on the ebs node daemonset did not help . triggered an update of nodes in ndoepool by updating an annotation but still failing with multi-attach errors

Also what's interesting, to me at least, is that in the logs of the ebs csi node i never see klog.InfoS("PreStop: executing PreStop lifecycle hook") so i guess that when karpenter replaces a node it does never trigger a termination of the daemonsets running on it ?

I know for sure that karpenter does not drain the node in the same way that kubectl drain does. I can see that the non daemonset pods are being evicted one by one but I see nothing on the ebs nodes

I have been watching the ebs-csi-node pod , i have been tailing events for both the ebs-csi-node pod ... nothing. it seems like once the non daemonsets pods are all evicted the node just goes away and so all the daemonsets pods on it :thinking:

alexandermarston commented 7 months ago

You won't see the logs from the preStop hook handler, as they are not logged in the same way normal pod stdout is. You will be able to see if the preStop hook failed in the Kubernetes Events though.

Can you try deleting the node again, but manually execute the preStop hook and share the output?

k exec <EBS-CSI-POD-NAME> -- "/bin/aws-ebs-csi-driver" "pre-stop-hook"

ConnorJC3 commented 7 months ago

You won't see the logs from the preStop hook handler, as they are not logged in the same way normal pod stdout is.

To be specific, Kubernetes logs output from pre-stop hooks to the kubelet log - so you can look there to determine if the hook is running (although this can be difficult on a terminated node if you aren't streaming the logs anywhere).

@alexandermarston 's suggestion above is a good idea to check that the hook is able to run and connect to the Kubernetes API server on your cluster.

primeroz commented 7 months ago

@ConnorJC3 @alexandermarston thanks for the input. i thought all prestop logs would go to events that's why i was watching them

I will do both of this things, given i have only 3 nodes with 3 pods i will

on the first one run k exec <EBS-CSI-POD-NAME> -- "/bin/aws-ebs-csi-driver" "pre-stop-hook" as soon as the statefulset pod for my workload on that node goes into terminating state
on the other 2 i will tail the kubelet logs to check what happens

and report back here

levanlongktmt commented 7 months ago

my cluster using eks addon to setup ebs driver and the addon still in v1.28 😂 @ConnorJC3 is it safe if I just delete add on then re install via helm chart?

primeroz commented 7 months ago

k exec gave me mixed results.

the output looked like it did not have any error nor any problem talking to apiserver

➜ k exec -n kube-system ebs-csi-node-jzpf5  -- /bin/aws-ebs-csi-driver pre-stop-hook                                                                                                                                                            
Defaulted container "ebs-plugin" out of: ebs-plugin, node-driver-registrar, liveness-probe                                                                                                                                                      
I0322 16:11:40.620309      35 prestop.go:32] "PreStop: executing PreStop lifecycle hook"                                                                                                                                                        
I0322 16:11:40.631461      35 prestop.go:45] "PreStop: node is being drained, checking for remaining VolumeAttachments" node="ip-10-1-3-134.us-west-2.compute.internal"                                                                         
I0322 16:11:40.649793      35 prestop.go:123] "isVolumeAttachmentEmpty: not ready to exit, found VolumeAttachment" attachment={"metadata":{"name":"csi-5ca3c6bf5a8925b2895507cbf7ffd7b4607324e09ef8a8ae2baf300a6eab3511","uid":"aee7ad0f-fa95-49
42-9093-3ff2ff64c927","resourceVersion":"800979653","creationTimestamp":"2024-03-22T16:04:23Z","annotations":{"csi.alpha.kubernetes.io/node-id":"i-0a4477e4eb2332f3f"},"finalizers":["external-attacher/ebs-csi-aws-com"],"managedFields":[{"man
ager":"csi-attacher","operation":"Update","apiVersion":"storage.k8s.io/v1","time":"2024-03-22T16:04:23Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:csi.alpha.kubernetes.io/node-id":{}},"f:finalizers":{".":{
},"v:\"external-attacher/ebs-csi-aws-com\"":{}}}}},{"manager":"kube-controller-manager","operation":"Update","apiVersion":"storage.k8s.io/v1","time":"2024-03-22T16:04:23Z","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{"f:attacher":{},"f:nod
eName":{},"f:source":{"f:persistentVolumeName":{}}}}},{"manager":"csi-attacher","operation":"Update","apiVersion":"storage.k8s.io/v1","time":"2024-03-22T16:04:25Z","fieldsType":"FieldsV1","fieldsV1":{"f:status":{"f:attached":{},"f:attachmen
tMetadata":{".":{},"f:devicePath":{}}}},"subresource":"status"}]},"spec":{"attacher":"ebs.csi.aws.com","source":{"persistentVolumeName":"pvc-15457627-f532-41ca-b87b-a961ef7b745c"},"nodeName":"ip-10-1-3-134.us-west-2.compute.internal"},"stat
us":{"attached":true,"attachmentMetadata":{"devicePath":"/dev/xvdaa"}}} node="ip-10-1-3-134.us-west-2.compute.internal"                                                                                                                         
I0322 16:11:58.414556      35 prestop.go:123] "isVolumeAttachmentEmpty: not ready to exit, found VolumeAttachment" attachment={"metadata":{"name":"csi-5ca3c6bf5a8925b2895507cbf7ffd7b4607324e09ef8a8ae2baf300a6eab3511","uid":"aee7ad0f-fa95-49
42-9093-3ff2ff64c927","resourceVersion":"800988339","creationTimestamp":"2024-03-22T16:04:23Z","deletionTimestamp":"2024-03-22T16:11:58Z","deletionGracePeriodSeconds":0,"annotations":{"csi.alpha.kubernetes.io/node-id":"i-0a4477e4eb2332f3f"}
,"finalizers":["external-attacher/ebs-csi-aws-com"],"managedFields":[{"manager":"csi-attacher","operation":"Update","apiVersion":"storage.k8s.io/v1","time":"2024-03-22T16:04:23Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotati
ons":{".":{},"f:csi.alpha.kubernetes.io/node-id":{}},"f:finalizers":{".":{},"v:\"external-attacher/ebs-csi-aws-com\"":{}}}}},{"manager":"kube-controller-manager","operation":"Update","apiVersion":"storage.k8s.io/v1","time":"2024-03-22T16:04
:23Z","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{"f:attacher":{},"f:nodeName":{},"f:source":{"f:persistentVolumeName":{}}}}},{"manager":"csi-attacher","operation":"Update","apiVersion":"storage.k8s.io/v1","time":"2024-03-22T16:04:25Z","f
ieldsType":"FieldsV1","fieldsV1":{"f:status":{"f:attached":{},"f:attachmentMetadata":{".":{},"f:devicePath":{}}}},"subresource":"status"}]},"spec":{"attacher":"ebs.csi.aws.com","source":{"persistentVolumeName":"pvc-15457627-f532-41ca-b87b-a
961ef7b745c"},"nodeName":"ip-10-1-3-134.us-west-2.compute.internal"},"status":{"attached":true,"attachmentMetadata":{"devicePath":"/dev/xvdaa"}}} node="ip-10-1-3-134.us-west-2.compute.internal"

in 1 test there was no multi-attach error but in next 2 i tried there was multi-attach error

Not sure if it makes any different but in the successful test the loop run twice and the not ready to exit was indeed shown twice. In the failed tests it only run once than command returned.

Watching kubelet logs on the AWS Console through Session Manager ( this is the full log from the moment the node starts to get drained by karpenter , i only remove the Probe Failed lines and some lines about a broken daemonset pods failing to start)

multi attach error happened during this test - Multi-Attach error for volume "pvc-15457627-f532-41ca-b87b-a961ef7b745c" Volume is already exclusively attached to one node and can't be attached to another and the volumeattachment still pointed to the old node

I don't see any log from the ebs prestop hook in the kubelet logs ( added the grepped logs for ebs at the bottom ) I don't see the ebs pod go into terminate ever

[root@ip-10-1-1-116 ~]# journalctl -u kubelet -f
...
...

Mar 22 16:30:58 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:30:58.615431    1904 scope.go:115] "RemoveContainer" containerID="8f79c502f8bcc0b7aabbead15947e3f3777d2400dd5fe6fd869fd2d5597995ee"
Mar 22 16:31:06 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:06.752227    1904 kubelet.go:2224] "SyncLoop DELETE" source="api" pods="[ns-yellow-ig-25/c-yellow-ig-25-server-43l8h6r-0]"
Mar 22 16:31:11 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:11.616133    1904 scope.go:115] "RemoveContainer" containerID="8f79c502f8bcc0b7aabbead15947e3f3777d2400dd5fe6fd869fd2d5597995ee"
Mar 22 16:31:12 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:12.617083    1904 kubelet.go:2312] "SyncLoop (probe)" probe="readiness" status="" pod="ns-yellow-ig-25/c-yellow-ig-25-server-43l8h6r-0"
Mar 22 16:31:22 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:22.615312    1904 scope.go:115] "RemoveContainer" containerID="8f79c502f8bcc0b7aabbead15947e3f3777d2400dd5fe6fd869fd2d5597995ee"
Mar 22 16:31:25 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:25.363057    1904 kuberuntime_container.go:709] "Killing container with a grace period" pod="ns-yellow-ig-25/c-yellow-ig-25-server-43l8h6r-0" podUID=3be54537-a4eb-4821-b6b6-83b18151da79 containerName="scraper-container" containerID="containerd://308e34c6dde2613473579be9a50bb4d9e890ebe63bdf0a9b10ee0eeab35e04ee" gracePeriod=3600
Mar 22 16:31:25 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:25.499261    1904 generic.go:332] "Generic (PLEG): container finished" podID=3be54537-a4eb-4821-b6b6-83b18151da79 containerID="308e34c6dde2613473579be9a50bb4d9e890ebe63bdf0a9b10ee0eeab35e04ee" exitCode=2
Mar 22 16:31:25 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:25.499301    1904 kubelet.go:2240] "SyncLoop (PLEG): event for pod" pod="ns-yellow-ig-25/c-yellow-ig-25-server-43l8h6r-0" event=&{ID:3be54537-a4eb-4821-b6b6-83b18151da79 Type:ContainerDied Data:308e34c6dde2613473579be9a50bb4d9e890ebe63bdf0a9b10ee0eeab35e04ee}
Mar 22 16:31:37 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:37.615826    1904 scope.go:115] "RemoveContainer" containerID="8f79c502f8bcc0b7aabbead15947e3f3777d2400dd5fe6fd869fd2d5597995ee"
Mar 22 16:31:37 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:37.904856    1904 kuberuntime_container.go:709] "Killing container with a grace period" pod="ns-yellow-ig-25/c-yellow-ig-25-server-43l8h6r-0" podUID=3be54537-a4eb-4821-b6b6-83b18151da79 containerName="c-yellow-ig-25-server" containerID="containerd://fabf4f8b93ecdb567b7483574643152cff342d04cfee093e95b95e5c345366b6" gracePeriod=3600
Mar 22 16:31:38 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:38.521327    1904 generic.go:332] "Generic (PLEG): container finished" podID=3be54537-a4eb-4821-b6b6-83b18151da79 containerID="fabf4f8b93ecdb567b7483574643152cff342d04cfee093e95b95e5c345366b6" exitCode=0
Mar 22 16:31:38 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:38.521370    1904 kubelet.go:2240] "SyncLoop (PLEG): event for pod" pod="ns-yellow-ig-25/c-yellow-ig-25-server-43l8h6r-0" event=&{ID:3be54537-a4eb-4821-b6b6-83b18151da79 Type:ContainerDied Data:fabf4f8b93ecdb567b7483574643152cff342d04cfee093e95b95e5c345366b6}
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.254006    1904 reconciler_common.go:169] "operationExecutor.UnmountVolume started for volume \"clickhouse-server-users-d-volume\" (UniqueName: \"kubernetes.io/configmap/3be54537-a4eb-4821-b6b6-83b18151da79-clickhouse-server-users-d-volume\") pod \"3be54537-a4eb-4821-b6b6-83b18151da79\" (UID: \"3be54537-a4eb-4821-b6b6-83b18151da79\") "
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.254053    1904 reconciler_common.go:169] "operationExecutor.UnmountVolume started for volume \"clickhouse-server-certificates-volume\" (UniqueName: \"kubernetes.io/secret/3be54537-a4eb-4821-b6b6-83b18151da79-clickhouse-server-certificates-volume\") pod \"3be54537-a4eb-4821-b6b6-83b18151da79\" (UID: \"3be54537-a4eb-4821-b6b6-83b18151da79\") "
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.254134    1904 reconciler_common.go:169] "operationExecutor.UnmountVolume started for volume \"ch-storage-volume\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0a500e7456c1b7b6a\") pod \"3be54537-a4eb-4821-b6b6-83b18151da79\" (UID: \"3be54537-a4eb-4821-b6b6-83b18151da79\") "
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.254161    1904 reconciler_common.go:169] "operationExecutor.UnmountVolume started for volume \"kube-api-access-gg5x8\" (UniqueName: \"kubernetes.io/projected/3be54537-a4eb-4821-b6b6-83b18151da79-kube-api-access-gg5x8\") pod \"3be54537-a4eb-4821-b6b6-83b18151da79\" (UID: \"3be54537-a4eb-4821-b6b6-83b18151da79\") "
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.254185    1904 reconciler_common.go:169] "operationExecutor.UnmountVolume started for volume \"aws-iam-token\" (UniqueName: \"kubernetes.io/projected/3be54537-a4eb-4821-b6b6-83b18151da79-aws-iam-token\") pod \"3be54537-a4eb-4821-b6b6-83b18151da79\" (UID: \"3be54537-a4eb-4821-b6b6-83b18151da79\") "
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.254209    1904 reconciler_common.go:169] "operationExecutor.UnmountVolume started for volume \"ch-cache-volume\" (UniqueName: \"kubernetes.io/host-path/3be54537-a4eb-4821-b6b6-83b18151da79-ch-cache-volume\") pod \"3be54537-a4eb-4821-b6b6-83b18151da79\" (UID: \"3be54537-a4eb-4821-b6b6-83b18151da79\") "
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: W0322 16:31:39.254209    1904 empty_dir.go:525] Warning: Failed to clear quota on /var/lib/kubelet/pods/3be54537-a4eb-4821-b6b6-83b18151da79/volumes/kubernetes.io~configmap/clickhouse-server-users-d-volume: clearQuota called, but quotas disabled
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.254232    1904 reconciler_common.go:169] "operationExecutor.UnmountVolume started for volume \"clickhouse-server-scripts-volume\" (UniqueName: \"kubernetes.io/configmap/3be54537-a4eb-4821-b6b6-83b18151da79-clickhouse-server-scripts-volume\") pod \"3be54537-a4eb-4821-b6b6-83b18151da79\" (UID: \"3be54537-a4eb-4821-b6b6-83b18151da79\") "
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.254284    1904 operation_generator.go:904] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/3be54537-a4eb-4821-b6b6-83b18151da79-ch-cache-volume" (OuterVolumeSpecName: "ch-cache-volume") pod "3be54537-a4eb-4821-b6b6-83b18151da79" (UID: "3be54537-a4eb-4821-b6b6-83b18151da79"). InnerVolumeSpecName "ch-cache-volume". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.254352    1904 reconciler_common.go:169] "operationExecutor.UnmountVolume started for volume \"clickhouse-server-config-d-volume\" (UniqueName: \"kubernetes.io/configmap/3be54537-a4eb-4821-b6b6-83b18151da79-clickhouse-server-config-d-volume\") pod \"3be54537-a4eb-4821-b6b6-83b18151da79\"(UID: \"3be54537-a4eb-4821-b6b6-83b18151da79\") "
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: W0322 16:31:39.254345    1904 empty_dir.go:525] Warning: Failed to clear quota on /var/lib/kubelet/pods/3be54537-a4eb-4821-b6b6-83b18151da79/volumes/kubernetes.io~configmap/clickhouse-server-scripts-volume: clearQuota called, but quotas disabled
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.254396    1904 reconciler_common.go:169] "operationExecutor.UnmountVolume started for volume \"podinfo\" (UniqueName: \"kubernetes.io/downward-api/3be54537-a4eb-4821-b6b6-83b18151da79-podinfo\") pod \"3be54537-a4eb-4821-b6b6-83b18151da79\" (UID: \"3be54537-a4eb-4821-b6b6-83b18151da79\") "
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.254404    1904 operation_generator.go:904] UnmountVolume.TearDown succeeded for volume "kubernetes.io/configmap/3be54537-a4eb-4821-b6b6-83b18151da79-clickhouse-server-users-d-volume" (OuterVolumeSpecName: "clickhouse-server-users-d-volume") pod "3be54537-a4eb-4821-b6b6-83b18151da79" (UID: "3be54537-a4eb-4821-b6b6-83b18151da79"). InnerVolumeSpecName "clickhouse-server-users-d-volume". PluginName "kubernetes.io/configmap", VolumeGidValue ""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: W0322 16:31:39.254447    1904 empty_dir.go:525] Warning: Failed to clear quota on /var/lib/kubelet/pods/3be54537-a4eb-4821-b6b6-83b18151da79/volumes/kubernetes.io~configmap/clickhouse-server-config-d-volume: clearQuota called, but quotas disabled
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.254481    1904 reconciler_common.go:295] "Volume detached for volume \"clickhouse-server-users-d-volume\" (UniqueName: \"kubernetes.io/configmap/3be54537-a4eb-4821-b6b6-83b18151da79-clickhouse-server-users-d-volume\") on node \"ip-10-1-1-116.us-west-2.compute.internal\" DevicePath \"\""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.254511    1904 reconciler_common.go:295] "Volume detached for volume \"ch-cache-volume\" (UniqueName: \"kubernetes.io/host-path/3be54537-a4eb-4821-b6b6-83b18151da79-ch-cache-volume\") on node \"ip-10-1-1-116.us-west-2.compute.internal\" DevicePath \"\""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.254540    1904 operation_generator.go:904] UnmountVolume.TearDown succeeded for volume "kubernetes.io/configmap/3be54537-a4eb-4821-b6b6-83b18151da79-clickhouse-server-scripts-volume" (OuterVolumeSpecName: "clickhouse-server-scripts-volume") pod "3be54537-a4eb-4821-b6b6-83b18151da79" (UID: "3be54537-a4eb-4821-b6b6-83b18151da79"). InnerVolumeSpecName "clickhouse-server-scripts-volume". PluginName "kubernetes.io/configmap", VolumeGidValue ""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.255008    1904 operation_generator.go:904] UnmountVolume.TearDown succeeded for volume "kubernetes.io/configmap/3be54537-a4eb-4821-b6b6-83b18151da79-clickhouse-server-config-d-volume" (OuterVolumeSpecName: "clickhouse-server-config-d-volume") pod "3be54537-a4eb-4821-b6b6-83b18151da79" (UID: "3be54537-a4eb-4821-b6b6-83b18151da79"). InnerVolumeSpecName "clickhouse-server-config-d-volume". PluginName "kubernetes.io/configmap", VolumeGidValue ""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.255796    1904 operation_generator.go:904] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/3be54537-a4eb-4821-b6b6-83b18151da79-kube-api-access-gg5x8" (OuterVolumeSpecName: "kube-api-access-gg5x8") pod "3be54537-a4eb-4821-b6b6-83b18151da79" (UID: "3be54537-a4eb-4821-b6b6-83b18151da79"). InnerVolumeSpecName "kube-api-access-gg5x8". PluginName "kubernetes.io/projected", VolumeGidValue ""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.255805    1904 operation_generator.go:904] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/3be54537-a4eb-4821-b6b6-83b18151da79-clickhouse-server-certificates-volume" (OuterVolumeSpecName: "clickhouse-server-certificates-volume") pod "3be54537-a4eb-4821-b6b6-83b18151da79" (UID: "3be54537-a4eb-4821-b6b6-83b18151da79"). InnerVolumeSpecName "clickhouse-server-certificates-volume". PluginName "kubernetes.io/secret", VolumeGidValue ""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.255869    1904 operation_generator.go:904] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/3be54537-a4eb-4821-b6b6-83b18151da79-aws-iam-token" (OuterVolumeSpecName: "aws-iam-token") pod "3be54537-a4eb-4821-b6b6-83b18151da79" (UID: "3be54537-a4eb-4821-b6b6-83b18151da79"). InnerVolumeSpecName "aws-iam-token". PluginName "kubernetes.io/projected", VolumeGidValue ""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.255896    1904 operation_generator.go:904] UnmountVolume.TearDown succeeded for volume "kubernetes.io/downward-api/3be54537-a4eb-4821-b6b6-83b18151da79-podinfo" (OuterVolumeSpecName: "podinfo") pod "3be54537-a4eb-4821-b6b6-83b18151da79" (UID: "3be54537-a4eb-4821-b6b6-83b18151da79"). InnerVolumeSpecName "podinfo". PluginName "kubernetes.io/downward-api", VolumeGidValue ""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.258987    1904 operation_generator.go:904] UnmountVolume.TearDown succeeded for volume "kubernetes.io/csi/ebs.csi.aws.com^vol-0a500e7456c1b7b6a" (OuterVolumeSpecName: "ch-storage-volume") pod "3be54537-a4eb-4821-b6b6-83b18151da79" (UID: "3be54537-a4eb-4821-b6b6-83b18151da79"). InnerVolumeSpecName "pvc-15457627-f532-41ca-b87b-a961ef7b745c". PluginName "kubernetes.io/csi", VolumeGidValue ""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.355390    1904 reconciler_common.go:288] "operationExecutor.UnmountDevice started for volume \"pvc-15457627-f532-41ca-b87b-a961ef7b745c\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0a500e7456c1b7b6a\") on node \"ip-10-1-1-116.us-west-2.compute.internal\" "
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.355420    1904 reconciler_common.go:295] "Volume detached for volume \"kube-api-access-gg5x8\" (UniqueName: \"kubernetes.io/projected/3be54537-a4eb-4821-b6b6-83b18151da79-kube-api-access-gg5x8\") on node \"ip-10-1-1-116.us-west-2.compute.internal\" DevicePath \"\""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.355435    1904 reconciler_common.go:295] "Volume detached for volume \"aws-iam-token\" (UniqueName: \"kubernetes.io/projected/3be54537-a4eb-4821-b6b6-83b18151da79-aws-iam-token\") on node \"ip-10-1-1-116.us-west-2.compute.internal\" DevicePath \"\""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.355451    1904 reconciler_common.go:295] "Volume detached for volume \"clickhouse-server-scripts-volume\" (UniqueName: \"kubernetes.io/configmap/3be54537-a4eb-4821-b6b6-83b18151da79-clickhouse-server-scripts-volume\") on node \"ip-10-1-1-116.us-west-2.compute.internal\" DevicePath \"\""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.355465    1904 reconciler_common.go:295] "Volume detached for volume \"clickhouse-server-config-d-volume\" (UniqueName: \"kubernetes.io/configmap/3be54537-a4eb-4821-b6b6-83b18151da79-clickhouse-server-config-d-volume\") on node \"ip-10-1-1-116.us-west-2.compute.internal\" DevicePath \"\""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.355478    1904 reconciler_common.go:295] "Volume detached for volume \"podinfo\" (UniqueName: \"kubernetes.io/downward-api/3be54537-a4eb-4821-b6b6-83b18151da79-podinfo\") on node \"ip-10-1-1-116.us-west-2.compute.internal\" DevicePath \"\""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.355490    1904 reconciler_common.go:295] "Volume detached for volume \"clickhouse-server-certificates-volume\" (UniqueName: \"kubernetes.io/secret/3be54537-a4eb-4821-b6b6-83b18151da79-clickhouse-server-certificates-volume\") on node \"ip-10-1-1-116.us-west-2.compute.internal\" DevicePath\"\""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.375856    1904 operation_generator.go:1018] UnmountDevice succeeded for volume "pvc-15457627-f532-41ca-b87b-a961ef7b745c" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-0a500e7456c1b7b6a") on node "ip-10-1-1-116.us-west-2.compute.internal"
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.456346    1904 reconciler_common.go:295] "Volume detached for volume \"pvc-15457627-f532-41ca-b87b-a961ef7b745c\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0a500e7456c1b7b6a\") on node \"ip-10-1-1-116.us-west-2.compute.internal\" DevicePath \"csi-8dcd123839e58f5246dc005b28f5783c0c42a4f2d3fc6411bdcae42d51fb5ebd\""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.524049    1904 kubelet.go:2240] "SyncLoop (PLEG): event for pod" pod="ns-yellow-ig-25/c-yellow-ig-25-server-43l8h6r-0" event=&{ID:3be54537-a4eb-4821-b6b6-83b18151da79 Type:ContainerDied Data:1df7f09be93bcfbfcd6e55d5056f31120a16f89c65a1dfa2eec595b01c90472b}
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.524091    1904 scope.go:115] "RemoveContainer" containerID="fabf4f8b93ecdb567b7483574643152cff342d04cfee093e95b95e5c345366b6"
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.529616    1904 scope.go:115] "RemoveContainer" containerID="308e34c6dde2613473579be9a50bb4d9e890ebe63bdf0a9b10ee0eeab35e04ee"
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.559205    1904 kubelet.go:2224] "SyncLoop DELETE" source="api" pods="[ns-yellow-ig-25/c-yellow-ig-25-server-43l8h6r-0]"
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.573266    1904 kubelet.go:2218] "SyncLoop REMOVE" source="api" pods="[ns-yellow-ig-25/c-yellow-ig-25-server-43l8h6r-0]"
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.618419    1904 kubelet_volumes.go:160] "Cleaned up orphaned pod volumes dir" podUID=3be54537-a4eb-4821-b6b6-83b18151da79 path="/var/lib/kubelet/pods/3be54537-a4eb-4821-b6b6-83b18151da79/volumes"
Hangup
sh-4.2$
sh-4.2$
sh-4.2$

 λ \cat /dev/shm/logs.txt | grep -i ebs    
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.254134    1904 reconciler_common.go:169] "operationExecutor.UnmountVolume started for volume \"ch-storage-volume\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0a500e7456c1b7b6a\") pod \"3be54537-a4eb-4821-b6b6-83b18151da79\" (UID: \"3be54537-a4eb-4821-b6b6-83b18151da79\") "
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.258987    1904 operation_generator.go:904] UnmountVolume.TearDown succeeded for volume "kubernetes.io/csi/ebs.csi.aws.com^vol-0a500e7456c1b7b6a" (OuterVolumeSpecName: "ch-storage-volume") pod "3be54537-a4eb-4821-b6b6-83b18151da79" (UID: "3be54537-a4eb-4821-b6b6-83b18151da79"). InnerVolumeSpecName "pvc-15457627-f532-41ca-b87b-a961ef7b745c". PluginName "kubernetes.io/csi", VolumeGidValue ""
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.355390    1904 reconciler_common.go:288] "operationExecutor.UnmountDevice started for volume \"pvc-15457627-f532-41ca-b87b-a961ef7b745c\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0a500e7456c1b7b6a\") on node \"ip-10-1-1-116.us-west-2.compute.internal\" "
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.375856    1904 operation_generator.go:1018] UnmountDevice succeeded for volume "pvc-15457627-f532-41ca-b87b-a961ef7b745c" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-0a500e7456c1b7b6a") on node "ip-10-1-1-116.us-west-2.compute.internal"
Mar 22 16:31:39 ip-10-1-1-116.us-west-2.compute.internal kubelet[1904]: I0322 16:31:39.456346    1904 reconciler_common.go:295] "Volume detached for volume \"pvc-15457627-f532-41ca-b87b-a961ef7b745c\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0a500e7456c1b7b6a\") on node \"ip-10-1-1-116.us-west-2.compute.internal\" DevicePath \"csi-8dcd123839e58f5246dc005b28f5783c0c42a4f2d3fc6411bdcae42d51fb5ebd\""

primeroz commented 7 months ago

@ConnorJC3 @alexandermarston my main concern right now is if i am the only one still facing the issue (because maybe some custom config on my side is still causing it ) or if you are able to reproduce it

Waiting to know from @levanlongktmt :pray:

At least from what i am doing it should be enough to reproduce it by

Eks cluster ( i am on 1.26.12 :thinking: ) with karpenter 0.35.2 and ebs 1.29.0 ( I install through helm chart from this repo )
Nodepool with nothing special ( i run on graviton bu i doubt that has anything to do with it )
A statefulset with PVC across multiple nodes in the nodepool

update an annotation on the nodepool like

➜ kubectl get nodepool testpool -o yaml | yq '.spec.template.metadata.annotations'
epoch-restart: "3"

Watch for the events

ConnorJC3 commented 7 months ago

my cluster using eks addon to setup ebs driver and the addon still in v1.28 😂 @ConnorJC3 is it safe if I just delete add on then re install via helm chart?

It should be safe, with one caveat: In the time between uninstalling the addon version and installing the Helm version, you won't be able to perform any volume-related operations (volumes will get stuck attempting to create/delete/attach/detach/etc). Normal operation should resume once the new version is installed.

Make sure that if you have any custom configuration that you wish to keep that you port it over to the Helm version.

Also note that we are currently working on releasing v1.29.0 as an EKS addon, expected to be released in all commercial regions by end of next week (2024-03-29).

ConnorJC3 commented 7 months ago

@ConnorJC3 @alexandermarston my main concern right now is if i am the only one still facing the issue (because maybe some custom config on my side is still causing it ) or if you are able to reproduce it

Waiting to know from @levanlongktmt 🙏

At least from what i am doing it should be enough to reproduce it by
* Eks cluster ( i am on 1.26.12 🤔  ) with `karpenter 0.35.2` and `ebs 1.29.0` ( I install through helm chart from this repo )

* Nodepool with nothing special ( i run on graviton bu i doubt that has anything to do with it )

* A statefulset with PVC across multiple nodes in the nodepool

* update an annotation on the nodepool like
➜ kubectl get nodepool testpool -o yaml | yq '.spec.template.metadata.annotations'
epoch-restart: "3"
* Watch for the events

Just to make sure - have you enabled graceful termination for the kubelet on your nodes? Assuming you are using the EKS optimized AMIs that setting isn't enabled by default (EKS is working on it here: https://github.com/awslabs/amazon-eks-ami/pull/1544), we have instructions in our FAQ to configure the EKS AMI to do so (see step 2 in this section: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/faq.md#what-steps-can-be-taken-to-mitigate-this-issue)

primeroz commented 7 months ago

Just to make sure - have you enabled graceful termination for the kubelet on your nodes?

Glad you made sure, cause no i did not know about that :pray: I will on monday first thing and report back but i would say that explains it

thanks for your patience !

levanlongktmt commented 7 months ago

I decided just keep using add-ons so I don't need to check and update helm manually, Mar 29 is fine with me. I will test again and update result when new version of add-on released (our cluster is in eu-central-1)

primeroz commented 7 months ago

@ConnorJC3 @alexandermarston following up from last week, i enabled Grace Shutdown this morning and did some more testing

TLDR: It seems to me that , at least in my case, when the prestop hook is run inside the ebs-csi-node pod the node has already been removed from the etcd store and because of that the prestop hook fails with an error node not found

This make sense since the shutdown manager is triggered by the node being shutdown on the cloud side and the node object is removed right after that. I can see the node vanish from _kubectl get node_ as soon as the shutdown is triggered

Trigger a node replacement using karpetenter drift management by updating an annotation on the nodepool spec.template.metadata

The kubelet logs are fetched through AWS Session Manager, they stream until the instance is actually terminated

Node has graceful shutdown enabled as described in the FAQ

[root@ip-10-1-7-208 /]# journalctl  -u kubelet | grep -i shutdown
Mar 25 11:13:51 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:13:51.507578    1754 nodeshutdown_manager_linux.go:137] "Creating node shutdown manager" shutdownGracePeriodRequested="45s" shutdownGracePeriodCriticalPods="15s" shutdownGracePeriodByPodPriority=[{Priority:0 ShutdownGracePeriodSeconds:30} {Priority:2000000000 ShutdownGracePeriodSeconds:15}]

1. When a node is replaced by karpenter drift manager the pod is evicted successfully, and the volume is unmounted and detached

0s          Normal    Evicted                     pod/c-yellow-ig-25-server-pvcv2mf-0         Evicted pod

Mar 25 11:24:39 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:39.817968    1754 kuberuntime_container.go:709] "Killing container with a grace period" pod="ns-yellow-ig-25/c-yellow-ig-25-server-pvcv2mf-0" podUID=b75386ed-40e0-4614-a552-6ddeb8b195da containerName="c-yellow-ig-25-server" containerID="containerd://09a8f4198773fc345c35bb035003ef266bdd6bb89b2ae4e2354951423cf498c6" gracePeriod=3600

Mar 25 11:24:40 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:40.933344    1754 generic.go:332] "Generic (PLEG): container finished" podID=b75386ed-40e0-4614-a552-6ddeb8b195da containerID="a5209c778ba56bba7c9770e97e7b9a763f634481e9cca0bc6f4b183c11e00dc2" exitCode=2
Mar 25 11:24:40 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:40.933371    1754 generic.go:332] "Generic (PLEG): container finished" podID=b75386ed-40e0-4614-a552-6ddeb8b195da containerID="09a8f4198773fc345c35bb035003ef266bdd6bb89b2ae4e2354951423cf498c6" exitCode=0

Mar 25 11:24:40 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:40.933452    1754 scope.go:115] "RemoveContainer" containerID="09a8f4198773fc345c35bb035003ef266bdd6bb89b2ae4e2354951423cf498c6"
Mar 25 11:24:40 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:40.939224    1754 scope.go:115] "RemoveContainer" containerID="a5209c778ba56bba7c9770e97e7b9a763f634481e9cca0bc6f4b183c11e00dc2"
Mar 25 11:24:40 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:40.944934    1754 scope.go:115] "RemoveContainer" containerID="09a8f4198773fc345c35bb035003ef266bdd6bb89b2ae4e2354951423cf498c6"

Mar 25 11:24:41 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:41.086647    1754 reconciler_common.go:295] "Volume detached for volume \"clickhouse-server-config-d-volume\" (UniqueName: \"kubernetes.io/configmap/b75386ed-40e0-4614-a552-6ddeb8b195da-clickhouse-server-config-d-volume\") on node \"ip-10-1-7-208.us-west-2.compute.internal\" DevicePath \"\""
Mar 25 11:24:41 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:41.115622    1754 operation_generator.go:1018] UnmountDevice succeeded for volume "pvc-9b50cf31-5a24-4783-8039-a362ad2a7c0d" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-09d0cbc269a5e3c66") on node "ip-10-1-7-208.us-west-2.compute.internal"
Mar 25 11:24:41 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:41.186926    1754 reconciler_common.go:295] "Volume detached for volume \"pvc-9b50cf31-5a24-4783-8039-a362ad2a7c0d\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-09d0cbc269a5e3c66\") on node \"ip-10-1-7-208.us-west-2.compute.internal\" DevicePath \"csi-ab4d8239111079de83d0e307498a5076d263aa3c8e12e1272bc34ab67a424568\""

Mar 25 11:24:41 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:41.253838    1754 kubelet.go:2224] "SyncLoop DELETE" source="api" pods="[ns-yellow-ig-25/c-yellow-ig-25-server-pvcv2mf-0]"
Mar 25 11:24:41 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:41.261334    1754 kubelet.go:2218] "SyncLoop REMOVE" source="api" pods="[ns-yellow-ig-25/c-yellow-ig-25-server-pvcv2mf-0]"

2. Node Shutdown manager takes over, i assume through karpenter deleting the cloud instance

The node disappear from kubectl get ndoe at the same time the shutdown event is detected

Mar 25 11:24:41 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:41.967523    1754 nodeshutdown_manager_linux.go:265] "Shutdown manager detected new shutdown event, isNodeShuttingDownNow" event=true
Mar 25 11:24:41 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:41.967543    1754 nodeshutdown_manager_linux.go:273] "Shutdown manager detected new shutdown event" event="shutdown"
t="shutdown"
Mar 25 11:24:41 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:41.967567    1754 nodeshutdown_manager_linux.go:322] "Shutdown manager processing shutdown event"

3. Here kubelet is already failing toget the node from apiserver

Mar 25 11:24:41 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: E0325 11:24:41.968804    1754 kubelet_node_status.go:545] "Error updating node status, will retry" err="error getting node \"ip-10-1-7-208.us-west-2.compute.internal\": nodes \"ip-10-1-7-208.us-west-2.compute.internal\" not found"
Mar 25 11:24:41 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: E0325 11:24:41.971332    1754 kubelet_node_status.go:545] "Error updating node status, will retry" err="error getting node \"ip-10-1-7-208.us-west-2.compute.internal\": nodes \"ip-10-1-7-208.us-west-2.compute.internal\" not found"

4. all the daemonsets pods start to get terminated , including ebs-csi-driver

Mar 25 11:24:42 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:42.973727    1754 nodeshutdown_manager_linux.go:375] "Shutdown manager killing pod with gracePeriod" pod="kube-system/ebs-csi-node-rl95f" gracePeriod=15
Mar 25 11:24:42 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:42.973742    1754 nodeshutdown_manager_linux.go:375] "Shutdown manager killing pod with gracePeriod" pod="kube-system/cilium-6cd2t" gracePeriod=1

Mar 25 11:24:42 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:42.974764    1754 kuberuntime_container.go:709] "Killing container with a grace period" pod="kube-system/ebs-csi-node-rl95f" podUID=0ad30749-c716-4c7d-846a-a786b08d13b7 containerName="node-driver-registrar" containerID="containerd://58bd98c6de45d5774f10967525996bf73f4932e6ad48b60569bde824c5a8d31a" gracePeriod=15

### Cilium is killed 
Mar 25 11:24:43 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:43.007024    1754 kuberuntime_container.go:709] "Killing container with a grace period" pod="kube-system/cilium-6cd2t" podUID=cb194ca6-d9d3-48b0-9862-2d3cdaadd1f7 containerName="cilium-agent" containerID="containerd://a13f2b2f4ca37ebfa60119534aefd6fc4ad4094c4c38b05fb51fa3788d7f0ae7" gracePeriod=1

5. EBS pre stop fails , this i because by the time the prestop hook is run the node object has already been removed from etcd store and the fetchNode fails

Increasing cilium cni terminationgraceperiod did not help , it defaults to 1 but it make sense that when terminating it does not flush the ebpf configuration or it would break nodes when there are upgrades to cilium and such.

also the error is about node xxxx not found and not having problems connecting to the apiserver

Mar 25 11:24:43 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: E0325 11:24:43.038607    1754 handlers.go:78] "Exec lifecycle hook for Container in Pod failed" err=<
Mar 25 11:24:43 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: command '/bin/aws-ebs-csi-driver pre-stop-hook' exited with 1: I0325 11:24:43.020810      29 prestop.go:32] "PreStop: executing PreStop lifecycle hook"
Mar 25 11:24:43 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: E0325 11:24:43.035020      29 options.go:89] "failed to execute PreStop lifecycle hook" err="fetchNode: failed to retrieve node information: nodes \"ip-10-1-7-208.us-west-2.compute.internal\" not found"
Mar 25 11:24:43 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: > execCommand=[/bin/aws-ebs-csi-driver pre-stop-hook] containerName="ebs-plugin" pod="kube-system/ebs-csi-node-rl95f" message=<

Mar 25 11:24:43 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: I0325 11:24:43.020810      29 prestop.go:32] "PreStop: executing PreStop lifecycle hook"
Mar 25 11:24:43 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: E0325 11:24:43.035020      29 options.go:89] "failed to execute PreStop lifecycle hook" err="fetchNode: failed to retrieve node information: nodes \"ip-10-1-7-208.us-west-2.compute.internal\" not found"

Mar 25 11:24:43 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: E0325 11:24:43.038642    1754 kuberuntime_container.go:596] "PreStop hook failed" err=<
Mar 25 11:24:43 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: command '/bin/aws-ebs-csi-driver pre-stop-hook' exited with 1: I0325 11:24:43.020810      29 prestop.go:32] "PreStop: executing PreStop lifecycle hook"
Mar 25 11:24:43 ip-10-1-7-208.us-west-2.compute.internal kubelet[1754]: E0325 11:24:43.035020      29 options.go:89] "failed to execute PreStop lifecycle hook" err="fetchNode: failed to retrieve node information: nodes \"ip-10-1-7-208.us-west-2.compute.internal\" not found"

And i get the multi attach error with 6m timeout

0s          Warning   FailedAttachVolume          pod/c-yellow-ig-25-server-pvcv2mf-0         Multi-Attach error for volume "pvc-9b50cf31-5a24-4783-8039-a362ad2a7c0d" Volume is already exclusively attached to one node and can't be attached 
to another

Update For reference , i matched the logs from kubelet ( another node replacement event ) and AWS Logs Insights and i can see the same story.

By the time the ebs csi node pod termination starts through the kubelet shutdown manager the node is already gone from etcd and so the prestop hook fails.

It seems to me that what should be happening is that this 4 evictions, including the critical daemonsets should trigger the ebs-csi-node prestop hook.

At least on my tests I never see an eviction to any system-critical daemonset pod in any log , not in kubelet not in Cloudwatch.

16:35:04 - Repalcement of node ip-10-1-9-107.us-west-2.compute.internal starts
16:35:03 - AWS Logs - Karpenter user added taints to node ip-10-1-9-107.us-west-2.compute.internal - karpenter.sh/disruption=disrupting
16:35:30 - AWS Logs - new node ip-10-1-11-3.us-west-2.compute.internal is created and available
16:35:49 - AWS Logs - Karpenter user delete event on nodeclaim mixed-memory-ssd-arm64-wjkmf
16:35:49 - AWS Logs - Karpenter user delete node on ip-10-1-9-107.us-west-2.compute.internal
16:35:49 - AWS Logs - Karpenter user label node.kubernetes.io/exclude-from-external-load-balancers=karpenter to node
16:35:50 - AWS Logs - Karpenter user create eviction for pod c-yellow-ig-25-server-hbzgsvd-0
16:35:51 - AWS Logs - Karpenter user delete event on nodeclaim mixed-memory-ssd-arm64-wjkmf
16:35:52 - AWS Logs - Karpenter user delete event on nodeclaim mixed-memory-ssd-arm64-wjkmf
16:35:53 - AWS Logs - Karpenter user delete event on nodeclaim mixed-memory-ssd-arm64-wjkmf
16:35:54 - AWS Logs - Karpenter user delete event on nodeclaim mixed-memory-ssd-arm64-wjkmf
16:35:55 - AWS Logs - Karpenter user delete event on nodeclaim mixed-memory-ssd-arm64-wjkmf - The nodeclaim keep getting a delete every second as part of the finalizer ?
16:36:04 - Kubelet Logs - ip-10-1-9-107.us-west-2.compute.internal - "Killing container with a grace period" pod="ns-yellow-ig-25/c-yellow-ig-25-server-hbzgsvd-0"
16:36:04 - AWS_Logs - system:node:ip-10-1-9-107.us-west-2.compute.internal delete pod c-yellow-ig-25-server-hbzgsvd-0
16:36:04 - AWS_Logs - Binding event for pod c-yellow-ig-25-server-hbzgsvd-0 on target ip-10-1-11-3.us-west-2.compute.internal
16:36:04 - AWS Logs - Karpenter user delete event on nodeclaim mixed-memory-ssd-arm64-wjkmf
16:36:04 - AWS Logs - Node Status update for ip-10-1-9-107.us-west-2.compute.internal Successful
16:36:04 - AWS Logs - system:serviceaccount:kube-system:attachdetach-controller patch node ip-10-1-9-107 with {"status":{"volumesAttached":null}
16:36:05 - AWS Logs - Karpenter user remove finalizer from node ip-10-1-9-107
16:36:05 - AWS Logs - Karpenter user delete event on nodeclaim mixed-memory-ssd-arm64-wjkmf - Last Ever
16:36:05 - AWS Logs - system:serviceaccount:kube-system:generic-garbage-collector delete csinodes , ciliumnodes for ip-10-1-9-107
16:36:05 - AWS Logs - Karpenter try to patch node ip-10-1-9-107 and get a Failure back node not found - attempt to remove finalizer from node again ?
16:36:05 - As of this time the Node object ip-10-1-9-107 cannot be found in etcd through apiserver anymore
16:36:05 - Kubelet Logs - ip-10-1-9-107.us-west-2.compute.internal - "Shutdown manager detected new shutdown event, isNodeShuttingDownNow" event=true
16:36:05 - Kubelet Logs - ip-10-1-9-107.us-west-2.compute.internal - "Shutdown manager detected new shutdown event" event="shutdown"
16:36:05 - Kubelet Logs - ip-10-1-9-107.us-west-2.compute.internal - "Shutdown manager detected new shutdown event" event="shutdown"
16:36:05 - Kubelet Logs - ip-10-1-9-107.us-west-2.compute.internal - "Error updating node status, will retry" err="error getting node \"ip-10-1-9-107.us-west-2.compute.internal\": nodes \"ip-10-1-9-107.us-west-2.compute.internal\" not f ound"
16:36:05 - Kubelet Logs - ip-10-1-9-107 - "Unable to update node status" err="update node status exceeds retry count"
16:36:06 - Kubelet Logs - ip-10-1-9-107 - "Shutdown manager killing pod with gracePeriod" pod="kube-system/ebs-csi-node-7bk9k" gracePeriod=15
16:36:07 - AWS Logs - system:serviceaccount:kube-system:ebs-csi-node-sa get Node ip-10-1-9-107 Failure odes "ip-10-1-9-107.us-west-2.compute.internal" not found
16:36:07 - Kubelet Logs - ip-10-1-9-107 - "PreStop: executing PreStop lifecycle hook"
16:36:07 - Kubelet Logs - ip-10-1-9-107 - "Exec lifecycle hook for Container in Pod failed" err=<command '/bin/aws-ebs-csi-driver pre-stop-hook' exited with 1
16:36:07 - Kubelet Logs - ip-10-1-9-107 - "failed to execute PreStop lifecycle hook" err="fetchNode: failed to retrieve node information: nodes \"ip-10-1-9-107.us-west-2.compute.internal\" not found"
16:36:21 - Kubelet Logs - ip-10-1-9-107.us-west-2 - "Shutdown manager pod killing time out" gracePeriod=15 priority=2000000000
16:36:21 - Kubelet Logs - ip-10-1-9-107.us-west-2 - "Shutdown manager completed processing shutdown event, node will shutdown shortly"

primeroz commented 7 months ago

@levanlongktmt Honestly at this point i will just wait and see what your experience is when you get to upgrade the addon

levanlongktmt commented 7 months ago

@primeroz just some information, I didn't use kubectl delete node to kill a node. All my nodes is spot instance so I kill node by go to spot request then choose initiate interuption to simulate aws kill ec2

primeroz commented 7 months ago

@levanlongktmt thanks.

I am not using kubectl delete node either anymore. I am changing the specs of node pools to let karpenter replace nodes with drift manager. Similar to what an Ami upgrade would be.

Even though it seems to me that deleting a node or a node claim goes through the same codepath in karpenter.

I might try to do what you are doing just to check if there is any difference in behaviour.

primeroz commented 7 months ago

To try and exclude the many variables my dev environment might be injecting i built a simple POC - https://github.com/primeroz/eks-karpenter-ebs-poc

This is a barebone EKS cluster ( 1.26 to match my current version ) with EBS And Karpenter and nothing more.

2 simple statefulsets , same multi attach error

I am planning to test more tomorrow , maybe upgrade eks cluster versions, try the spot instance usecase and such.

primeroz commented 7 months ago

Using the POC above i did do a full run of tests on EKS 1.26 to 1.29 and all of them fail with the same problem https://github.com/primeroz/eks-karpenter-ebs-poc/blob/main/RESULTS.md

Setup is

EKS Standard Configuration
NodePool / NodeClass with enabled gracefule shutdown
EBS Csi Driver installed from helm chart 2.29.0
Karpenter installed from helm chart 0.35.2

The test is a

statefulset with 1 node

All type of karpenter node replacement failed with the Multi Attach Error and 6 minute wait

Node consolidation
Drift Manager on nodepool spec
Spot Instance Termination through SQS notification
Kubectl delete of nodeclaim and node using karpenter finalizer

@ConnorJC3 @alexandermarston This should be easy to reproduce with the code above that only need terraform and helm , happy to do any further test if there is anything else you think i should do or any other feature i should enable in order to make this work

:pray:

levanlongktmt commented 7 months ago

Some update from my side, I uninstalled addon and use helm to install v1.29 and still face multi-attach error. Karpenter version is v0.35.2

Here is Nodeclass and NodePool

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: experiment
spec:
  amiFamily: AL2023
  role: "xxxxxx"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster-name"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster-name"
  userData: |
    #!/bin/bash
    echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
    systemctl restart systemd-logind
    echo "$(jq ".shutdownGracePeriod=\"45s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
    echo "$(jq ".shutdownGracePeriodCriticalPods=\"15s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
    systemctl restart kubelet
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: elasticsearch
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        name: experiment
  limits:
    cpu: 20
    memory: 100Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h

I also just do trigger spot instance interruption and get multi-attach error randomly.

primeroz commented 7 months ago

  userData: |
    #!/bin/bash
    echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
    systemctl restart systemd-logind
    echo "$(jq ".shutdownGracePeriod=\"45s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
    echo "$(jq ".shutdownGracePeriodCriticalPods=\"15s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
    systemctl restart kubelet

you don't really need the restart of kubelet there since this userData is run before the eks-bootstrap.sh script that configures other bits of kubelet and then start it ( I actually don't think kubelet is even started when you get there )

igor-nikiforov commented 7 months ago

Have same issue with EKS 1.28, EBS CSI Driver 1.29.0 and Karpenter 0.35.2

primeroz commented 7 months ago

For what is worth , i am getting more and more convinced this is a karpenter problem that needs to be addressed there.

in particular the logic where a node gets removes as soon as the delete signal is sent from karpenter , this is something that the karpenter dev team is thinking of changing

ghost commented 7 months ago

I have the same issue as well with EKS 1.27, EBS CSI Driver 1.29.1 and Karpenter 0.32.7.

Another way this could potentially be fixed would be for the ebs driver's prestop hook to check the error it receives when it tries to get the node. Currently if they get any error when trying to get the node, they just exit out. They should check if the error is that the node doesn't exist and in that case proceed with the volume attachment cleanup.

jonathan-innis commented 7 months ago

Update here: This change merged here (https://github.com/kubernetes-sigs/karpenter/pull/1164) should now wait for the instance termination to fully complete before it goes and removes the node. This should ideally completely eliminate this issue. We're targeting v0.37.0 for release of this change (a couple versions out, hopefully out in the next few weeks or so).

jonathan-innis commented 7 months ago

Linking the relevant Karpenter issue for this one: https://github.com/kubernetes-sigs/karpenter/issues/944. Feel free to add some comments/discussion over there as well.

kubernetes-sigs / aws-ebs-csi-driver

Volume still hang on Karpenter Node Consolidation/Termination #1955