longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] rpc error: code = DeadlineExceeded when i try to attach any volumes on my kubernetes cluster #5990

Open Anddiy opened 1 year ago

Anddiy commented 1 year ago

Describe the bug (🐛 if you encounter this issue)

Longhorn can't attach any volume to any node. I have Longhorn installed on my Kubernetes cluster and all the Longhorn pods are running 1/1, but when I tried to install PostgreSQL on my cluster, the pod got stuck in ContainerCreating. The pod events show that Longhorn can't attach the volume:

Warning FailedAttachVolume 9s attachdetach-controller AttachVolume.Attach failed for volume "pvc-f6a5a7b8-a053-4cf0-be54-5ae8141b4b0b" : rpc error: code = DeadlineExceeded desc = volume pvc-f6a5a7b8-a053-4cf0-be54-5ae8141b4b0b failed to attach to node vmrmwrknodeprod06

I have a Kubernetes cluster with 9 nodes and sufficient space (the PVC needs 8 GB and I have 2 TB). I've already tried reducing the replica count to 1 and turning on Replica Node Level Soft Anti-Affinity.
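For anyone triaging a similar DeadlineExceeded attach failure, a few commands along these lines usually surface the underlying reason (a sketch; the pod and namespace placeholders and the volume name come from your own cluster):

# <pod> and <namespace> are placeholders for the workload that owns the PVC.
kubectl -n <namespace> describe pod <pod>                        # shows the FailedAttachVolume events
kubectl -n longhorn-system get volumes.longhorn.io               # Longhorn's view of the volume state
kubectl -n longhorn-system describe volumes.longhorn.io pvc-f6a5a7b8-a053-4cf0-be54-5ae8141b4b0b
kubectl -n longhorn-system logs -l app=longhorn-manager --tail=200 | grep -i attach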

To Reproduce

Steps to reproduce the behavior:

1. Install Longhorn on a Kubernetes cluster.
2. Try to attach any PVC using a StorageClass created from this YAML:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nfs-storage
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "1440"
  fromBackup: ""
  fsType: "ext4"
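For reference, a minimal PVC that exercises this StorageClass might look like the following (a sketch; the name repro-pvc and namespace default are hypothetical placeholders):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: repro-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: nfs-storage
  resources:
    requests:
      storage: 8Gi
EOF

# Then mount it in any pod and watch for attach failures:
kubectl get events -n default --field-selector reason=FailedAttachVolume -w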

Expected behavior

The PVC attaches successfully.

Log or Support bundle

The only logs I've found are the pod logs. I'm trying to run a PostgreSQL pod on my cluster, and the PVC has been created successfully, as we can see here:

[15:43] root@vmrmmstnodeprod01 [longhorn]:# kubectl get pvc -n bmc-helix-platform
NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-postgres-postgresql-ha-postgresql-0   Bound    pvc-2e771f20-7bfd-47de-bb1d-cee40d37f198   8Gi        RWO            nfs-storage    17m

But in the Longhorn UI the volume is stuck attaching and detaching. I've found that this is happening because the PostgreSQL pod reports this error:

Warning  FailedAttachVolume  16s   attachdetach-controller  AttachVolume.Attach failed for volume "pvc-2e771f20-7bfd-47de-bb1d-cee40d37f198" : rpc error: code = DeadlineExceeded desc = volume pvc-2e771f20-7bfd-47de-bb1d-cee40d37f198 failed to attach to node vmrmwrknodeprod08

With that I've learned why the volume keeps attaching and detaching: when this error occurs, Kubernetes kills the pod and recreates it on another node, so Longhorn detaches the volume and tries to attach it again, but the attach fails with the same error, creating the attach/detach loop. I don't know how to fix this.

Environment

Additional context

Add any other context about the problem here.

innobead commented 1 year ago

@Anddiy Please provide the support bundle. Thanks.

Anddiy commented 1 year ago

@innobead, the only logs I've found are the pod logs. I'm trying to run a PostgreSQL pod on my cluster, and the PVC has been created successfully, as we can see here:

[15:43] root@vmrmmstnodeprod01 [longhorn]:# kubectl get pvc -n bmc-helix-platform
NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-postgres-postgresql-ha-postgresql-0   Bound    pvc-2e771f20-7bfd-47de-bb1d-cee40d37f198   8Gi        RWO            nfs-storage    17m

But in the Longhorn UI the volume is stuck attaching and detaching. I've found that this is happening because the PostgreSQL pod reports this error:

Warning  FailedAttachVolume  16s   attachdetach-controller  AttachVolume.Attach failed for volume "pvc-2e771f20-7bfd-47de-bb1d-cee40d37f198" : rpc error: code = DeadlineExceeded desc = volume pvc-2e771f20-7bfd-47de-bb1d-cee40d37f198 failed to attach to node vmrmwrknodeprod08

With that I've learned why the volume keeps attaching and detaching: when this error occurs, Kubernetes kills the pod and recreates it on another node, so Longhorn detaches the volume and tries to attach it again, but the attach fails with the same error, creating the attach/detach loop. I don't know how to fix this.

Anddiy commented 1 year ago

@innobead , heres the support bundle:

supportbundle_d0c1d097-f6a4-4bf9-90b1-89af53b33d09_2023-05-26T19-33-20Z.zip

ejweber commented 1 year ago

Thanks @Anddiy. I haven't yet had time to fully look through the provided bundle, but we're seeing a lot of iSCSI errors related to that volume on multiple nodes. For example:

2023-05-26T19:09:55.626600169Z time="2023-05-26T19:09:55Z" level=warning msg="pvc-2e771f20-7bfd-47de-bb1d-cee40d37f198-e-a261d1ed: time=\"2023-05-26T19:09:31Z\" level=warning msg=\"Failed to discover\" error=\"failed to execute: nsenter [--mount=/host/proc/1/ns/mnt --net=/host/proc/1/ns/net iscsiadm -m discovery -t sendtargets -p 10.42.121.235], output , stderr Failed to connect to bus: No data available\\niscsiadm: can not connect to iSCSI daemon (111)!\\nFailed to connect to bus: No data available\\niscsiadm: can not connect to iSCSI daemon (111)!\\niscsiadm: Cannot perform discovery. Initiatorname required.\\niscsiadm: Could not perform SendTargets discovery: could not connect to iscsid\\n: exit status 20\""

Can you check that the iscsid service is healthy on your nodes?
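A quick way to check on each node might be something like this (a sketch; assumes an RHEL-family node where the daemon is iscsid and the package is iscsi-initiator-utils):

# Is the iSCSI daemon present and running?
systemctl status iscsid

# iscsiadm needs a configured initiator name; an empty file produces the
# "Initiatorname required" error seen in the log above.
cat /etc/iscsi/initiatorname.iscsi

# Start and enable the daemon if it is stopped.
sudo systemctl enable --now iscsid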

One issue that included similar errors was https://github.com/longhorn/longhorn/issues/5381.

ejweber commented 1 year ago

Maybe you forgot to install open-iscsi in this cluster before deploying Longhorn? If not, we can continue to help troubleshoot.

https://github.com/longhorn/longhorn/issues/3507#issuecomment-1006831026

https://longhorn.io/docs/1.5.0/deploy/install/#installation-requirements
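For reference, and hedging a bit since the exact manifest path can differ per Longhorn version, open-iscsi can be installed per node or via the prerequisite DaemonSet referenced in the docs:

# Per node on RHEL/Rocky/Alma:
sudo yum install -y iscsi-initiator-utils && sudo systemctl enable --now iscsid

# Per node on Debian/Ubuntu:
# sudo apt-get install -y open-iscsi && sudo systemctl enable --now iscsid

# Or cluster-wide via the Longhorn prerequisite DaemonSet (path assumed for v1.5.x):
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.5.0/deploy/prerequisite/longhorn-iscsi-installation.yaml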

Anddiy commented 1 year ago

Thanks for the info @ejweber, I will check that on my nodes.

Anddiy commented 1 year ago

@ejweber, I've checked and I already have open-iscsi installed on my nodes:

[16:35] root@vmrmwrknodeprod03 [~]:# yum install iscsi-initiator-utils
Last metadata expiration check: 5:27:42 ago on Mon 29 May 2023 11:08:02 AM -03.
Package iscsi-initiator-utils-6.2.1.4-4.git095f59c.el8.x86_64 is already installed.
Dependencies resolved.
Nothing to do.
Complete!

ejweber commented 1 year ago

Thanks @Anddiy. Can you also check that the service is running? That was the issue in https://github.com/longhorn/longhorn/issues/3507#issuecomment-1006831026.

tejassp919 commented 1 year ago

@ejweber I am facing the same FailedAttachVolume issue with a minor difference in my StorageClass configuration:

numberOfReplicas: "2"
staleReplicaTimeout: "30"

I have checked that in my case iscsid.service is up and running.

GuillaumeDorschner commented 1 year ago

I have the same issue. Did you find out what it was?

ejweber commented 1 year ago

Ok, I'm not 100% sure the logs we have are indicative of this, but we have been seeing a number of iSCSI issues recently in Rocky related to an update to the container-selinux package. You can follow along at https://github.com/longhorn/longhorn/issues/5627.

If you are running an up-to-date version of Rocky/RHEL, please check:

rpm -q container-selinux
# Am I running greater than v2.189?

getenforce
# Am I running SELinux in enforcing mode?

ausearch -m AVC | grep iscsid_t
# Am I seeing `iscsid_t` and `dac_override` related SELinux denials?

If you can answer yes to all three questions, you are likely hitting that issue.

EDIT: There was a typo in one of the above commands that was misleading. I fixed it.

GuillaumeDorschner commented 1 year ago

@ejweber, I've checked the parameters you mentioned. My version of container-selinux is v2.205.0 and SELinux is running in enforcing mode. However, the ausearch -m AVC | grep iscsd_t command didn't return any results. Given these conditions, could this still be related to the iSCSI issues you mentioned, or might there be another potential cause? I obtained identical results on both my master and worker nodes.

[root@master ~]# rpm -q container-selinux
container-selinux-2.205.0-2.module_el8.8.0+3557+7ba9cc13.noarch
[root@master ~]# getenforce
Enforcing
[root@master ~]# ausearch -m AVC | grep iscsd_t
[root@master ~]#

ejweber commented 1 year ago

There was a typo in my command, unfortunately. Try ausearch -m AVC | grep iscsid_t. I have updated the comment above to avoid confusing others. Just drop the grep if you still get nothing and/or look in journalctl for related messages.

Another quick test might be to fully detach the volume, then run setenforce Permissive on a node, then attach the volume to that node. If attaching works, you are almost certainly hitting #5627. You may or may not be "allowed" to do this on the system you are using.

I'm working on getting a KB out and a DaemonSet that can fix things automatically, but the medium-term solution will essentially be to execute echo '(allow iscsid_t self (capability (dac_override)))' > local_longhorn.cil && semodule -vi local_longhorn.cil on every node to give iscsiadm a permission it currently requires to do its job.
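Spelled out per node, that test plus workaround would look roughly like this (a sketch; local_longhorn is just the module name used in the command above, and Permissive mode should be reverted after the test):

# 1. Temporary test: does the volume attach with SELinux in Permissive mode?
sudo setenforce Permissive
# ... detach and reattach the volume, watch for FailedAttachVolume events ...
sudo setenforce Enforcing

# 2. Workaround: allow iscsid_t the dac_override capability it is being denied.
echo '(allow iscsid_t self (capability (dac_override)))' > local_longhorn.cil
sudo semodule -vi local_longhorn.cil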

GuillaumeDorschner commented 1 year ago

Output of ausearch -m AVC | grep iscsid_t:

type=SYSCALL msg=audit(1686131484.778:2573): arch=c000003e syscall=83 success=no exit=-13 a0=557457b665d0 a1=1f8 a2=fffffffffffffeb0 a3=7ffd86516cb7 items=0 ppid=1187339 pid=1187829 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iscsiadm" exe="/usr/sbin/iscsiadm" subj=system_u:system_r:iscsid_t:s0 key=(null)
type=AVC msg=audit(1686131484.778:2573): avc:  denied  { dac_override } for  pid=1187829 comm="iscsiadm" capability=1  scontext=system_u:system_r:iscsid_t:s0 tcontext=system_u:system_r:iscsid_t:s0 tclass=capability permissive=0
type=SYSCALL msg=audit(1686138208.882:2989): arch=c000003e syscall=59 success=yes exit=0 a0=7ffdf6cdd1a0 a1=7ffdf6cdd3e0 a2=7ffdf6cdd3f8 a3=0 items=0 ppid=1341909 pid=1341947 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iscsiadm" exe="/usr/sbin/iscsiadm" subj=system_u:system_r:iscsid_t:s0 key=(null)
type=AVC msg=audit(1686138208.882:2989): avc:  denied  { read } for  pid=1341947 comm="iscsiadm" path="/dev/null" dev="tmpfs" ino=7624775 scontext=system_u:system_r:iscsid_t:s0 tcontext=system_u:object_r:container_runtime_tmpfs_t:s0 tclass=chr_file permissive=0

Output of journalctl:

juin 08 17:08:45 master avahi-daemon[1309]: New relevant interface vethcd226a42.IPv6 for mDNS.
juin 08 17:08:45 master avahi-daemon[1309]: Registering new address record for fe80::2441:76ff:fe47:ca66 on vethcd226a42.*.
juin 08 17:08:45 master kubelet[627130]: I0608 17:08:45.984915  627130 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="longhorn-system/longhorn-csi-plugin-r7cb5" podStartSL>
juin 08 17:08:46 master avahi-daemon[1309]: Joining mDNS multicast group on interface vethf6cf4c46.IPv6 with address fe80::1c49:55ff:fe17:dfb3.
juin 08 17:08:46 master avahi-daemon[1309]: New relevant interface vethf6cf4c46.IPv6 for mDNS.
juin 08 17:08:46 master avahi-daemon[1309]: Registering new address record for fe80::1c49:55ff:fe17:dfb3 on vethf6cf4c46.*.
juin 08 17:08:46 master kubelet[627130]: I0608 17:08:46.382253  627130 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="longhorn-system/csi-snapshotter-588bd6f85d-bm8pq" pod>
juin 08 17:08:46 master kubelet[627130]: I0608 17:08:46.543372  627130 csi_plugin.go:99] kubernetes.io/csi: Trying to validate a new CSI Driver with name: driver.longhorn.io endpoint: /var/lib/kube>
juin 08 17:08:46 master kubelet[627130]: I0608 17:08:46.543413  627130 csi_plugin.go:112] kubernetes.io/csi: Register new plugin with name: driver.longhorn.io at endpoint: /var/lib/kubelet/plugins/>

Using setenforce Permissive did let the volume attach, but in the end I tried echo '(allow iscsid_t self (capability (dac_override)))' > local_longhorn.cil && semodule -vi local_longhorn.cil, and it did not work.
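For anyone retracing this, two quick checks on whether the CIL module took effect might be (a sketch):

# Is the custom policy module actually installed?
sudo semodule -l | grep local_longhorn

# Are there fresh AVC denials for iscsiadm after retrying the attach?
sudo ausearch -m AVC -ts recent | grep iscsiadm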

ejweber commented 1 year ago

type=AVC msg=audit(1686138208.882:2989): avc: denied { read } for pid=1341947 comm="iscsiadm" path="/dev/null" dev="tmpfs" ino=7624775 scontext=system_u:system_r:iscsid_t:s0 tcontext=system_u:object_r:container_runtime_tmpfs_t:s0 tclass=chr_file permissive=0

This one is a new one for me. I'll have to investigate. If anyone reading this also hits it, please chime in.

GuillaumeDorschner commented 1 year ago

Hello @ejweber, I was wondering if there have been any updates on the issue we discussed above. Could you please clarify what might be causing this issue and any potential solutions? If you need more information or details from my end to help identify or resolve this issue, please let me know what you need. Thank you in advance for your help.

GuillaumeDorschner commented 1 year ago

Hello @ejweber, I wanted to provide an update on my situation. After several attempts, I managed to get the volumes attached, but the process was far from seamless. I had to reset my entire cluster and perform a yum remove followed by a yum install for all packages on my AlmaLinux nodes. Now things seem to be working.

However, I have been encountering a warning at the start of each pod launch:

Events:                                                                                                                                                                                                                                  
  Type     Reason                  Age                From                     Message                                                                                                                                                   
  ----     ------                  ----               ----                     -------                                                                                                                                                   
  Warning  FailedScheduling        56m (x2 over 56m)  default-scheduler        0/2 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod..
  Normal   Scheduled               56m                default-scheduler        Successfully assigned default/keycloak-postgresql-0 to master                                                                                             
  Normal   SuccessfulAttachVolume  56m                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-c1287895-8a07-4906-b507-9e26f24b7722"                                                                       
  Normal   Pulled                  55m                kubelet                  Container image "docker.io/bitnami/postgresql:11.14.0-debian-10-r28" already present on machine                                                           
  Normal   Created                 55m                kubelet                  Created container keycloak-postgresql                                                                                                                     
  Normal   Started                 55m                kubelet                  Started container keycloak-postgresql

In the Longhorn dashboard the volume status is Degraded.

I'm not sure why this is happening, but it suggests that there might still be an underlying issue. Could this be related to the original problem, or is it something new? If you have any advice or thoughts, I'd appreciate them.

ejweber commented 1 year ago

I am thinking the second SELinux denial I previously called out is/was a red herring. I have been unable to reproduce it (or any potential negative effects) in my environment, but will still be interested in more data on it in the future.

It looks like there may be resource constraints in the cluster, as that is usually the reason for preemption-related messages. The "pod has unbound immediate PersistentVolumeClaims" message is no big deal. It just means the scheduler got to the pod before Longhorn finished creating the volume. This is only a problem if it persists for a long period of time.

If the degraded volumes are a persistent issue, I recommend opening a new issue with some additional information, like:

We can get at all of this if you are able to upload a support bundle.

My first, best guess with limited information is that this is unrelated to the issues above and that some kind of resource pressure is preventing replicas from running.
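As a starting point, a few commands along these lines should show whether replicas are failing to schedule (a sketch; kubectl top needs metrics-server installed):

# Replica objects for the volume and their current state.
kubectl -n longhorn-system get replicas.longhorn.io

# Longhorn's view of node schedulability and disk usage.
kubectl -n longhorn-system get nodes.longhorn.io

# Raw CPU/memory pressure per node (requires metrics-server).
kubectl top nodes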

robertwbl commented 3 months ago

I also ran into the same rpc error, which turned out to have a different cause, but I'll share it here for anyone who finds themselves looking at the same thing.

In my instance-manager logs I was getting:

libkmod: kmod_module_insert_module: could not find module by name='iscsi_tcp'
iscsiadm: Could not insert module tcp. Kmod error -2
iscsiadm: iSCSI driver tcp is not loaded. Load the module then retry the command.
iscsiadm: Could not perform SendTargets discovery: iSCSI driver not found. Please make sure it is loaded, and retry the operation
exit status 12

This was because my kernel didn't include the iscsi_tcp module, as detailed here: https://github.com/longhorn/longhorn/issues/1853
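A quick check and fix on an affected node might be (a sketch; if the running kernel genuinely does not ship iscsi_tcp, modprobe will fail and a kernel that includes the module is needed):

# Is the module already loaded?
lsmod | grep iscsi_tcp

# Try to load it now.
sudo modprobe iscsi_tcp

# Load it automatically on boot once it works.
echo iscsi_tcp | sudo tee /etc/modules-load.d/iscsi_tcp.conf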

pengln commented 1 week ago

I'm running Longhorn in kind. iscsid is failing, so the PVC can't mount; iscsid reports "sendmsg: bug? ctrl_fd 5".