Open Anddiy opened 1 year ago
@Anddiy Please provide the support bundle. Thanks.
@innobead , the only logs I've found are the pod logs. I'm trying to run a PostgreSQL pod on my cluster; the PVC has been created successfully, as we can see here:
[15:43] root@vmrmmstnodeprod01 [longhorn]:# kubectl get pvc -n bmc-helix-platform
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
data-postgres-postgresql-ha-postgresql-0 Bound pvc-2e771f20-7bfd-47de-bb1d-cee40d37f198 8Gi RWO nfs-storage 17m
But in the Longhorn UI, the PVC is stuck attaching and detaching. I've found that this is happening because the PostgreSQL pod is reporting this error:
Warning FailedAttachVolume 16s attachdetach-controller AttachVolume.Attach failed for volume "pvc-2e771f20-7bfd-47de-bb1d-cee40d37f198" : rpc error: code = DeadlineExceeded desc = volume pvc-2e771f20-7bfd-47de-bb1d-cee40d37f198 failed to attach to node vmrmwrknodeprod08
With that I've learned that the PVC keeps attaching and detaching: when this error occurs, Kubernetes kills the pod and recreates it on another node, so Longhorn detaches the volume and tries to attach it again, but the attach fails with the same error, creating the attach/detach loop. I don't know how to fix this.
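For anyone trying to confirm this loop, the churn is visible in the Kubernetes events stream. A minimal sketch (assumes kubectl access to the affected cluster; the volume name below is the one from this report):

```shell
# Sketch: surface attach/detach events for the affected Longhorn volume.
# PV name taken from the kubectl get pvc output above.
PV=pvc-2e771f20-7bfd-47de-bb1d-cee40d37f198

if command -v kubectl >/dev/null 2>&1; then
    # Sort by time so the attach -> fail -> detach cycle shows up in order.
    EVENTS=$(kubectl get events -A --sort-by=.lastTimestamp 2>/dev/null | grep "$PV" || true)
else
    EVENTS=""
fi
echo "${EVENTS:-no events found (or kubectl unavailable)}"
```

A repeating FailedAttachVolume / Scheduled / FailedAttachVolume pattern for the same PV confirms the loop described above.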
@innobead , heres the support bundle:
supportbundle_d0c1d097-f6a4-4bf9-90b1-89af53b33d09_2023-05-26T19-33-20Z.zip
Thanks @Anddiy. Haven't yet had time to fully look through the provided bundle, but we're seeing a lot of iSCSI related errors related to that volume on multiple nodes. For example:
2023-05-26T19:09:55.626600169Z time="2023-05-26T19:09:55Z" level=warning msg="pvc-2e771f20-7bfd-47de-bb1d-cee40d37f198-e-a261d1ed: time=\"2023-05-26T19:09:31Z\" level=warning msg=\"Failed to discover\" error=\"failed to execute: nsenter [--mount=/host/proc/1/ns/mnt --net=/host/proc/1/ns/net iscsiadm -m discovery -t sendtargets -p 10.42.121.235], output , stderr Failed to connect to bus: No data available\\niscsiadm: can not connect to iSCSI daemon (111)!\\nFailed to connect to bus: No data available\\niscsiadm: can not connect to iSCSI daemon (111)!\\niscsiadm: Cannot perform discovery. Initiatorname required.\\niscsiadm: Could not perform SendTargets discovery: could not connect to iscsid\\n: exit status 20\""
Can you check that the iscsid service is healthy on your nodes?
One issue that included similar errors was https://github.com/longhorn/longhorn/issues/5381.
Maybe you forgot to install open-iscsi in this cluster before deploying Longhorn? If not, we can continue to help troubleshoot.
https://github.com/longhorn/longhorn/issues/3507#issuecomment-1006831026
https://longhorn.io/docs/1.5.0/deploy/install/#installation-requirements
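To make the "is iscsid healthy?" check concrete, here is a defensive sketch you could run on each node. It assumes a systemd-based distro; note the service name varies (iscsid on RHEL/Rocky/Alma, open-iscsi on Debian/Ubuntu):

```shell
# Sketch: verify the iSCSI client stack on a node.
if command -v iscsiadm >/dev/null 2>&1; then
    ISCSIADM_STATUS="installed"
else
    ISCSIADM_STATUS="missing (install iscsi-initiator-utils or open-iscsi)"
fi
echo "iscsiadm: $ISCSIADM_STATUS"

if command -v systemctl >/dev/null 2>&1; then
    # is-active prints active/inactive; never abort the script on a stopped service
    ISCSID_STATE=$(systemctl is-active iscsid 2>/dev/null || true)
else
    ISCSID_STATE="unknown (no systemctl; check your init system)"
fi
echo "iscsid service: ${ISCSID_STATE:-inactive}"
```

If the service is installed but inactive, `systemctl enable --now iscsid` is the usual remedy on RHEL-family systems.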
Thanks for the info @ejweber , I will check that on my nodes.
@ejweber , I've checked here and I already have open-iscsi installed on my nodes:
[16:35] root@vmrmwrknodeprod03 [~]:# yum install iscsi-initiator-utils
Last metadata expiration check: 5:27:42 ago on Mon 29 May 2023 11:08:02 AM -03.
Package iscsi-initiator-utils-6.2.1.4-4.git095f59c.el8.x86_64 is already installed.
Dependencies resolved.
Nothing to do.
Complete!
Thanks @Anddiy. Can you also check that the service is running? That was the issue in https://github.com/longhorn/longhorn/issues/3507#issuecomment-1006831026.
@ejweber I am facing the same FailedAttachVolume issue, with a minor difference in the storage-class configuration:
numberOfReplicas: "2"
staleReplicaTimeout: "30"
I have checked that in my case iscsid.service is up and running.
I have the same issue. Did you find out what it was?
OK, I'm not 100% sure the logs we have are indicative of this, but we have been seeing a number of iSCSI issues recently in Rocky related to an update to the container-selinux package. You can follow along at https://github.com/longhorn/longhorn/issues/5627.
If you are running an up-to-date version of Rocky/RHEL, please check:
rpm -q container-selinux
# Am I running greater than v2.189?
getenforce
# Am I running SELinux in enforcing mode?
ausearch -m AVC | grep iscsid_t
# Am I seeing `iscsid_t` and `dac_override` related SELinux denials?
If you can answer yes to all three questions, you are likely hitting that issue.
EDIT: There was a typo in one of the above commands that was misleading. I fixed it.
@ejweber, I've checked the parameters you've mentioned. My version of container-selinux is v2.205.0 and my SELinux is running in enforcing mode. However, the ausearch -m AVC | grep iscsd_t command didn't return any results. Given these conditions, could this still be related to the iSCSI issues you mentioned, or might there be another potential cause? I obtained identical results from both my master and worker.
[root@master ~]# rpm -q container-selinux
container-selinux-2.205.0-2.module_el8.8.0+3557+7ba9cc13.noarch
[root@master ~]# getenforce
Enforcing
[root@master ~]# ausearch -m AVC | grep iscsd_t
[root@master ~]#
There was a typo in my command, unfortunately. Try ausearch -m AVC | grep iscsid_t instead. I have updated the comment above to avoid confusing others. Just drop the grep if you still get nothing, and/or look in journalctl for related messages.
Another quick test might be to fully detach the volume, then run setenforce Permissive on a node, then attach the volume to that node. If attaching works, you are almost certainly hitting #5627. You may or may not be "allowed" to do this on the system you are using.
I'm working on getting a KB out and a DaemonSet that can fix things automatically, but the medium-term solution will essentially be to execute echo '(allow iscsid_t self (capability (dac_override)))' > local_longhorn.cil && semodule -vi local_longhorn.cil on every node to give iscsiadm a permission it currently requires to do its job.
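The one-liner above can be wrapped in a small per-node script. A sketch (assumes root and the SELinux policy tools; harmless to re-run, since installing the same module twice is a no-op):

```shell
# Sketch: install a tiny SELinux CIL module allowing iscsid_t the
# dac_override capability, as described in the comment above.
CIL_RULE='(allow iscsid_t self (capability (dac_override)))'
CIL_FILE=local_longhorn.cil

printf '%s\n' "$CIL_RULE" > "$CIL_FILE"

if command -v semodule >/dev/null 2>&1; then
    # -v verbose, -i install the module into the active policy (needs root)
    semodule -vi "$CIL_FILE" || echo "semodule failed (are you root?)"
else
    echo "semodule not found; SELinux policy tools are not installed on this node"
fi
```

The module can later be removed with `semodule -r local_longhorn` if a fixed container-selinux package makes it unnecessary.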
Output of the grep:
type=SYSCALL msg=audit(1686131484.778:2573): arch=c000003e syscall=83 success=no exit=-13 a0=557457b665d0 a1=1f8 a2=fffffffffffffeb0 a3=7ffd86516cb7 items=0 ppid=1187339 pid=1187829 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iscsiadm" exe="/usr/sbin/iscsiadm" subj=system_u:system_r:iscsid_t:s0 key=(null)
type=AVC msg=audit(1686131484.778:2573): avc: denied { dac_override } for pid=1187829 comm="iscsiadm" capability=1 scontext=system_u:system_r:iscsid_t:s0 tcontext=system_u:system_r:iscsid_t:s0 tclass=capability permissive=0
type=SYSCALL msg=audit(1686138208.882:2989): arch=c000003e syscall=59 success=yes exit=0 a0=7ffdf6cdd1a0 a1=7ffdf6cdd3e0 a2=7ffdf6cdd3f8 a3=0 items=0 ppid=1341909 pid=1341947 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iscsiadm" exe="/usr/sbin/iscsiadm" subj=system_u:system_r:iscsid_t:s0 key=(null)
type=AVC msg=audit(1686138208.882:2989): avc: denied { read } for pid=1341947 comm="iscsiadm" path="/dev/null" dev="tmpfs" ino=7624775 scontext=system_u:system_r:iscsid_t:s0 tcontext=system_u:object_r:container_runtime_tmpfs_t:s0 tclass=chr_file permissive=0
Output of journalctl:
juin 08 17:08:45 master avahi-daemon[1309]: New relevant interface vethcd226a42.IPv6 for mDNS.
juin 08 17:08:45 master avahi-daemon[1309]: Registering new address record for fe80::2441:76ff:fe47:ca66 on vethcd226a42.*.
juin 08 17:08:45 master kubelet[627130]: I0608 17:08:45.984915 627130 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="longhorn-system/longhorn-csi-plugin-r7cb5" podStartSL>
juin 08 17:08:46 master avahi-daemon[1309]: Joining mDNS multicast group on interface vethf6cf4c46.IPv6 with address fe80::1c49:55ff:fe17:dfb3.
juin 08 17:08:46 master avahi-daemon[1309]: New relevant interface vethf6cf4c46.IPv6 for mDNS.
juin 08 17:08:46 master avahi-daemon[1309]: Registering new address record for fe80::1c49:55ff:fe17:dfb3 on vethf6cf4c46.*.
juin 08 17:08:46 master kubelet[627130]: I0608 17:08:46.382253 627130 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="longhorn-system/csi-snapshotter-588bd6f85d-bm8pq" pod>
juin 08 17:08:46 master kubelet[627130]: I0608 17:08:46.543372 627130 csi_plugin.go:99] kubernetes.io/csi: Trying to validate a new CSI Driver with name: driver.longhorn.io endpoint: /var/lib/kube>
juin 08 17:08:46 master kubelet[627130]: I0608 17:08:46.543413 627130 csi_plugin.go:112] kubernetes.io/csi: Register new plugin with name: driver.longhorn.io at endpoint: /var/lib/kubelet/plugins/>
Using setenforce Permissive did allow the volume to attach, but in the end, I tried to use echo '(allow iscsid_t self (capability (dac_override)))' > local_longhorn.cil && semodule -vi local_longhorn.cil, and it did not work.
type=AVC msg=audit(1686138208.882:2989): avc: denied { read } for pid=1341947 comm="iscsiadm" path="/dev/null" dev="tmpfs" ino=7624775 scontext=system_u:system_r:iscsid_t:s0 tcontext=system_u:object_r:container_runtime_tmpfs_t:s0 tclass=chr_file permissive=0
This one is a new one for me. I'll have to investigate. If anyone reading this also hits it, please chime in.
Hello @ejweber, I was wondering if there have been any updates on the issue we discussed above. Could you please clarify what might be causing this issue and any potential solutions? If you need more information or details from my end to help identify or resolve this issue, please let me know what you need. Thank you in advance for your help.
Hello @ejweber, I wanted to provide you with an update on my situation. After several attempts, I managed to get the volumes attached, but the process was far from seamless. I had to reset my entire cluster and perform a yum remove followed by a yum install for all packages on my AlmaLinux. Now, things seem to be working.
However, I have been encountering a warning at the start of each pod launch:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 56m (x2 over 56m) default-scheduler 0/2 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod..
Normal Scheduled 56m default-scheduler Successfully assigned default/keycloak-postgresql-0 to master
Normal SuccessfulAttachVolume 56m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-c1287895-8a07-4906-b507-9e26f24b7722"
Normal Pulled 55m kubelet Container image "docker.io/bitnami/postgresql:11.14.0-debian-10-r28" already present on machine
Normal Created 55m kubelet Created container keycloak-postgresql
Normal Started 55m kubelet Started container keycloak-postgresql
In the Longhorn dashboard the volume status is Degraded...
I'm not sure why this is happening, but it suggests that there might still be an underlying issue. Could this be related to the original problem, or is it something new? Any advice or thoughts would be appreciated.
I am thinking the second SELinux denial I previously called out is/was a red herring. I have been unable to reproduce it (or any potential negative effects) in my environment, but will still be interested in more data on it in the future.
It looks like there may be resource constraints in the cluster, as that is usually the reason for preemption related messages. The "pod has unbound immediate PersistentVolumeClaims" is no big deal. It just means the scheduler got to the pod before Longhorn finished creating the volume. This is only a problem if it persists for some long period of time.
If the degraded volumes are a persistent issue, I recommend opening a new issue with some additional information, like:
- kubectl get -oyaml -n longhorn-system replicas (so we can see the status of the replicas that have been created).
- Logs from the longhorn-manager running on that node.
- Logs from the longhorn-manager that owns the volume.
We can get at all of this if you are able to upload a support bundle.
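For convenience, the requested information could be gathered with something like the following sketch (assumes kubectl access; the `app=longhorn-manager` label selector is an assumption about your deployment's labels):

```shell
# Sketch: collect replica state and manager logs before filing a new issue.
if command -v kubectl >/dev/null 2>&1; then
    # Replica CRs show scheduling/state for each replica of each volume.
    kubectl get -oyaml -n longhorn-system replicas > replicas.yaml 2>/dev/null || true
    # Recent longhorn-manager logs (label selector assumed; adjust if needed).
    kubectl logs -n longhorn-system -l app=longhorn-manager --tail=500 \
        > longhorn-manager.log 2>/dev/null || true
    COLLECTED="yes"
else
    COLLECTED="no (kubectl not found)"
fi
echo "collected: $COLLECTED"
```

A support bundle from the Longhorn UI remains the most complete option, since it captures all of this plus node state.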
My first, best guess with limited information is that this is unrelated to the issues above and that some kind of resource pressure is preventing replicas from running.
I also ran into the same rpc error which turned out to be for a different reason but I'll share here for anyone who finds themselves looking at the same thing.
In my instance-manager logs I was getting:
libkmod: kmod_module_insert_module: could not find module by name='iscsi_tcp'\niscsiadm: Could not insert module tcp. Kmod error -2\niscsiadm: iSCSI driver tcp is not loaded. Load the module then retry the command.\niscsiadm: Could not perform SendTargets discovery: iSCSI driver not found. Please make sure it is loaded, and retry the operation\n: exit status 12"
This was because my kernel didn't include the iscsi_tcp module, as detailed here: https://github.com/longhorn/longhorn/issues/1853
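A quick way to check for this particular failure mode on a node might look like this (a sketch; module and tool availability vary by distro and kernel build):

```shell
# Sketch: check whether the iscsi_tcp kernel module is loaded or available.
if grep -qw iscsi_tcp /proc/modules 2>/dev/null; then
    MOD_STATE="loaded"
elif modinfo iscsi_tcp >/dev/null 2>&1; then
    MOD_STATE="available but not loaded (try: modprobe iscsi_tcp)"
else
    MOD_STATE="missing from this kernel"
fi
echo "iscsi_tcp: $MOD_STATE"
```

If the module is missing entirely (common in minimal or custom kernels, e.g. in kind nodes), it has to come from the kernel/modules package rather than from Longhorn.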
Longhorn in kind: iscsid is erroring, so the PVC can't mount. iscsid says "sendmsg: bug? ctrl_fd 5".
Describe the bug (🐛 if you encounter this issue)
Longhorn can't attach any volume to any node. I have Longhorn installed on my Kubernetes cluster and all the Longhorn pods are running 1/1, but when I tried to install PostgreSQL on my cluster, the pod got stuck in ContainerCreating status. The pod says that Longhorn can't attach the volume, with the error: Warning FailedAttachVolume 9s attachdetach-controller AttachVolume.Attach failed for volume "pvc-f6a5a7b8-a053-4cf0-be54-5ae8141b4b0b" : rpc error: code = DeadlineExceeded desc = volume pvc-f6a5a7b8-a053-4cf0-be54-5ae8141b4b0b failed to attach to node vmrmwrknodeprod06
I have a Kubernetes cluster with 9 nodes with sufficient space (the PVC needs 8Gi; I have 2TB). I've already tried the solution of reducing the replica count to 1, and I've already turned on Replica Node Level Soft Anti-Affinity.
To Reproduce
Steps to reproduce the behavior: install Longhorn on a Kubernetes cluster, then try to attach any PVC using a StorageClass created by this YAML:
Expected behavior
The PVC attaches successfully.
Log or Support bundle
The only logs I've found are the pod logs. I'm trying to run a PostgreSQL pod on my cluster; the PVC has been created successfully, as we can see here:
But in the Longhorn UI, the PVC is stuck attaching and detaching. I've found that this is happening because the PostgreSQL pod is reporting this error:
With that I've learned that the PVC keeps attaching and detaching: when this error occurs, Kubernetes kills the pod and recreates it on another node, so Longhorn detaches the volume and tries to attach it again, but the attach fails with the same error, creating the attach/detach loop. I don't know how to fix this.
Environment
Additional context