Can you check the job log and the longhorn-manager log?
cc @ChanYiLin
@derekbit
I've narrowed it down by turning off recurring backups, but the issue is still present. The backups might have been a coincidence because they ran every hour (at the :00 minute mark) while this issue happens every 30 minutes (at the :00 and :30 minute marks).
The 8s timeout appears in all the logs, though I can't figure out what the culprit of this issue is. The instance manager log says it couldn't read from tcp://10.42.0.225:10000 (which is the IP address of the very same instance manager pod). The k3s log says something about being unable to read/write to its database (slow queries). The dmesg log mentions faulty block devices. The sdf device it mentions does not exist right now; I assume it's some kind of temporary drive that spawns during some periodic event which happens every 30 minutes?
The k8s StorageClass used in all PVCs:
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-persistent
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "1"
  fromBackup: ""
  fsType: "ext4"
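For reference, a PVC consuming this class would look roughly like the sketch below; the claim name and requested size are illustrative placeholders, not objects from the affected cluster:

```yaml
# Illustrative sketch only: a PVC bound to the longhorn-persistent class above.
# The metadata.name and requested storage are placeholders.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: example-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-persistent
  resources:
    requests:
      storage: 1Gi
```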
Attaching the IO timeout error:
2024-06-18T14:30:12.676407568+02:00 Jun 18 12:30:03.901872: ->10.42.0.225:10001 W[ 4kB] 8750903us failed
Can you provide more information about your environment?
- Longhorn version:
- Impacted volume (PV):
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
  - Number of control plane nodes in the cluster:
  - Number of worker nodes in the cluster:
- Node config
  - OS type and version:
  - Kernel version:
  - CPU per node:
  - Memory per node:
  - Disk type (e.g. SSD/NVMe/HDD):
  - Network bandwidth between the nodes (Gbps):
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster:
sure thing @derekbit
Worth mentioning regarding the PV: the faulty one is 1G, whilst the other 1G volume doesn't have this issue. The difference between the two is that the faulty one holds mysql data while the other one holds rabbitmq data. The faulty one uses ~200MB while the rabbit one uses ~30MB.
I'm not sure if the 8-second timeout is caused by the disk or by CPU starvation. If possible, can you scale down the volume, update the global setting engine-replica-timeout to a bigger value (15 seconds), scale it up, and try to back it up again?
BTW, can you provide a support bundle as well? You can send it to longhorn-support-bundle@suse.com. We would like to check more details in it. Thank you.
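For reference, engine-replica-timeout is exposed as a Longhorn Setting custom resource; a minimal sketch of the change, assuming a default installation in the longhorn-system namespace (the same setting can also be changed from the Longhorn UI):

```yaml
# Sketch, assuming Longhorn runs in the longhorn-system namespace.
# The Setting's value is a string holding the timeout in seconds.
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: engine-replica-timeout
  namespace: longhorn-system
value: "15"
```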
So I've increased the timeout from 8 to 20s, but the PVC is still failing; however, I can see that it now takes longer to fail.
If you look at the dstat screenshot below, at the 09:00:09 mark the load average starts to rise rapidly and keeps rising until 09:00:53, then starts dropping. Before this change it had been rising until the ~00:35 second mark, so increasing the timeout prolonged this load-average rise.
Simultaneously I ran iostat (log attached below) and noticed that the virtual disk (sdd), which backs the faulty PVC, has its %util at 100 (i.e. fully saturated); then, after the Longhorn timeout passes, it drops to 0. Afterwards sda, the physical disk, has a spike, and finally it settles.
I've sent the bundle to your email address. Thanks!
iostat log
Thanks @mike-code. Can you adjust the global setting backup-compression-method to 1 to reduce the IO consumption of a backup?
did that but no budge.
now I've noticed it's not just longhorn/k3s that freezes; IO on the disk stops working altogether. Even doing touch foo will not work until these ~20 seconds pass.
we've found that the culprit is unrelated to longhorn. this issue may be closed.
@mike-code Can you elaborate more on the culprit? Thank you.
@derekbit sure. The k3s cluster is running in a VM. The VM management layer (Proxmox) has been taking snapshots every 30 minutes, coincidentally on the same schedule as the events in Longhorn. These snapshots were (to some extent) partially locking the qemu drive, leading to a severe hiccup.
I had posted wrong information earlier that the server is bare metal (that's what I thought back then).
@mike-code Thanks for the update!
Discussed in https://github.com/longhorn/longhorn/discussions/8765