Open cod-r opened 1 year ago
Can you check whether there are I/O timeout errors for the problematic volume in the instance-manager-r logs? BTW, what's your network bandwidth?
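A minimal sketch of one way to run that check, assuming the log entries use the `time="..." level=... msg="..."` layout seen in the instance manager logs in this issue. The function name `find_timeouts` and the sample text are illustrative only and are not part of Longhorn.

```python
import re

# Assumed entry layout, taken from the log excerpt in this issue:
#   time="<timestamp>" level=<level> msg="<message>"
ENTRY = re.compile(
    r'time="(?P<time>[^"]+)" level=(?P<level>\w+) msg="(?P<msg>[^"]*)"'
)


def find_timeouts(log_text):
    """Return (timestamp, message) pairs for error entries that mention a timeout."""
    hits = []
    for match in ENTRY.finditer(log_text):
        is_error = match.group("level") == "error"
        mentions_timeout = "timeout" in match.group("msg").lower()
        if is_error and mentions_timeout:
            hits.append((match.group("time"), match.group("msg")))
    return hits


# Three entries copied from the log in this issue: two timeout errors, one info line.
sample = (
    'time="2023-03-14T13:15:50Z" level=error '
    'msg="R/W Timeout. No response received in 8s"\n'
    'time="2023-03-14T13:15:50Z" level=error '
    'msg="Setting replica tcp://10.42.10.73:11545 to ERR due to: r/w timeout"\n'
    'time="2023-03-14T13:15:59Z" level=info '
    'msg="Removing backend: tcp://10.42.10.73:11545"\n'
)

for timestamp, message in find_timeouts(sample):
    print(timestamp, message)
```

Because the pattern is matched with `finditer` over the whole text rather than line by line, it also works on log output whose line breaks were lost when it was pasted.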
I added the instance manager logs to the first message; click on the dropdown to see them.
Yes, I can see some timeout errors there. But my question is why the pods are not restarted automatically.
EDIT: I also have a support bundle but I cannot share it here.
You have probably hit an issue similar to https://github.com/longhorn/longhorn/issues/3325. Can you provide the support bundle for further investigation? The events alone are not enough to check. Please send it to longhorn-support-bundle@suse.com.
Sent the support bundle.
We have a 10 Gbps connection between nodes.
I ran a test with iperf3 between two pods on different nodes, and these are the results:
root@helloworld-775cbcd879-qkmpt:/# iperf3 -c 10.42.11.102
Connecting to host 10.42.11.102, port 5201
[ 5] local 10.42.10.254 port 42964 connected to 10.42.11.102 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 175 MBytes 1.47 Gbits/sec 61 703 KBytes
[ 5] 1.00-2.00 sec 165 MBytes 1.38 Gbits/sec 0 856 KBytes
[ 5] 2.00-3.00 sec 181 MBytes 1.52 Gbits/sec 70 636 KBytes
[ 5] 3.00-4.00 sec 181 MBytes 1.52 Gbits/sec 57 616 KBytes
[ 5] 4.00-5.00 sec 191 MBytes 1.60 Gbits/sec 21 635 KBytes
[ 5] 5.00-6.00 sec 175 MBytes 1.47 Gbits/sec 0 811 KBytes
[ 5] 6.00-7.00 sec 199 MBytes 1.67 Gbits/sec 152 759 KBytes
[ 5] 7.00-8.00 sec 196 MBytes 1.65 Gbits/sec 40 707 KBytes
[ 5] 8.00-9.00 sec 179 MBytes 1.50 Gbits/sec 83 629 KBytes
[ 5] 9.00-10.00 sec 180 MBytes 1.51 Gbits/sec 46 638 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1.78 GBytes 1.53 Gbits/sec 530 sender
[ 5] 0.00-10.00 sec 1.78 GBytes 1.53 Gbits/sec receiver
Does Longhorn need higher speeds?
Describe the bug (🐛 if you encounter this issue)
Loki and Prometheus are getting read-only filesystem errors, and the pods are not restarted even though I have enabled the "Automatically Delete Workload Pod when The Volume Is Detached Unexpectedly" setting.
This is happening every few days and the only way to solve the problem is to manually restart the pods.
Expected behavior
The pods are automatically restarted by Longhorn.
Log or Support bundle
Click for Instance manager logs
```
codr@macos ~ % kubectl -n longhorn-system logs instance-manager-e-a55819dc472ddea49d8d93daa3adbfd5 | grep pvc-fa88e86d-c641-47c7
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:50Z" level=error msg="R/W Timeout. No response received in 8s"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:50Z" level=error msg="Setting replica tcp://10.42.10.73:11545 to ERR due to: r/w timeout"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:12:31.502089: ->10.42.11.181:11516 W[792kB] 16037us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:12:47.360728: ->10.42.8.160:11291 W[ 4kB] 1270us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:13:17.304601: ->10.42.8.160:11291 W[ 16kB] 1841us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:13:59.784349: ->10.42.8.160:11291 P[ 0kB] 556us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:14:42.220156: ->10.42.8.160:11291 W[ 4kB] 2170us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:15:12.249499: ->10.42.8.160:11291 W[ 8kB] 913us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:15:23.889272: ->10.42.11.181:11516 P[ 0kB] 1113us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:56Z" level=warning msg="Received response message id 0 seq 169566 type 2 for non pending request"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:56Z" level=warning msg="Received response message id 0 seq 169570 type 2 for non pending request"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:56Z" level=warning msg="Received response message id 0 seq 169575 type 2 for non pending request"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:56Z" level=warning msg="Received response message id 0 seq 169580 type 2 for non pending request"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:56Z" level=error msg="Error reading from wire 10.42.10.73:11546" error=EOF
[longhorn-instance-manager] time="2023-03-14T13:15:57Z" level=info msg="Process Manager: start getting logs for process pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445"
[longhorn-instance-manager] time="2023-03-14T13:15:57Z" level=info msg="Process Manager: got logs for process pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:59Z" level=info msg="Removing backend: tcp://10.42.10.73:11545"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:59Z" level=info msg="Closing: 10.42.10.73:11545"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Opening remote: 10.42.10.73:10135"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Starting to snapshot: 10.42.11.181:11515 70b72dcf-82f2-426c-8d2a-ab33f329864d UserCreated false Created at 2023-03-14T13:19:24Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Starting to snapshot: 10.42.8.160:11290 70b72dcf-82f2-426c-8d2a-ab33f329864d UserCreated false Created at 2023-03-14T13:19:24Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Finished to snapshot: 10.42.8.160:11290 70b72dcf-82f2-426c-8d2a-ab33f329864d UserCreated false Created at 2023-03-14T13:19:24Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Finished to snapshot: 10.42.11.181:11515 70b72dcf-82f2-426c-8d2a-ab33f329864d UserCreated false Created at 2023-03-14T13:19:24Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Finished to snapshot: 10.42.10.73:10135 70b72dcf-82f2-426c-8d2a-ab33f329864d UserCreated false Created at 2023-03-14T13:19:24Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Adding backend: tcp://10.42.10.73:10135"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Set revision counter of 10.42.10.73:10135 to : 0"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Set backend tcp://10.42.10.73:10135 revision counter to 0"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Synchronizing volume-head-010.img.meta to volume-head-003.img.meta:10.42.10.73:10138"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:54Z" level=info msg="Done synchronizing volume-head-010.img.meta to volume-head-003.img.meta:10.42.10.73:10138"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:20:03Z" level=error msg="R/W Timeout. No response received in 8s"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:20:03Z" level=error msg="Setting replica tcp://10.42.10.73:10135 to ERR due to: r/w timeout"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:18:29.889019: ->10.42.11.181:11516 P[ 0kB] 1091us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:18:42.287932: ->10.42.8.160:11291 W[ 4kB] 897us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:19:11.889335: ->10.42.11.181:11516 P[ 0kB] 1098us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:20:25Z" level=info msg="Removing backend: tcp://10.42.10.73:10135"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:20:25Z" level=info msg="Closing: 10.42.10.73:10135"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:21:07Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:22:05Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:22:07Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:22:37Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:23:07Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:23:37Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:24:00Z" level=warning msg="Received response message id 0 seq 17 type 2 for non pending request"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:24:38Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:25:08Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:26:38Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:27:08Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:28:16Z" level=warning msg="Received response message id 0 seq 18 type 2 for non pending request"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:28:16Z" level=warning msg="Received response message id 0 seq 19 type 2 for non pending request"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:02Z" level=error msg="Error reading from wire 10.42.10.73:10136" error=EOF
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Opening remote: 10.42.10.73:10135"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Starting to snapshot: 10.42.11.181:11515 da1ff5a4-75bd-4d95-af2d-8a07696cf07d UserCreated false Created at 2023-03-14T13:36:08Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Starting to snapshot: 10.42.8.160:11290 da1ff5a4-75bd-4d95-af2d-8a07696cf07d UserCreated false Created at 2023-03-14T13:36:08Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Finished to snapshot: 10.42.8.160:11290 da1ff5a4-75bd-4d95-af2d-8a07696cf07d UserCreated false Created at 2023-03-14T13:36:08Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Finished to snapshot: 10.42.11.181:11515 da1ff5a4-75bd-4d95-af2d-8a07696cf07d UserCreated false Created at 2023-03-14T13:36:08Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Finished to snapshot: 10.42.10.73:10135 da1ff5a4-75bd-4d95-af2d-8a07696cf07d UserCreated false Created at 2023-03-14T13:36:08Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Start monitoring tcp://10.42.10.73:10135"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Set revision counter of 10.42.10.73:10135 to : 0"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Set backend tcp://10.42.10.73:10135 revision counter to 0"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=warning msg="Removed extra disks map[volume-snap-daily-sn-f3cb275f-7099-44b8-874b-3164a13f2b56.img:{volume-snap-daily-sn-f3cb275f-7099-44b8-874b-3164a13f2b56.img map[volume-snap-expand-21474836480.img:true] false true 2023-03-14T04:01:54Z 3014864896 map[RecurringJob:daily-snapshot]} volume-snap-expand-21474836480.img:{volume-snap-expand-21474836480.img volume-snap-daily-sn-f3cb275f-7099-44b8-874b-3164a13f2b56.img map[volume-snap-da1ff5a4-75bd-4d95-af2d-8a07696cf07d.img:true] false false 2023-03-14T13:36:08Z 682557440 map[replica-expansion:21474836480]}] in replica tcp://10.42.10.73:10135"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Synchronizing volume-head-001.img.meta to volume-head-004.img.meta:10.42.10.73:10138"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Done synchronizing volume-head-001.img.meta to volume-head-004.img.meta:10.42.10.73:10138"
[longhorn-instance-manager] time="2023-03-14T13:36:18Z" level=info msg="Process Manager: start getting logs for process pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445"
[longhorn-instance-manager] time="2023-03-14T13:36:19Z" level=info msg="Process Manager: got logs for process pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:19Z" level=error msg="Error reading from wire 10.42.10.73:10136" error=EOF
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:19Z" level=info msg="Removing backend: tcp://10.42.10.73:10135"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:31Z" level=info msg="Connecting to remote: 10.42.9.21:10030 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:31Z" level=info msg="Opening remote: 10.42.9.21:10030"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:32Z" level=info msg="Finished to snapshot: 10.42.8.160:11290 248d9917-894d-4c3c-8fef-ecd2d5f55083 UserCreated false Created at 2023-03-14T13:36:32Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:32Z" level=info msg="Set backend tcp://10.42.9.21:10030 revision counter to 0"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:32Z" level=info msg="Synchronizing volume-head-002.img.meta to volume-head-001.img.meta:10.42.9.21:10033"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:32Z" level=info msg="Done synchronizing volume-head-002.img.meta to volume-head-001.img.meta:10.42.9.21:10033"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:40:46Z" level=info msg="Got backend tcp://10.42.11.181:11515 revision counter 2607299"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:40:46Z" level=info msg="Set backend tcp://10.42.9.21:10030 revision counter to 2607299"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:40:46Z" level=info msg="Setting replica tcp://10.42.9.21:10030 to mode RW"
```
Environment