longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] Read-only filesystem - pod is not automatically deleted #5561

Open cod-r opened 1 year ago

cod-r commented 1 year ago

Describe the bug (🐛 if you encounter this issue)

Loki and Prometheus are getting read-only filesystem errors, and the pods are not restarted even though I have enabled the setting "Automatically Delete Workload Pod when The Volume Is Detached Unexpectedly".

This is happening every few days and the only way to solve the problem is to manually restart the pods.
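Until automatic deletion works, the read-only condition can at least be detected from inside the affected pod. A minimal sketch, assuming the volume is formatted as ext4 or XFS and that the mount table is available in `/proc/mounts` format; the function name `read_only_mounts` and the sample mount point `/prometheus` are hypothetical:

```python
# Sketch: find filesystems that are currently mounted read-only.
# Input is text in /proc/mounts format:
#   device mount_point fstype options dump pass

def read_only_mounts(mounts_text, fstypes=("ext4", "xfs")):
    """Return mount points of the given filesystem types that carry the 'ro' option."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fstype, options = fields[:4]
        if fstype in fstypes and "ro" in options.split(","):
            result.append(mount_point)
    return result


# Hypothetical sample; inside a pod you would pass the contents of /proc/mounts.
SAMPLE = """\
/dev/sdo /prometheus ext4 ro,relatime 0 0
/dev/sda1 / ext4 rw,relatime 0 0
tmpfs /run tmpfs rw,nosuid 0 0
"""

for mount_point in read_only_mounts(SAMPLE):
    print(f"read-only: {mount_point}")
```

A check like this could back a liveness probe so that Kubernetes restarts the container itself, though that is a workaround and not a fix for the missing automatic deletion.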

Expected behavior

The pods are automatically restarted by Longhorn.

Log or Support bundle

codr@macos ~ % kubectl -n longhorn-system get event | grep pvc-fa88e86d-c641-47c7
36m         Warning   Faulted                  engine/pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445    Detected replica pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-r-41b036df (10.42.10.73:11545) in error
36m         Normal    Delete                   engine/pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445    Removing unknown replica tcp://10.42.10.73:11545 in mode ERR from engine
36m         Normal    Delete                   engine/pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445    Removed unknown replica tcp://10.42.10.73:11545 in mode ERR from engine
18m         Normal    Rebuilding               engine/pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445    Start rebuilding replica pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-r-41b036df with Address 10.42.10.73:10135 for normal engine pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445 and volume pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38
32m         Normal    Rebuilding               engine/pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445    Detected rebuilding replica pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-r-41b036df (10.42.10.73:10135)
31m         Warning   FailedRebuilding         engine/pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445    Failed rebuilding replica with Address 10.42.10.73:10135: proxyServer=10.42.8.25:8501 destination=10.42.8.25:10026: failed to add replica tcp://10.42.10.73:10135 for volume: rpc error: code = Unknown desc = failed to sync files [{FromFileName:volume-snap-70b72dcf-82f2-426c-8d2a-ab33f329864d.img ToFileName:volume-snap-70b72dcf-82f2-426c-8d2a-ab33f329864d.img ActualSize:2264690688} {FromFileName:volume-snap-70b72dcf-82f2-426c-8d2a-ab33f329864d.img.meta ToFileName:volume-snap-70b72dcf-82f2-426c-8d2a-ab33f329864d.img.meta ActualSize:0} {FromFileName:volume-snap-daily-sn-c3b08ac7-86d4-40ad-8e93-be859bf9f94c.img ToFileName:volume-snap-daily-sn-c3b08ac7-86d4-40ad-8e93-be859bf9f94c.img ActualSize:20653568000} {FromFileName:volume-snap-daily-sn-c3b08ac7-86d4-40ad-8e93-be859bf9f94c.img.meta ToFileName:volume-snap-daily-sn-c3b08ac7-86d4-40ad-8e93-be859bf9f94c.img.meta ActualSize:0}] from tcp://10.42.8.160:11290: rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil
27m         Warning   FailedRebuilding         engine/pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445    Failed rebuilding replica with Address 10.42.10.73:10135: proxyServer=10.42.8.25:8501 destination=10.42.8.25:10026: failed to add replica tcp://10.42.10.73:10135 for volume: rpc error: code = Unknown desc = failed to create replica tcp://10.42.10.73:10135 for volume 10.42.8.25:10026: rpc error: code = Unknown desc = replica must be closed, cannot add in state: rebuilding
16m         Normal    Start                    replica/pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-r-41635581   Starts pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-r-41635581
16m         Normal    Start                    replica/pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-r-41b036df   Starts pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-r-41b036df
16m         Normal    Stop                     replica/pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-r-41b036df   Stops pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-r-41b036df
36m         Normal    Degraded                 volume/pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38               volume pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38 became degraded
11m         Normal    Healthy                  volume/pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38               volume pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38 became healthy
dmesg in the prometheus pod:

[1064807.029410] sd 20:0:0:1: rejecting I/O to offline device
[1064807.031964] Buffer I/O error on dev sdo, logical block 262144, lost sync page write
[1064807.034510] sd 20:0:0:1: rejecting I/O to offline device
[1064807.034577] JBD2: Error -5 detected when updating journal superblock for sdo-8.
[1064807.037221] Buffer I/O error on dev sdo, logical block 0, lost sync page write
[1064807.039751] JBD2: Detected IO errors while flushing file data on sdo-8
[1064807.042180] EXT4-fs (sdo): I/O error while writing superblock
[1064807.042216] sd 20:0:0:1: rejecting I/O to offline device
[1064807.045413] EXT4-fs error (device sdo): ext4_journal_check_start:61: Detected aborted journal
[1064807.048600] Buffer I/O error on dev sdo, logical block 0, lost sync page write
[1064807.051748] EXT4-fs (sdo): Remounting filesystem read-only
[1064807.059646] EXT4-fs (sdo): I/O error while writing superblock
[1064807.062208] EXT4-fs error (device sdo): ext4_journal_check_start:61: Detected aborted journal
[1065369.548608] sd 18:0:0:1: Power-on or device reset occurred
[1065369.564329] sd 18:0:0:1: [sdm] tag#81 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1065369.564338] sd 18:0:0:1: [sdm] tag#81 Sense Key : Medium Error [current]
[1065369.564344] sd 18:0:0:1: [sdm] tag#81 Add. Sense: Unrecovered read error
[1065369.564348] sd 18:0:0:1: [sdm] tag#81 CDB: Write(10) 2a 08 00 00 00 00 00 00 08 00
[1065369.564356] print_req_error: 4 callbacks suppressed
[1065369.564363] blk_update_request: critical medium error, dev sdm, sector 0 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[1065369.568936] Buffer I/O error on dev sdm, logical block 0, lost sync page write
[1065369.570526] EXT4-fs (sdm): I/O error while writing superblock
[1065369.571670] EXT4-fs error (device sdm): ext4_journal_check_start:61: Detected aborted journal
[1065369.572809] EXT4-fs (sdm): Remounting filesystem read-only
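The key lines in this output are the `Remounting filesystem read-only` messages. A small sketch for pulling them out of a longer dmesg dump, assuming the message format shown above; the function name is hypothetical:

```python
import re

# Matches lines such as:
# [1064807.051748] EXT4-fs (sdo): Remounting filesystem read-only
REMOUNT_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+EXT4-fs \((\w+)\): Remounting filesystem read-only"
)


def remounted_read_only(dmesg_text):
    """Return (seconds_since_boot, device) for each ext4 read-only remount."""
    hits = []
    for line in dmesg_text.splitlines():
        m = REMOUNT_RE.match(line.strip())
        if m:
            hits.append((float(m.group(1)), m.group(2)))
    return hits


# Excerpt copied from the dmesg output above.
SAMPLE = """\
[1064807.048600] Buffer I/O error on dev sdo, logical block 0, lost sync page write
[1064807.051748] EXT4-fs (sdo): Remounting filesystem read-only
[1065369.571670] EXT4-fs error (device sdm): ext4_journal_check_start:61: Detected aborted journal
[1065369.572809] EXT4-fs (sdm): Remounting filesystem read-only
"""

for seconds, device in remounted_read_only(SAMPLE):
    print(f"{device} remounted read-only at {seconds:.6f}s after boot")
```

On the excerpt this reports two devices, `sdo` and `sdm`, which matches the two workloads (Loki and Prometheus) mentioned in the report.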
Instance manager logs:

```
codr@macos ~ % kubectl -n longhorn-system logs instance-manager-e-a55819dc472ddea49d8d93daa3adbfd5 | grep pvc-fa88e86d-c641-47c7
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:50Z" level=error msg="R/W Timeout. No response received in 8s"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:50Z" level=error msg="Setting replica tcp://10.42.10.73:11545 to ERR due to: r/w timeout"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:12:31.502089: ->10.42.11.181:11516 W[792kB] 16037us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:12:47.360728: ->10.42.8.160:11291 W[ 4kB] 1270us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:13:17.304601: ->10.42.8.160:11291 W[ 16kB] 1841us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:13:59.784349: ->10.42.8.160:11291 P[ 0kB] 556us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:14:42.220156: ->10.42.8.160:11291 W[ 4kB] 2170us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:15:12.249499: ->10.42.8.160:11291 W[ 8kB] 913us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:15:23.889272: ->10.42.11.181:11516 P[ 0kB] 1113us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:56Z" level=warning msg="Received response message id 0 seq 169566 type 2 for non pending request"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:56Z" level=warning msg="Received response message id 0 seq 169570 type 2 for non pending request"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:56Z" level=warning msg="Received response message id 0 seq 169575 type 2 for non pending request"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:56Z" level=warning msg="Received response message id 0 seq 169580 type 2 for non pending request"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:56Z" level=error msg="Error reading from wire 10.42.10.73:11546" error=EOF
[longhorn-instance-manager] time="2023-03-14T13:15:57Z" level=info msg="Process Manager: start getting logs for process pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445"
[longhorn-instance-manager] time="2023-03-14T13:15:57Z" level=info msg="Process Manager: got logs for process pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:59Z" level=info msg="Removing backend: tcp://10.42.10.73:11545"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:59Z" level=info msg="Closing: 10.42.10.73:11545"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Opening remote: 10.42.10.73:10135"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Starting to snapshot: 10.42.11.181:11515 70b72dcf-82f2-426c-8d2a-ab33f329864d UserCreated false Created at 2023-03-14T13:19:24Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Starting to snapshot: 10.42.8.160:11290 70b72dcf-82f2-426c-8d2a-ab33f329864d UserCreated false Created at 2023-03-14T13:19:24Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Finished to snapshot: 10.42.8.160:11290 70b72dcf-82f2-426c-8d2a-ab33f329864d UserCreated false Created at 2023-03-14T13:19:24Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Finished to snapshot: 10.42.11.181:11515 70b72dcf-82f2-426c-8d2a-ab33f329864d UserCreated false Created at 2023-03-14T13:19:24Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Finished to snapshot: 10.42.10.73:10135 70b72dcf-82f2-426c-8d2a-ab33f329864d UserCreated false Created at 2023-03-14T13:19:24Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Adding backend: tcp://10.42.10.73:10135"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Set revision counter of 10.42.10.73:10135 to : 0"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Set backend tcp://10.42.10.73:10135 revision counter to 0"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:24Z" level=info msg="Synchronizing volume-head-010.img.meta to volume-head-003.img.meta:10.42.10.73:10138"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:19:54Z" level=info msg="Done synchronizing volume-head-010.img.meta to volume-head-003.img.meta:10.42.10.73:10138"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:20:03Z" level=error msg="R/W Timeout. No response received in 8s"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:20:03Z" level=error msg="Setting replica tcp://10.42.10.73:10135 to ERR due to: r/w timeout"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:18:29.889019: ->10.42.11.181:11516 P[ 0kB] 1091us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:18:42.287932: ->10.42.8.160:11291 W[ 4kB] 897us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] Mar 14 13:19:11.889335: ->10.42.11.181:11516 P[ 0kB] 1098us
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:20:25Z" level=info msg="Removing backend: tcp://10.42.10.73:10135"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:20:25Z" level=info msg="Closing: 10.42.10.73:10135"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:21:07Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:22:05Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:22:07Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:22:37Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:23:07Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:23:37Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:24:00Z" level=warning msg="Received response message id 0 seq 17 type 2 for non pending request"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:24:38Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:25:08Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:26:38Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:27:08Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:28:16Z" level=warning msg="Received response message id 0 seq 18 type 2 for non pending request"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:28:16Z" level=warning msg="Received response message id 0 seq 19 type 2 for non pending request"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:02Z" level=error msg="Error reading from wire 10.42.10.73:10136" error=EOF
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Connecting to remote: 10.42.10.73:10135 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Opening remote: 10.42.10.73:10135"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Starting to snapshot: 10.42.11.181:11515 da1ff5a4-75bd-4d95-af2d-8a07696cf07d UserCreated false Created at 2023-03-14T13:36:08Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Starting to snapshot: 10.42.8.160:11290 da1ff5a4-75bd-4d95-af2d-8a07696cf07d UserCreated false Created at 2023-03-14T13:36:08Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Finished to snapshot: 10.42.8.160:11290 da1ff5a4-75bd-4d95-af2d-8a07696cf07d UserCreated false Created at 2023-03-14T13:36:08Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Finished to snapshot: 10.42.11.181:11515 da1ff5a4-75bd-4d95-af2d-8a07696cf07d UserCreated false Created at 2023-03-14T13:36:08Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Finished to snapshot: 10.42.10.73:10135 da1ff5a4-75bd-4d95-af2d-8a07696cf07d UserCreated false Created at 2023-03-14T13:36:08Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Start monitoring tcp://10.42.10.73:10135"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Set revision counter of 10.42.10.73:10135 to : 0"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Set backend tcp://10.42.10.73:10135 revision counter to 0"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=warning msg="Removed extra disks map[volume-snap-daily-sn-f3cb275f-7099-44b8-874b-3164a13f2b56.img:{volume-snap-daily-sn-f3cb275f-7099-44b8-874b-3164a13f2b56.img map[volume-snap-expand-21474836480.img:true] false true 2023-03-14T04:01:54Z 3014864896 map[RecurringJob:daily-snapshot]} volume-snap-expand-21474836480.img:{volume-snap-expand-21474836480.img volume-snap-daily-sn-f3cb275f-7099-44b8-874b-3164a13f2b56.img map[volume-snap-da1ff5a4-75bd-4d95-af2d-8a07696cf07d.img:true] false false 2023-03-14T13:36:08Z 682557440 map[replica-expansion:21474836480]}] in replica tcp://10.42.10.73:10135"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Synchronizing volume-head-001.img.meta to volume-head-004.img.meta:10.42.10.73:10138"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:08Z" level=info msg="Done synchronizing volume-head-001.img.meta to volume-head-004.img.meta:10.42.10.73:10138"
[longhorn-instance-manager] time="2023-03-14T13:36:18Z" level=info msg="Process Manager: start getting logs for process pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445"
[longhorn-instance-manager] time="2023-03-14T13:36:19Z" level=info msg="Process Manager: got logs for process pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:19Z" level=error msg="Error reading from wire 10.42.10.73:10136" error=EOF
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:19Z" level=info msg="Removing backend: tcp://10.42.10.73:10135"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:31Z" level=info msg="Connecting to remote: 10.42.9.21:10030 (tcp)"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:31Z" level=info msg="Opening remote: 10.42.9.21:10030"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:32Z" level=info msg="Finished to snapshot: 10.42.8.160:11290 248d9917-894d-4c3c-8fef-ecd2d5f55083 UserCreated false Created at 2023-03-14T13:36:32Z, Labels map[]"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:32Z" level=info msg="Set backend tcp://10.42.9.21:10030 revision counter to 0"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:32Z" level=info msg="Synchronizing volume-head-002.img.meta to volume-head-001.img.meta:10.42.9.21:10033"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:36:32Z" level=info msg="Done synchronizing volume-head-002.img.meta to volume-head-001.img.meta:10.42.9.21:10033"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:40:46Z" level=info msg="Got backend tcp://10.42.11.181:11515 revision counter 2607299"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:40:46Z" level=info msg="Set backend tcp://10.42.9.21:10030 revision counter to 2607299"
[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:40:46Z" level=info msg="Setting replica tcp://10.42.9.21:10030 to mode RW"
```
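The entries that matter most in these logs are the ones where the engine marks a replica as `ERR`. A minimal sketch for extracting them, assuming the `time=... level=error msg="Setting replica ... to ERR due to: ..."` format seen above; the function name is hypothetical, and `finditer` is used so that it also works when the log arrives as a single unbroken line:

```python
import re

# Matches engine log entries of the form:
# time="..." level=error msg="Setting replica tcp://IP:PORT to ERR due to: REASON"
ERR_RE = re.compile(
    r'time="(?P<time>[^"]+)" level=error '
    r'msg="Setting replica (?P<replica>tcp://[\d.]+:\d+) to ERR due to: (?P<reason>[^"]+)"'
)


def replicas_marked_err(log_text):
    """Return (timestamp, replica_address, reason) for every replica set to ERR."""
    return [(m["time"], m["replica"], m["reason"]) for m in ERR_RE.finditer(log_text)]


# Two entries copied from the instance manager logs above.
SAMPLE = (
    '[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:15:50Z" '
    'level=error msg="Setting replica tcp://10.42.10.73:11545 to ERR due to: r/w timeout" '
    '[pvc-fa88e86d-c641-47c7-be50-0413b6b3ff38-e-0baec445] time="2023-03-14T13:20:03Z" '
    'level=error msg="Setting replica tcp://10.42.10.73:10135 to ERR due to: r/w timeout"'
)

for timestamp, replica, reason in replicas_marked_err(SAMPLE):
    print(f"{timestamp}  {replica}  {reason}")
```

In the logs above both hits are for replica processes at 10.42.10.73, and both list `r/w timeout` as the reason.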

Environment

derekbit commented 1 year ago

Can you check whether there are I/O timeout errors for the problematic volume in the instance-manager-r logs? Also, what is your network bandwidth?

cod-r commented 1 year ago

I added the instance manager logs in the first message, click on the dropdown.

Yes I can see some timeout errors there. But my problem is why the pods are not restarted automatically.

EDIT: I also have a support bundle but I cannot share it here.

derekbit commented 1 year ago

> I added the instance manager logs in the first message, click on the dropdown.
>
> Yes I can see some timeout errors there. But my problem is why the pods are not restarted automatically.

You have probably hit an issue similar to https://github.com/longhorn/longhorn/issues/3325. Can you provide the support bundle for further investigation? The events alone are not enough to diagnose this. Please send it to longhorn-support-bundle@suse.com.

cod-r commented 1 year ago

Sent the support bundle.

cod-r commented 1 year ago

We have a 10 Gbps connection between nodes. I ran a test with iperf3 between two pods on different nodes, and these are the results:

root@helloworld-775cbcd879-qkmpt:/# iperf3 -c 10.42.11.102
Connecting to host 10.42.11.102, port 5201
[  5] local 10.42.10.254 port 42964 connected to 10.42.11.102 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   175 MBytes  1.47 Gbits/sec   61    703 KBytes       
[  5]   1.00-2.00   sec   165 MBytes  1.38 Gbits/sec    0    856 KBytes       
[  5]   2.00-3.00   sec   181 MBytes  1.52 Gbits/sec   70    636 KBytes       
[  5]   3.00-4.00   sec   181 MBytes  1.52 Gbits/sec   57    616 KBytes       
[  5]   4.00-5.00   sec   191 MBytes  1.60 Gbits/sec   21    635 KBytes       
[  5]   5.00-6.00   sec   175 MBytes  1.47 Gbits/sec    0    811 KBytes       
[  5]   6.00-7.00   sec   199 MBytes  1.67 Gbits/sec  152    759 KBytes       
[  5]   7.00-8.00   sec   196 MBytes  1.65 Gbits/sec   40    707 KBytes       
[  5]   8.00-9.00   sec   179 MBytes  1.50 Gbits/sec   83    629 KBytes       
[  5]   9.00-10.00  sec   180 MBytes  1.51 Gbits/sec   46    638 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.78 GBytes  1.53 Gbits/sec  530             sender
[  5]   0.00-10.00  sec  1.78 GBytes  1.53 Gbits/sec                  receiver

Does Longhorn need higher speeds?
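To put the iperf3 numbers next to the nominal link speed, here is a small sketch that averages the per-interval bitrates and sums the retransmits. It assumes the default iperf3 text output shown above with rates reported in Gbits/sec; the function name and the `link_gbps` parameter are hypothetical:

```python
import re

# Matches per-interval rows such as:
# [  5]   0.00-1.00   sec   175 MBytes  1.47 Gbits/sec   61    703 KBytes
# The sender/receiver summary rows have no Cwnd column and are skipped.
INTERVAL_RE = re.compile(
    r"^\[\s*\d+\]\s+[\d.]+-[\d.]+\s+sec\s+[\d.]+\s+\w+\s+"
    r"(?P<gbps>[\d.]+)\s+Gbits/sec\s+(?P<retr>\d+)\s+[\d.]+\s+KBytes"
)


def summarize(iperf_text, link_gbps=10.0):
    """Return interval count, mean Gbit/s, total retransmits and share of the link."""
    rates, retransmits = [], 0
    for line in iperf_text.splitlines():
        m = INTERVAL_RE.match(line.strip())
        if m:
            rates.append(float(m["gbps"]))
            retransmits += int(m["retr"])
    if not rates:
        raise ValueError("no per-interval rows found")
    mean = sum(rates) / len(rates)
    return {
        "intervals": len(rates),
        "mean_gbps": mean,
        "retransmits": retransmits,
        "link_share": mean / link_gbps,
    }


# Rows copied from the iperf3 output above.
SAMPLE = """\
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   175 MBytes  1.47 Gbits/sec   61    703 KBytes
[  5]   1.00-2.00   sec   165 MBytes  1.38 Gbits/sec    0    856 KBytes
[  5]   2.00-3.00   sec   181 MBytes  1.52 Gbits/sec   70    636 KBytes
[  5]   3.00-4.00   sec   181 MBytes  1.52 Gbits/sec   57    616 KBytes
[  5]   4.00-5.00   sec   191 MBytes  1.60 Gbits/sec   21    635 KBytes
[  5]   5.00-6.00   sec   175 MBytes  1.47 Gbits/sec    0    811 KBytes
[  5]   6.00-7.00   sec   199 MBytes  1.67 Gbits/sec  152    759 KBytes
[  5]   7.00-8.00   sec   196 MBytes  1.65 Gbits/sec   40    707 KBytes
[  5]   8.00-9.00   sec   179 MBytes  1.50 Gbits/sec   83    629 KBytes
[  5]   9.00-10.00  sec   180 MBytes  1.51 Gbits/sec   46    638 KBytes
[  5]   0.00-10.00  sec  1.78 GBytes  1.53 Gbits/sec  530             sender
"""

stats = summarize(SAMPLE)
print(
    f"{stats['mean_gbps']:.2f} Gbit/s mean over {stats['intervals']} intervals, "
    f"{stats['retransmits']} retransmits, {stats['link_share']:.0%} of the link"
)
```

For this run the pod-to-pod mean is about 1.53 Gbit/s with 530 retransmits in 10 seconds, roughly 15% of the nominal 10 Gbps. Whether that is related to the 8s R/W timeouts in the engine log cannot be concluded from this test alone.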