Open t1murl opened 2 years ago
Could you please check whether the volume can still be mounted from a single replica? Please refer to the doc. If not, please run fsck to fix the underlying block device.
As for the progress, I think it's a longhorn-manager backend issue.
@smallteeths, to double-confirm: the replica rebuilding progress bar is rendered directly from the longhorn-manager backend response, right?
Yes. I get it from the rebuildStatus in the volume and find the corresponding replica; the progress bar is calculated from rebuildStatus[current replica][x].progress. @jenting
It is best to look at the data returned for the volume to determine the issue.
Could you please provide us with the support bundle? Attach it here or send it to longhorn-support-bundle@suse.com, thanks.
Leaving a note here: the progress is calculated at the file-syncing server (receiver side). It looks like the sender-side and receiver-side data sizes are mismatched, which is why the progress went over 100%.
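To illustrate the note above: if the receiver derives progress by accumulating every byte it writes, a sender that reconnects and re-sends already-written ranges inflates the total past the expected file size. The sketch below is illustrative only, with hypothetical names, not Longhorn's actual sync code; it shows the over-100% symptom and how clamping would at least bound the reported value.

```go
package main

import "fmt"

// syncReceiver models a file-sync receiver that tracks progress by
// accumulating bytes written. All names here are hypothetical.
type syncReceiver struct {
	expectedSize int64 // file size advertised by the sender
	received     int64 // total bytes written, including re-sent ranges
}

func (r *syncReceiver) onWrite(n int64) {
	r.received += n
}

// rawProgress mirrors a naive counter-based calculation: it keeps
// growing past 100 when re-sent ranges are counted twice.
func (r *syncReceiver) rawProgress() int {
	return int(r.received * 100 / r.expectedSize)
}

// clampedProgress bounds the reported value at 100. This hides the
// double counting in the UI but does not fix the underlying counter.
func (r *syncReceiver) clampedProgress() int {
	if p := r.rawProgress(); p < 100 {
		return p
	}
	return 100
}

func main() {
	r := &syncReceiver{expectedSize: 1000}
	r.onWrite(800) // first attempt, interrupted by a connection reset
	r.onWrite(600) // sender restarts and re-sends overlapping ranges
	fmt.Println(r.rawProgress())     // prints 140: over 100%, as in this issue
	fmt.Println(r.clampedProgress()) // prints 100
}
```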
Could you please check whether the volume is still able to be mounted from a single replica? Please refer to the doc. If not, please run fsck to fix the underlying block device.
For the progress, I think it's a longhorn-manager backend issue.
Hi @jenting, sorry for my delayed response. Since this cluster is used for development, it has already been torn down and recreated. I'll gladly try the steps mentioned in the docs once this specific issue reoccurs.
Could you please provide us with the support bundle? Attach to here or send to longhorn-support-bundle@suse.com, thx.
I've just sent the support bundle to the mentioned e-mail address.
Thank you for looking into this issue 👍. If there's anything else I can provide, please let me know.
I just checked the replica logs in the support bundle. There are lots of connection resets/timeouts during the rebuilding. I guess this may be the culprit:
2021-12-18T19:47:14.529782911Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:47:14Z" level=info msg="Sending file volume-snap-2952a86b-721e-4738-8985-0e3168266795.img to 10.42.2.115:10186"
2021-12-18T19:47:14.529807097Z time="2021-12-18T19:47:14Z" level=info msg="source file size: 68719476736, setting up directIO: true"
2021-12-18T19:53:29.315086685Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:53:29Z" level=error msg="writeData for batchInterval: [ 3015392: 3015424]( 32) failed, err: writeData failed, err: Post \"http://10.42.2.115:10186/v1-ssync/writeData?begin=12351045632&end=12351176704\": read tcp 10.42.0.8:58424->10.42.2.115:10186: read: connection reset by peer"
2021-12-18T19:53:29.315869463Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:53:29Z" level=error msg="writeData for batchInterval: [ 3019072: 3019104]( 32) failed, err: writeData failed, err: Post \"http://10.42.2.115:10186/v1-ssync/writeData?begin=12366118912&end=12366249984\": read tcp 10.42.0.8:54508->10.42.2.115:10186: read: connection reset by peer"
2021-12-18T19:53:29.315894311Z time="2021-12-18T19:53:29Z" level=error msg="writeData for batchInterval: [ 3017088: 3017120]( 32) failed, err: writeData failed, err: Post \"http://10.42.2.115:10186/v1-ssync/writeData?begin=12357992448&end=12358123520\": read tcp 10.42.0.8:54506->10.42.2.115:10186: read: connection reset by peer"
2021-12-18T19:53:29.315902105Z time="2021-12-18T19:53:29Z" level=error msg="writeData for batchInterval: [ 3013824: 3013856]( 32) failed, err: writeData failed, err: Post \"http://10.42.2.115:10186/v1-ssync/writeData?begin=12344623104&end=12344754176\": read tcp 10.42.0.8:54504->10.42.2.115:10186: read: connection reset by peer"
......
2021-12-18T19:53:32.314620434Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:53:32Z" level=info msg="Sending file volume-snap-2952a86b-721e-4738-8985-0e3168266795.img to 10.42.2.115:10186"
2021-12-18T19:53:32.315203333Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:53:32Z" level=info msg="source file size: 68719476736, setting up directIO: true"
2021-12-18T19:53:32.315264139Z time="2021-12-18T19:53:32Z" level=warning msg="Failed to open server: 10.42.2.115:10186, Retrying..."
2021-12-18T19:55:07.624976559Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:55:07Z" level=error msg="writeData for batchInterval: [ 744096: 744128]( 32) failed, err: writeData failed, err: Post \"http://10.42.2.115:10186/v1-ssync/writeData?begin=3047817216&end=3047948288\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2021-12-18T19:55:07.625961743Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:55:07Z" level=error msg="getServerChecksum batchInterval:[ 739744: 739776]( 32) failed, err: getChecksum failed, err: Get \"http://10.42.2.115:10186/v1-ssync/getChecksum?begin=3029991424&end=3030122496\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2021-12-18T19:55:07.626020514Z time="2021-12-18T19:55:07Z" level=error msg="getServerChecksum batchInterval:[ 683360: 683392]( 32) failed, err: getChecksum failed, err: Get \"http://10.42.2.115:10186/v1-ssync/getChecksum?begin=2799042560&end=2799173632\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2021-12-18T19:55:07.626032036Z time="2021-12-18T19:55:07Z" level=error msg="getServerChecksum batchInterval:[ 719072: 719104]( 32) failed, err: getChecksum failed, err: Get \"http://10.42.2.115:10186/v1-ssync/getChecksum?begin=2945318912&end=2945449984\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
But the rebuilding finally succeeded, so the volume should work fine.
@joshimoo @innobead @yasker Do we need to refactor/improve the rebuilding progress reporting for this kind of case?
I agree that a refactor of our progress tracking in general would be nice to have. Good job on the investigation; we should also evaluate our timeout handling/strategy.
I previously mentioned that I don't really like the way we calculate the progress for some operations (snapshot stuff): the calculation shouldn't require counting, since we already know bytePosition and maxPosition. With no counting needed, this would also fix the above issue where we double-count N bytes.
Since it does eventually work, I would say this is low priority, but we definitely need to fix it later.
Describe the bug
The rebuilding process of one replica greatly exceeds 100% progress.
While in this state, the nodes keep transmitting data at the maximum available bandwidth.
After around 50 minutes the replica is shown as healthy again in the UI.
I'm not sure whether this is only a UI bug, as data had actually been transmitted the entire time.
To Reproduce
Steps to reproduce the behavior:
After deleting a Prometheus pod from the StatefulSet of the kube-prometheus-stack, the Longhorn persistent volume mount failed with:
MountVolume.MountDevice failed for volume "pvc-856e2327-6920-4a44-bd43-a65b7aa83692" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-856e2327-6920-4a44-bd43-a65b7aa83692 but could not correct them: fsck from util-linux 2.34 /dev/longhorn/pvc-856e2327-6920-4a44-bd43-a65b7aa83692 contains a file system with errors, check forced. /dev/longhorn/pvc-856e2327-6920-4a44-bd43-a65b7aa83692: Resize inode not valid. /dev/longhorn/pvc-856e2327-6920-4a44-bd43-a65b7aa83692: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. (i.e., without -a or -p options)
At this point all three replicas were healthy.
Deleting the replica on the node where the Prometheus pod was supposed to be scheduled triggered the rebuilding process as expected.
However, the rebuilding took far longer than the baseline time according to the UI. Even after the rebuilding was reported successful, the volume mount for the affected pods failed.
Recovery was only possible after recreating the Prometheus StatefulSet and the affected volume.
Expected behavior
The rebuilding process finishes as expected, the UI displays the progress correctly, and the volume mount succeeds.
Log or Support bundle
A support bundle was created after the rebuilding finally finished. I'll gladly send or upload it; please provide me with e.g. an e-mail address, as I'd rather not upload the bundle publicly.
Environment