longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] Degraded volume rebuilding progress exceeds 100% according to UI and volume mount fails #3435

Open t1murl opened 2 years ago

t1murl commented 2 years ago

Describe the bug

The rebuilding process of one replica greatly exceeds 100% progress.

[Screenshot: Longhorn UI showing the degraded replica's rebuilding progress far above 100%]

While in this state, the nodes keep transmitting data at the maximum available bandwidth.

After roughly 50 minutes the replica is shown as healthy again in the UI.

I'm not sure whether this is purely a UI bug, as data was actually being transmitted the entire time.

To Reproduce

Steps to reproduce the behavior:

After deleting a Prometheus pod from the StatefulSet of the kube-prometheus-stack, the Longhorn persistent volume mount failed with:

MountVolume.MountDevice failed for volume "pvc-856e2327-6920-4a44-bd43-a65b7aa83692" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-856e2327-6920-4a44-bd43-a65b7aa83692 but could not correct them:
fsck from util-linux 2.34
/dev/longhorn/pvc-856e2327-6920-4a44-bd43-a65b7aa83692 contains a file system with errors, check forced.
/dev/longhorn/pvc-856e2327-6920-4a44-bd43-a65b7aa83692: Resize inode not valid.
/dev/longhorn/pvc-856e2327-6920-4a44-bd43-a65b7aa83692: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. (i.e., without -a or -p options)

At this point all three replicas were healthy.

Deleting the replica on the node where the Prometheus pod was supposed to be scheduled triggered the rebuilding process as expected.

However, the rebuilding takes far longer than the baseline time according to the UI. Even after the rebuilding is reported successful, the volume mount for the affected pods fails.

Recovery was only possible after recreating the Prometheus StatefulSet and recreating the affected volume.

Expected behavior

The rebuilding process finishes as expected, the UI displays the progress correctly, and the volume mount succeeds.

Log or Support bundle

A support bundle was created after the rebuilding finally finished. I'd be glad to send or upload it. Please provide me with, e.g., an e-mail address, as I'd prefer not to upload the bundle publicly.

Environment

jenting commented 2 years ago

Could you please check whether the volume is still able to be mounted from a single replica? Please refer to the doc.

If not, please run fsck to fix the underlying block device.
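For illustration only, a minimal sketch of forcing a manual filesystem check on the Longhorn block device (the device path is taken from the error above; the volume would need to be attached in maintenance mode and not mounted). Invoking it from Go via os/exec is just a sketch, not Longhorn's own tooling:

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Device path taken from the error message above.
	dev := "/dev/longhorn/pvc-856e2327-6920-4a44-bd43-a65b7aa83692"

	// Force a non-interactive check; -y answers "yes" to all repair prompts.
	// fsck exits non-zero when errors were found and corrected, so a non-nil
	// err does not necessarily mean the repair failed.
	out, err := exec.Command("fsck", "-f", "-y", dev).CombinedOutput()
	fmt.Printf("%s\n", out)
	if err != nil {
		fmt.Printf("fsck exited with: %v\n", err)
	}
}
```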

For the progress, I think it's a longhorn-manager backend issue.

jenting commented 2 years ago

@smallteeths just to double-confirm, the replica rebuilding progress bar is rendered directly from the longhorn-manager backend response, right?

smallteeths commented 2 years ago

Yes, I get it from the rebuildStatus in the volume and find the corresponding replica. The progress bar is calculated from rebuildStatus[current replica][x].progress. @jenting

It would be best to look at the data returned for the volume to determine the issue.
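As a rough sketch of that lookup, based only on the description above (type and field names are illustrative, not the actual UI or manager code):

```go
package main

import "fmt"

// RebuildEntry mirrors the shape described above: one entry per rebuilding
// replica, with a progress percentage reported by the backend.
type RebuildEntry struct {
	Replica  string
	Progress int
}

// progressFor returns the value shown in the progress bar for a given
// replica, or false if the backend reports no rebuild entry for it.
func progressFor(rebuildStatus []RebuildEntry, replica string) (int, bool) {
	for _, e := range rebuildStatus {
		if e.Replica == replica {
			return e.Progress, true
		}
	}
	return 0, false
}

func main() {
	status := []RebuildEntry{{Replica: "pvc-...-r-5edc2dbc", Progress: 201}}
	if p, ok := progressFor(status, "pvc-...-r-5edc2dbc"); ok {
		// A value above 100 here is exactly what the screenshot shows:
		// the UI renders whatever the backend reports.
		fmt.Printf("progress bar: %d%%\n", p)
	}
}
```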

jenting commented 2 years ago

Could you please provide us with the support bundle? Attach it here or send it to longhorn-support-bundle@suse.com, thanks.

jenting commented 2 years ago

Leaving a note here: the progress is calculated at the file-syncing server (receiver side). It looks like the data sizes on the sender and receiver sides are mismatched, which is why the progress exceeds 100%.
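A simplified sketch of how a receiver-side byte counter can drift past 100% when intervals are retransmitted after connection resets, as in the logs below (names are illustrative, not the actual sync-agent code):

```go
package main

import "fmt"

// syncProgress models a receiver that tracks progress by summing the bytes
// of every interval it writes, relative to the size the sender announced.
type syncProgress struct {
	expectedSize int64 // total size announced by the sender
	processed    int64 // bytes written so far, including retransmissions
}

func (p *syncProgress) record(intervalBytes int64) int {
	p.processed += intervalBytes
	return int(p.processed * 100 / p.expectedSize)
}

func main() {
	p := &syncProgress{expectedSize: 1000}

	// First pass: the full range is written once.
	fmt.Println(p.record(1000)) // 100

	// After a connection reset the sender restarts and resends intervals;
	// a naive byte counter counts them again, so the progress keeps climbing.
	fmt.Println(p.record(600)) // 160
}
```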

t1murl commented 2 years ago

Could you please check whether the volume is still able to be mounted from a single replica? Please refer to the doc.

If not, please run fsck to fix the underlying block device.

For the progress, I think it's a longhorn-manager backend issue.

Hi @jenting, sorry for my delayed response. Since this cluster is used for development, it has already been torn down and recreated. I'll gladly try the steps mentioned in the docs if this specific issue reoccurs.

Could you please provide us with the support bundle? Attach to here or send to longhorn-support-bundle@suse.com, thx.

I've just sent the support bundle to the mentioned e-mail address.

Thank you for looking into this issue 👍. If there's anything else I can provide, please let me know.

shuo-wu commented 2 years ago

I just checked the replica logs in the support bundle. There are lots of connection resets/timeouts during the rebuilding. I suspect this may be the culprit:

2021-12-18T19:47:14.529782911Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:47:14Z" level=info msg="Sending file volume-snap-2952a86b-721e-4738-8985-0e3168266795.img to 10.42.2.115:10186"
2021-12-18T19:47:14.529807097Z time="2021-12-18T19:47:14Z" level=info msg="source file size: 68719476736, setting up directIO: true"
2021-12-18T19:53:29.315086685Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:53:29Z" level=error msg="writeData for batchInterval: [ 3015392: 3015424]( 32) failed, err: writeData failed, err: Post \"http://10.42.2.115:10186/v1-ssync/writeData?begin=12351045632&end=12351176704\": read tcp 10.42.0.8:58424->10.42.2.115:10186: read: connection reset by peer"
2021-12-18T19:53:29.315869463Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:53:29Z" level=error msg="writeData for batchInterval: [ 3019072: 3019104]( 32) failed, err: writeData failed, err: Post \"http://10.42.2.115:10186/v1-ssync/writeData?begin=12366118912&end=12366249984\": read tcp 10.42.0.8:54508->10.42.2.115:10186: read: connection reset by peer"
2021-12-18T19:53:29.315894311Z time="2021-12-18T19:53:29Z" level=error msg="writeData for batchInterval: [ 3017088: 3017120]( 32) failed, err: writeData failed, err: Post \"http://10.42.2.115:10186/v1-ssync/writeData?begin=12357992448&end=12358123520\": read tcp 10.42.0.8:54506->10.42.2.115:10186: read: connection reset by peer"
2021-12-18T19:53:29.315902105Z time="2021-12-18T19:53:29Z" level=error msg="writeData for batchInterval: [ 3013824: 3013856]( 32) failed, err: writeData failed, err: Post \"http://10.42.2.115:10186/v1-ssync/writeData?begin=12344623104&end=12344754176\": read tcp 10.42.0.8:54504->10.42.2.115:10186: read: connection reset by peer"

......

2021-12-18T19:53:32.314620434Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:53:32Z" level=info msg="Sending file volume-snap-2952a86b-721e-4738-8985-0e3168266795.img to 10.42.2.115:10186"
2021-12-18T19:53:32.315203333Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:53:32Z" level=info msg="source file size: 68719476736, setting up directIO: true"
2021-12-18T19:53:32.315264139Z time="2021-12-18T19:53:32Z" level=warning msg="Failed to open server: 10.42.2.115:10186, Retrying..."
2021-12-18T19:55:07.624976559Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:55:07Z" level=error msg="writeData for batchInterval: [  744096:  744128]( 32) failed, err: writeData failed, err: Post \"http://10.42.2.115:10186/v1-ssync/writeData?begin=3047817216&end=3047948288\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2021-12-18T19:55:07.625961743Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:55:07Z" level=error msg="getServerChecksum batchInterval:[  739744:  739776]( 32) failed, err: getChecksum failed, err: Get \"http://10.42.2.115:10186/v1-ssync/getChecksum?begin=3029991424&end=3030122496\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2021-12-18T19:55:07.626020514Z time="2021-12-18T19:55:07Z" level=error msg="getServerChecksum batchInterval:[  683360:  683392]( 32) failed, err: getChecksum failed, err: Get \"http://10.42.2.115:10186/v1-ssync/getChecksum?begin=2799042560&end=2799173632\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2021-12-18T19:55:07.626032036Z time="2021-12-18T19:55:07Z" level=error msg="getServerChecksum batchInterval:[  719072:  719104]( 32) failed, err: getChecksum failed, err: Get \"http://10.42.2.115:10186/v1-ssync/getChecksum?begin=2945318912&end=2945449984\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

But the rebuilding finally succeeded, so the volume should work fine.

shuo-wu commented 2 years ago

@joshimoo @innobead @yasker Do we need to refactor/improve the rebuilding progress for this kind of case?

joshimoo commented 2 years ago

I agree that a refactor of our progress tracking in general would be nice to have. Good job on the investigation; we should also evaluate our timeout handling/strategy.
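As a rough sketch of the kind of timeout/retry strategy that could be evaluated for the writeData/getChecksum calls seen failing above (endpoint, sizes, and retry parameters are illustrative, not the actual ssync client):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// postWithRetry sends one request with a per-attempt timeout and retries a
// bounded number of times with exponential backoff, instead of surfacing a
// single connection reset or slow response as a hard failure.
func postWithRetry(url string, body []byte, attempts int) error {
	client := &http.Client{Timeout: 30 * time.Second}
	backoff := time.Second
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Post(url, "application/octet-stream", bytes.NewReader(body))
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode < 300 {
				return nil
			}
			err = fmt.Errorf("unexpected status %s", resp.Status)
		}
		lastErr = err
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

func main() {
	// Illustrative endpoint shaped like the failing requests in the logs above.
	err := postWithRetry("http://10.42.2.115:10186/v1-ssync/writeData?begin=0&end=131072", make([]byte, 131072), 3)
	fmt.Println(err)
}
```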

I previously mentioned that I don't really like the way we calculate progress for some operations (the snapshot stuff): the progress calculation shouldn't require counting, since we already know bytePosition and maxPosition. With no counting needed, this would also fix the above issue where we double-count N bytes.
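A minimal sketch of the position-based calculation suggested here (names are illustrative): because the position is monotonic and bounded by the maximum, re-processing an interval after a retry cannot push the result past 100%.

```go
package main

import "fmt"

// positionProgress derives progress from the highest byte offset reached so
// far rather than from a running byte count, so retransmitted intervals
// never inflate the result.
func positionProgress(bytePosition, maxPosition int64) int {
	if bytePosition > maxPosition {
		bytePosition = maxPosition
	}
	return int(bytePosition * 100 / maxPosition)
}

func main() {
	// Re-processing the same range only moves the position forward once.
	fmt.Println(positionProgress(12_351_176_704, 68_719_476_736)) // ~17
	fmt.Println(positionProgress(68_719_476_736, 68_719_476_736)) // 100
}
```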

innobead commented 2 years ago

Because it's finally working, I would say this is low priority, but we definitely need to fix it later.