longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] Degraded volume rebuilding progress exceeds 100% according to UI and volume mount fails #3435

Open t1murl opened 2 years ago

t1murl commented 2 years ago

Describe the bug

The rebuilding process of one replica greatly exceeds 100% progress.

[Screenshot: Longhorn UI showing the degraded replica's rebuilding progress far above 100%]

While in this state, the nodes keep transmitting data at the maximum available bandwidth.

After roughly 50 minutes the replica is shown as healthy again in the UI.

I'm not sure whether this is purely a UI bug, as data was actually being transmitted the entire time.

To Reproduce

Steps to reproduce the behavior:

After deleting a Prometheus pod from the StatefulSet of the kube-prometheus-stack, the Longhorn persistent volume mount failed with:

MountVolume.MountDevice failed for volume "pvc-856e2327-6920-4a44-bd43-a65b7aa83692" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-856e2327-6920-4a44-bd43-a65b7aa83692 but could not correct them:
fsck from util-linux 2.34
/dev/longhorn/pvc-856e2327-6920-4a44-bd43-a65b7aa83692 contains a file system with errors, check forced.
/dev/longhorn/pvc-856e2327-6920-4a44-bd43-a65b7aa83692: Resize inode not valid.
/dev/longhorn/pvc-856e2327-6920-4a44-bd43-a65b7aa83692: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. (i.e., without -a or -p options)

At this point all three replicas were healthy.

Deleting the replica on the node where the Prometheus pod was supposed to be scheduled triggered the rebuilding process as expected.

However, the rebuilding takes far longer than the baseline time according to the UI. Even after the rebuilding is reported successful, the volume mount for the affected pods fails.

Recovery was only possible after recreating the Prometheus StatefulSet and recreating the affected volume.

Expected behavior

The rebuilding process finishes as expected, the UI displays the progress correctly, and the volume mount succeeds.

Log or Support bundle

A support bundle was created after the rebuilding finally finished. I'd be glad to send or upload it. Please provide me with, e.g., an e-mail address, as I'd prefer not to upload the bundle publicly.

Environment

jenting commented 2 years ago

Could you please check whether the volume is still able to be mounted from a single replica? Please refer to the doc.

If not, please run fsck to fix the underlying block device.
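For illustration only, a minimal sketch of forcing a manual filesystem check on the Longhorn block device (the device path is taken from the error above; the volume would need to be attached in maintenance mode and not mounted). Invoking it from Go via os/exec is just a sketch, not Longhorn's own tooling:

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Device path taken from the error message above.
	dev := "/dev/longhorn/pvc-856e2327-6920-4a44-bd43-a65b7aa83692"

	// Force a non-interactive check; -y answers "yes" to all repair prompts.
	// fsck exits non-zero when errors were found and corrected, so a non-nil
	// err does not necessarily mean the repair failed.
	out, err := exec.Command("fsck", "-f", "-y", dev).CombinedOutput()
	fmt.Printf("%s\n", out)
	if err != nil {
		fmt.Printf("fsck exited with: %v\n", err)
	}
}
```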

For the progress, I think it's a longhorn-manager backend issue.

jenting commented 2 years ago

@smallteeths just to double-confirm, the replica rebuilding progress bar is rendered directly from the longhorn-manager backend response, right?

smallteeths commented 2 years ago

Yes, I get it from the rebuildStatus in the volume and find the corresponding replica. The progress bar is calculated from rebuildStatus[current replica][x].progress. @jenting

It would be best to look at the data returned for the volume to determine the issue.
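As a rough sketch of that lookup, based only on the description above (type and field names are illustrative, not the actual UI or manager code):

```go
package main

import "fmt"

// RebuildEntry mirrors the shape described above: one entry per rebuilding
// replica, with a progress percentage reported by the backend.
type RebuildEntry struct {
	Replica  string
	Progress int
}

// progressFor returns the value shown in the progress bar for a given
// replica, or false if the backend reports no rebuild entry for it.
func progressFor(rebuildStatus []RebuildEntry, replica string) (int, bool) {
	for _, e := range rebuildStatus {
		if e.Replica == replica {
			return e.Progress, true
		}
	}
	return 0, false
}

func main() {
	status := []RebuildEntry{{Replica: "pvc-...-r-5edc2dbc", Progress: 201}}
	if p, ok := progressFor(status, "pvc-...-r-5edc2dbc"); ok {
		// A value above 100 here is exactly what the screenshot shows:
		// the UI renders whatever the backend reports.
		fmt.Printf("progress bar: %d%%\n", p)
	}
}
```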

jenting commented 2 years ago

Could you please provide us with the support bundle? Attach it here or send it to longhorn-support-bundle@suse.com, thanks.

jenting commented 2 years ago

Leaving a note here: the progress is calculated at the file-syncing server (receiver side). It looks like the data sizes on the sender and receiver sides are mismatched, which is why the progress exceeds 100%.
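A simplified sketch of how a receiver-side byte counter can drift past 100% when intervals are retransmitted after connection resets, as in the logs below (names are illustrative, not the actual sync-agent code):

```go
package main

import "fmt"

// syncProgress models a receiver that tracks progress by summing the bytes
// of every interval it writes, relative to the size the sender announced.
type syncProgress struct {
	expectedSize int64 // total size announced by the sender
	processed    int64 // bytes written so far, including retransmissions
}

func (p *syncProgress) record(intervalBytes int64) int {
	p.processed += intervalBytes
	return int(p.processed * 100 / p.expectedSize)
}

func main() {
	p := &syncProgress{expectedSize: 1000}

	// First pass: the full range is written once.
	fmt.Println(p.record(1000)) // 100

	// After a connection reset the sender restarts and resends intervals;
	// a naive byte counter counts them again, so the progress keeps climbing.
	fmt.Println(p.record(600)) // 160
}
```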

t1murl commented 2 years ago

Could you please check whether the volume is still able to be mounted from a single replica? Please refer to the doc.

If not, please run fsck to fix the underlying block device.

For the progress, I think it's a longhorn-manager backend issue.

Hi @jenting, sorry for my delayed response. Since this cluster is used for development, it has already been torn down and recreated. I'll gladly try the steps mentioned in the docs if this specific issue reoccurs.

Could you please provide us with the support bundle? Attach to here or send to longhorn-support-bundle@suse.com, thx.

I've just sent the support bundle to the mentioned e-mail address.

Thank you for looking into this issue 👍. If there's anything else I can provide, please let me know.

shuo-wu commented 2 years ago

I just checked the replica logs in the support bundle. There are lots of connection resets/timeouts during the rebuilding. I suspect this may be the culprit:

2021-12-18T19:47:14.529782911Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:47:14Z" level=info msg="Sending file volume-snap-2952a86b-721e-4738-8985-0e3168266795.img to 10.42.2.115:10186"
2021-12-18T19:47:14.529807097Z time="2021-12-18T19:47:14Z" level=info msg="source file size: 68719476736, setting up directIO: true"
2021-12-18T19:53:29.315086685Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:53:29Z" level=error msg="writeData for batchInterval: [ 3015392: 3015424]( 32) failed, err: writeData failed, err: Post \"http://10.42.2.115:10186/v1-ssync/writeData?begin=12351045632&end=12351176704\": read tcp 10.42.0.8:58424->10.42.2.115:10186: read: connection reset by peer"
2021-12-18T19:53:29.315869463Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:53:29Z" level=error msg="writeData for batchInterval: [ 3019072: 3019104]( 32) failed, err: writeData failed, err: Post \"http://10.42.2.115:10186/v1-ssync/writeData?begin=12366118912&end=12366249984\": read tcp 10.42.0.8:54508->10.42.2.115:10186: read: connection reset by peer"
2021-12-18T19:53:29.315894311Z time="2021-12-18T19:53:29Z" level=error msg="writeData for batchInterval: [ 3017088: 3017120]( 32) failed, err: writeData failed, err: Post \"http://10.42.2.115:10186/v1-ssync/writeData?begin=12357992448&end=12358123520\": read tcp 10.42.0.8:54506->10.42.2.115:10186: read: connection reset by peer"
2021-12-18T19:53:29.315902105Z time="2021-12-18T19:53:29Z" level=error msg="writeData for batchInterval: [ 3013824: 3013856]( 32) failed, err: writeData failed, err: Post \"http://10.42.2.115:10186/v1-ssync/writeData?begin=12344623104&end=12344754176\": read tcp 10.42.0.8:54504->10.42.2.115:10186: read: connection reset by peer"

......

2021-12-18T19:53:32.314620434Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:53:32Z" level=info msg="Sending file volume-snap-2952a86b-721e-4738-8985-0e3168266795.img to 10.42.2.115:10186"
2021-12-18T19:53:32.315203333Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:53:32Z" level=info msg="source file size: 68719476736, setting up directIO: true"
2021-12-18T19:53:32.315264139Z time="2021-12-18T19:53:32Z" level=warning msg="Failed to open server: 10.42.2.115:10186, Retrying..."
2021-12-18T19:55:07.624976559Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:55:07Z" level=error msg="writeData for batchInterval: [  744096:  744128]( 32) failed, err: writeData failed, err: Post \"http://10.42.2.115:10186/v1-ssync/writeData?begin=3047817216&end=3047948288\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2021-12-18T19:55:07.625961743Z [pvc-856e2327-6920-4a44-bd43-a65b7aa83692-r-5edc2dbc] time="2021-12-18T19:55:07Z" level=error msg="getServerChecksum batchInterval:[  739744:  739776]( 32) failed, err: getChecksum failed, err: Get \"http://10.42.2.115:10186/v1-ssync/getChecksum?begin=3029991424&end=3030122496\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2021-12-18T19:55:07.626020514Z time="2021-12-18T19:55:07Z" level=error msg="getServerChecksum batchInterval:[  683360:  683392]( 32) failed, err: getChecksum failed, err: Get \"http://10.42.2.115:10186/v1-ssync/getChecksum?begin=2799042560&end=2799173632\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2021-12-18T19:55:07.626032036Z time="2021-12-18T19:55:07Z" level=error msg="getServerChecksum batchInterval:[  719072:  719104]( 32) failed, err: getChecksum failed, err: Get \"http://10.42.2.115:10186/v1-ssync/getChecksum?begin=2945318912&end=2945449984\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

But the rebuilding finally succeeded, so the volume should work fine.

shuo-wu commented 2 years ago

@joshimoo @innobead @yasker Do we need to refactor/improve the rebuilding progress for this kind of case?

joshimoo commented 2 years ago

I agree that a refactor of our progress tracking in general would be nice to have. Good job on the investigation; we should also evaluate our timeout handling/strategy.
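As a rough sketch of the kind of timeout/retry strategy that could be evaluated for the writeData/getChecksum calls seen failing above (endpoint, sizes, and retry parameters are illustrative, not the actual ssync client):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// postWithRetry sends one request with a per-attempt timeout and retries a
// bounded number of times with exponential backoff, instead of surfacing a
// single connection reset or slow response as a hard failure.
func postWithRetry(url string, body []byte, attempts int) error {
	client := &http.Client{Timeout: 30 * time.Second}
	backoff := time.Second
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Post(url, "application/octet-stream", bytes.NewReader(body))
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode < 300 {
				return nil
			}
			err = fmt.Errorf("unexpected status %s", resp.Status)
		}
		lastErr = err
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

func main() {
	// Illustrative endpoint shaped like the failing requests in the logs above.
	err := postWithRetry("http://10.42.2.115:10186/v1-ssync/writeData?begin=0&end=131072", make([]byte, 131072), 3)
	fmt.Println(err)
}
```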

I previously mentioned that I don't really like the way we calculate progress for some operations (the snapshot stuff): the progress calculation shouldn't require counting, since we already know bytePosition and maxPosition. With no counting needed, this would also fix the above issue where we double-count N bytes.
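A minimal sketch of the position-based calculation suggested here (names are illustrative): because the position is monotonic and bounded by the maximum, re-processing an interval after a retry cannot push the result past 100%.

```go
package main

import "fmt"

// positionProgress derives progress from the highest byte offset reached so
// far rather than from a running byte count, so retransmitted intervals
// never inflate the result.
func positionProgress(bytePosition, maxPosition int64) int {
	if bytePosition > maxPosition {
		bytePosition = maxPosition
	}
	return int(bytePosition * 100 / maxPosition)
}

func main() {
	// Re-processing the same range only moves the position forward once.
	fmt.Println(positionProgress(12_351_176_704, 68_719_476_736)) // ~17
	fmt.Println(positionProgress(68_719_476_736, 68_719_476_736)) // 100
}
```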

innobead commented 2 years ago

Because it's finally working, I would say this is low priority, but we definitely need to fix it later.