longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG][v1.5.4-rc1] Recurring job failed to create/delete backups after node reboot #7854

Closed: yangchiu closed this issue 8 months ago

yangchiu commented 8 months ago

Describe the bug

  1. Create a volume and a backup recurring job with cron * * * * * (create a backup every minute) and retain = 1 (keep only one backup). See the sketch after these steps for what such a job does each cycle.
  2. Keep writing data to the volume.
  3. Reboot a node hosting a replica. After the node and the volume are back to work, the recurring job can still behave abnormally: many backups in the Error state remain in the backup list, and even long after the reboot event, a newly created backup can still randomly end up in the Error state with an error message like:
    Error:                      proxyServer=10.42.3.10:8501 destination=10.42.1.72:10000: 
    failed to backup snapshot vol-yde7-971b6006-6f18-4d12-85dc-c337a72d5220 to backup-5b793b248f96439d: 
    rpc error: code = Unknown desc = failed to create backup to s3://backupbucket@us-east-1/backupstore 
    for volume vol-yde769: rpc error: code = Unknown 
    desc = failed lock backupstore/volumes/e2/eb/vol-yde769/locks/lock-5242291888d04df6.lck type 1 acquisition
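
For context, here is a minimal sketch (hypothetical helper names, not Longhorn's actual recurring-job code) of what such a job effectively does each cycle. Because every cycle both creates a backup and then purges older backups down to retain = 1, the purge from one cycle can still be holding a deletion lock on the backupstore when the next cycle tries to take a backup lock:

```go
package main

import (
	"fmt"
	"time"
)

// createBackup and purgeOldBackups are placeholders for the real calls,
// which take a BACKUP lock and a DELETION lock on the backupstore
// respectively.
func createBackup(volume string) error {
	fmt.Println("creating backup for", volume)
	return nil
}

func purgeOldBackups(volume string, retain int) error {
	fmt.Println("purging backups for", volume, "keeping", retain)
	return nil
}

func main() {
	ticker := time.NewTicker(time.Minute) // cron "* * * * *"
	defer ticker.Stop()
	for range ticker.C { // runs forever, like the recurring job
		if err := createBackup("vol-yde769"); err != nil {
			fmt.Println("backup failed:", err)
			continue
		}
		// The retention purge starts right after the backup finishes.
		// With a 1-minute schedule there is little headroom before the
		// next backup attempt, so the two can overlap on the backupstore.
		if err := purgeOldBackups("vol-yde769", 1); err != nil {
			fmt.Println("purge failed:", err)
		}
	}
}
```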

To Reproduce

Check out PR https://github.com/longhorn/longhorn-tests/pull/1539 and run the test case Reboot Replica Node While Heavy Writing And Recurring Jobs Exist.

Expected behavior

Support bundle for troubleshooting

supportbundle_359c9225-c5aa-42d3-9af0-6a301e68c081_2024-02-06T05-54-16Z.zip

Environment

Additional context

innobead commented 8 months ago

I assume this also happened in 1.6.x. @yangchiu you missed the severity label.

innobead commented 8 months ago

@james-munson Please help with this. Thanks.

yangchiu commented 8 months ago

This could be difficult to reproduce. The test case can fail for reasons other than the one described in https://github.com/longhorn/longhorn/issues/7854#issue-2120056364.

It seems hard to predict when a backup creation will be triggered by a recurring job after a node reboot.

@james-munson Please don't try to reproduce it by running the test case, since its failures can vary; the test case needs to be refined first. Please look into the support bundle to see if there is any clue. If not, I suggest not digging into this until the PR is refined and merged. Sorry for the inconvenience.

james-munson commented 8 months ago

The error comes from backupstore: https://github.com/longhorn/backupstore/blob/9c90a8728ca2e1f95f10a4c6cf67cf7ac20c23e4/lock.go#L104-111

In logs/longhorn-system/instance-manager-a98079d3080d186b5dd5471b92655380/instance-manager.log

2024-02-06T05:53:24.019426177Z time="2024-02-06T05:53:24Z" level=info msg="Trying to acquire lock backupstore/volumes/e2/eb/vol-yde769/locks/lock-5242291888d04df6.lck" func="backupstore.(*FileLock).canAcquire" file="lock.go:65" lock="{ volume: vol-yde769, name: lock-5242291888d04df6, type: 1, acquired: false, serverTime: 2024-02-06 05:53:21 +0000 UTC }" pkg=backupstore
2024-02-06T05:53:24.019431667Z time="2024-02-06T05:53:24Z" level=info msg="backupstore volume vol-yde769 contains locks [{ volume: , name: lock-5242291888d04df6, type: 1, acquired: false, serverTime: 2024-02-06 05:53:21 +0000 UTC } { volume: , name: lock-e519ec1bc88a4080, type: 2, acquired: true, serverTime: 2024-02-06 05:53:17 +0000 UTC }]" func="backupstore.(*FileLock).canAcquire" file="lock.go:66" pkg=backupstore

So the BACKUP_LOCK (LockType 1) lock-5242291888d04df6 is rejected by the already-acquired and unexpired DELETION_LOCK (LockType 2) lock-e519ec1bc88a4080 from 7 seconds before. What's up with that one?
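
For readers unfamiliar with the backupstore locking scheme, the compatibility rule is roughly: a pending lock can only be acquired if no lock of a different type is already acquired and unexpired for the same volume. A simplified sketch of that idea (illustrative types, field names, and validity window, not the real lock.go code, which also orders locks by server time):

```go
package main

import (
	"fmt"
	"time"
)

type LockType int

const (
	BackupLock   LockType = 1 // taken while creating a backup
	DeletionLock LockType = 2 // taken while deleting/purging backups
)

type fileLock struct {
	Name       string
	Type       LockType
	Acquired   bool
	ServerTime time.Time
}

// canAcquire reports whether candidate may be taken given the other lock
// files currently present for the same volume. lockDuration stands in for
// the backupstore's lock validity window (value here is an assumption).
func canAcquire(candidate fileLock, existing []fileLock, lockDuration time.Duration) bool {
	now := time.Now()
	for _, l := range existing {
		if l.Name == candidate.Name {
			continue
		}
		expired := now.Sub(l.ServerTime) > lockDuration
		if l.Acquired && !expired && l.Type != candidate.Type {
			// an acquired, unexpired lock of a different type blocks us
			return false
		}
	}
	return true
}

func main() {
	deletion := fileLock{Name: "lock-e519ec1bc88a4080", Type: DeletionLock, Acquired: true, ServerTime: time.Now().Add(-7 * time.Second)}
	backup := fileLock{Name: "lock-5242291888d04df6", Type: BackupLock}
	// Prints false: the DELETION lock taken ~7 seconds earlier is still
	// valid, so the BACKUP lock is rejected, as in the logs above.
	fmt.Println(canAcquire(backup, []fileLock{deletion}, 150*time.Second))
}
```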

james-munson commented 8 months ago

There are 16 such collisions between a DELETION_LOCK and a BACKUP_LOCK for the two volumes over the previous 90 minutes, each leading to a residual failed backup CR. So either the deletion is taking surprisingly long, or the recurrence schedule of once a minute is pushing too hard.

Is the node restart having an impact on this, or does it happen even without that? The last node restart was at 2024-02-06T04:34:17, and its rebuild finished shortly after:

grep healthyAt replicas.yaml 
      healthyAt: "2024-02-06T04:16:50Z"
      healthyAt: "2024-02-06T04:16:50Z"
      healthyAt: "2024-02-06T04:16:41Z"
      healthyAt: "2024-02-06T04:35:16Z"

and it's hard to see how it would still be having ripple effects an hour and a half later.

The times of the failures are well spread out

grep -h "failed lock" logs/longhorn-system/instance-manager*/* | grep "Failed to create backup" | sort | awk '{ print $1 }'
2024-02-06T04:29:16.968897204Z
2024-02-06T04:38:26.856587527Z
2024-02-06T04:47:20.996068506Z
2024-02-06T04:49:37.315667981Z
2024-02-06T04:58:05.069025238Z
2024-02-06T05:07:29.771652195Z
2024-02-06T05:18:18.749408919Z
2024-02-06T05:18:22.223025461Z
2024-02-06T05:26:30.171523462Z
2024-02-06T05:28:18.806953111Z
2024-02-06T05:34:57.077060093Z
2024-02-06T05:36:23.822837034Z
2024-02-06T05:43:15.192112147Z
2024-02-06T05:45:02.002675977Z
2024-02-06T05:53:19.929305574Z
2024-02-06T05:53:24.058831215Z

and given that backups were attempted every minute over those 90 minutes, the vast majority of them succeeded.

yangchiu commented 8 months ago

I think it's fair that lock contention causes backup creation and deletion to fail. If it makes sense to you as well, feel free to close this issue! cc @innobead

james-munson commented 8 months ago

Ok. I think I'll compare it to the issues seen in #7744 first.

innobead commented 8 months ago

Removing backport labels first, as this could be related to #7744.

james-munson commented 8 months ago

This is not the same as #7744. The symptom of lock collision is the same, but that one was caused both by a race in the test script itself and by another race in the cleanup of the backupstore when the setting itself was cleared.

This one is just a simple race between a new backup and the retention-policy-based purge kicked off by the previous backup such a short time before. In no case was the existing DELETION_LOCK older than 15 seconds at the time of collision.

I think this kind of collision was anticipated, looking at https://github.com/longhorn/backupstore/blob/b375644f1c5759cf0e8e91263ccf7dda09516004/lock.go#L104
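
That path does not wait forever: acquisition is retried for a bounded time and then fails hard, which is why the collisions above surface as `failed lock ... type 1 acquisition` errors on the backup rather than as a long stall. A rough sketch of that shape (not the real Lock() implementation; timeout and interval values are placeholders):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// acquireWithTimeout retries tryAcquire until it succeeds or the deadline
// passes. Sketch only: the real backupstore Lock() differs in detail, but
// the overall shape (bounded retries, then a hard error) is what turns a
// lock collision into a failed backup instead of an indefinite wait.
func acquireWithTimeout(tryAcquire func() bool, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if tryAcquire() {
			return nil
		}
		time.Sleep(interval)
	}
	return errors.New("failed lock acquisition: conflicting locks did not clear in time")
}

func main() {
	// Always-failing tryAcquire to demonstrate the timeout path.
	err := acquireWithTimeout(func() bool { return false }, 3*time.Second, 500*time.Millisecond)
	fmt.Println(err)
}
```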

Bottom line is, I think this is behaving as designed. Pushing a backup and a purge once a minute may be a little aggressive. I will close this issue, although maybe the test can be slightly modified to recur less often.

yangchiu commented 7 months ago

> Bottom line is, I think this is behaving as designed. Pushing a backup and a purge once a minute may be a little aggressive. I will close this issue, although maybe the test can be slightly modified to recur less often.

The test case will be refined in https://github.com/longhorn/longhorn-tests/pull/1539