I assume this also happened in 1.6.x. @yangchiu you missed the severity label.
@james-munson Please help with this. Thanks.
This could be difficult to reproduce. The test case could fail for reasons other than the one described in https://github.com/longhorn/longhorn/issues/7854#issue-2120056364.
It seems hard to predict when a backup creation will be triggered by a recurring job after a node reboot.
@james-munson Please don't try to reproduce it by running the test case, since the failure mode can vary. The test case needs to be refined first. Please look into the support bundle to see if there is any clue; if not, I suggest not digging into this and waiting until the PR is refined and merged. Sorry for the inconvenience.
The error comes from backupstore https://github.com/longhorn/backupstore/blob/9c90a8728ca2e1f95f10a4c6cf67cf7ac20c23e4/lock.go#L104-111
In logs/longhorn-system/instance-manager-a98079d3080d186b5dd5471b92655380/instance-manager.log:
2024-02-06T05:53:24.019426177Z time="2024-02-06T05:53:24Z" level=info msg="Trying to acquire lock backupstore/volumes/e2/eb/vol-yde769/locks/lock-5242291888d04df6.lck" func="backupstore.(*FileLock).canAcquire" file="lock.go:65" lock="{ volume: vol-yde769, name: lock-5242291888d04df6, type: 1, acquired: false, serverTime: 2024-02-06 05:53:21 +0000 UTC }" pkg=backupstore
2024-02-06T05:53:24.019431667Z time="2024-02-06T05:53:24Z" level=info msg="backupstore volume vol-yde769 contains locks [{ volume: , name: lock-5242291888d04df6, type: 1, acquired: false, serverTime: 2024-02-06 05:53:21 +0000 UTC } { volume: , name: lock-e519ec1bc88a4080, type: 2, acquired: true, serverTime: 2024-02-06 05:53:17 +0000 UTC }]" func="backupstore.(*FileLock).canAcquire" file="lock.go:66" pkg=backupstore
So the BACKUP_LOCK (LockType 1) lock-5242291888d04df6 is rejected by the already-acquired and unexpired DELETION_LOCK (LockType 2) lock-e519ec1bc88a4080 from 7 seconds before. What's up with that one?
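To make the mechanics concrete, here is a minimal illustrative sketch of that kind of check. It is not the actual backupstore code; the struct fields, constants, and the 150-second validity window are assumptions for illustration. The point is simply that an already-acquired, unexpired lock of a conflicting type on the same volume blocks a new lock from being acquired.

```go
// Illustrative sketch only -- not the backupstore implementation. It mimics
// the idea behind canAcquire: a candidate lock is rejected while an
// already-acquired, unexpired lock of a conflicting type exists on the volume.
package main

import (
	"fmt"
	"time"
)

const (
	backupLockType   = 1
	deletionLockType = 2
	// Assumed lock validity window for this sketch.
	lockValidity = 150 * time.Second
)

type fileLock struct {
	name       string
	lockType   int
	acquired   bool
	serverTime time.Time
}

func canAcquire(candidate fileLock, existing []fileLock, now time.Time) bool {
	for _, l := range existing {
		if l.name == candidate.name {
			continue // skip the candidate's own lock file
		}
		expired := now.Sub(l.serverTime) > lockValidity
		conflicting := l.lockType != candidate.lockType
		if l.acquired && !expired && conflicting {
			return false // e.g. a live DELETION_LOCK blocks a new BACKUP_LOCK
		}
	}
	return true
}

func main() {
	now := time.Date(2024, 2, 6, 5, 53, 24, 0, time.UTC)
	existing := []fileLock{
		{name: "lock-e519ec1bc88a4080", lockType: deletionLockType, acquired: true,
			serverTime: time.Date(2024, 2, 6, 5, 53, 17, 0, time.UTC)},
	}
	backup := fileLock{name: "lock-5242291888d04df6", lockType: backupLockType,
		serverTime: time.Date(2024, 2, 6, 5, 53, 21, 0, time.UTC)}
	// Prints false: the deletion lock from 7 seconds earlier is still valid.
	fmt.Println("backup lock can acquire:", canAcquire(backup, existing, now))
}
```

Fed the two lock records from the log above, the check rejects the BACKUP_LOCK, which is exactly the failure we see.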
There are 16 such collisions between a DELETION_LOCK and a BACKUP_LOCK for the two volumes over the previous 90 minutes, each leading to a residual failed backup CR. So either the deletion is taking surprisingly long, or the recurrence schedule of once a minute is pushing too hard.
Is the node restart having an impact on this, or does it happen even without that? The last node restart was at 2024-02-06T04:34:17, and its rebuild finished shortly after:
grep healthyAt replicas.yaml
healthyAt: "2024-02-06T04:16:50Z"
healthyAt: "2024-02-06T04:16:50Z"
healthyAt: "2024-02-06T04:16:41Z"
healthyAt: "2024-02-06T04:35:16Z"
and it's hard to see how it would still be having ripple effects an hour and a half later.
The times of the failures are well spread out:
grep -h "failed lock" logs/longhorn-system/instance-manager*/* | grep "Failed to create backup" | sort | awk '{ print $1 }'
2024-02-06T04:29:16.968897204Z
2024-02-06T04:38:26.856587527Z
2024-02-06T04:47:20.996068506Z
2024-02-06T04:49:37.315667981Z
2024-02-06T04:58:05.069025238Z
2024-02-06T05:07:29.771652195Z
2024-02-06T05:18:18.749408919Z
2024-02-06T05:18:22.223025461Z
2024-02-06T05:26:30.171523462Z
2024-02-06T05:28:18.806953111Z
2024-02-06T05:34:57.077060093Z
2024-02-06T05:36:23.822837034Z
2024-02-06T05:43:15.192112147Z
2024-02-06T05:45:02.002675977Z
2024-02-06T05:53:19.929305574Z
2024-02-06T05:53:24.058831215Z
And if backups were indeed created every minute over those 90 minutes, the vast majority of them succeeded.
I think it's fair to say that lock contention causes backup creation and deletion to fail. If it makes sense to you as well, feel free to close this issue! cc @innobead
Ok. I think I'll compare it to the issues seen in #7744 first.
Removing backport labels first, as this could be related to #7744.
This is not the same as #7744. The symptom of lock collision is the same, but that one was caused both by a race in the test script itself, and another race in the cleanup of the backupstore when the setting itself was cleared.
This one is just a simple race between a new backup and the retention-policy-based purge triggered by the previous backup such a short time before. In no case was the existing DELETION_LOCK older than 15 seconds at the time of collision.
I think this kind of collision was anticipated, looking at https://github.com/longhorn/backupstore/blob/b375644f1c5759cf0e8e91263ccf7dda09516004/lock.go#L104
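As a side note, if a caller ever wanted to tolerate this designed-in contention instead of surfacing a failed backup, a simple retry with backoff would cover it. This is only a sketch with hypothetical helper names, not Longhorn's actual API:

```go
// Sketch only: createBackup is a hypothetical stand-in for whatever call hits
// the backupstore; it is not a real Longhorn or backupstore function.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errLockHeld = errors.New("failed to acquire lock: deletion lock still held")

// createBackup pretends the previous purge's DELETION_LOCK is still valid for
// the first couple of attempts, then succeeds once the lock has expired.
func createBackup(attempt int) error {
	if attempt < 2 {
		return errLockHeld
	}
	return nil
}

func main() {
	backoff := 5 * time.Second
	for attempt := 0; attempt < 5; attempt++ {
		err := createBackup(attempt)
		if err == nil {
			fmt.Println("backup created on attempt", attempt)
			return
		}
		if !errors.Is(err, errLockHeld) {
			fmt.Println("giving up:", err)
			return
		}
		fmt.Println("lock contention, retrying in", backoff)
		time.Sleep(backoff)
		backoff *= 2
	}
	fmt.Println("still colliding after retries; leaving the backup in Error state")
}
```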
Bottom line is, I think this is behaving as designed. Pushing a backup and a purge once a minute may be a little aggressive. I will close this issue, although maybe the test can be slightly modified to recur less often.
The test case will be refined in https://github.com/longhorn/longhorn-tests/pull/1539
Describe the bug

A recurring backup job is configured with cron schedule * * * * * to create a backup every minute and retain = 1 to only keep one backup. An Error backup remains in the backup list: even a very long time after the reboot event, a newly created backup could still randomly run into the Error state with an error message.

To Reproduce
Check out PR https://github.com/longhorn/longhorn-tests/pull/1539 and run test case Reboot Replica Node While Heavy Writing And Recurring Jobs Exist.
Expected behavior
Support bundle for troubleshooting
supportbundle_359c9225-c5aa-42d3-9af0-6a301e68c081_2024-02-06T05-54-16Z.zip
Environment
Additional context