Closed — yangchiu closed this issue 1 month ago
Is it reproducible in v1.6.2?
UPDATE:
v1.6.2: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/7269/console
master-head: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/7273/console
I would like to fix this issue with this ticket: make backup creation wait until there is no backup being deleted, and add the progress time.
This is going to resolve all kinds of deletion-lock timing issues, which make the e2e tests flaky.
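The proposed wait boils down to polling the volume's backups until none is in a deleting state before starting a new one. A minimal sketch, assuming a hypothetical caller-supplied `list_backups()` helper (not Longhorn's actual API):

```python
import time

def wait_until_no_backup_deleting(list_backups, timeout=300, interval=2):
    """Block until no backup reported by list_backups() is being deleted.

    list_backups is a caller-supplied callable returning a list of dicts
    with a "state" key (hypothetical helper, not Longhorn's real API).
    """
    deadline = time.time() + timeout
    deleting = []
    while time.time() < deadline:
        deleting = [b for b in list_backups() if b.get("state") == "Deleting"]
        if not deleting:
            # No deletion in flight, so no deletion lock can be held.
            return True
        time.sleep(interval)
    raise TimeoutError("backups still being deleted: %s" % deleting)
```

With this gate in front of backup creation, a new backup never races against a deletion lock that is still being released.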
The issue is rare and needs more time to investigate. Move it to v1.8.0. cc @innobead
# Complete deleting the backup: `backup-26c512b1819948f7` volume=longhorn-testvol-xza8p1 in the backupstore using the binary command
2024-07-18T07:12:41.799481473Z time="2024-07-18T07:12:41Z" level=info msg="Complete deleting backup s3://backupbucket@us-east-1/backupstore?backup=backup-26c512b1819948f7&volume=longhorn-testvol-xza8p1" func="engineapi.(*BackupTargetClient).BackupDelete" file="backups.go:304"
2024-07-18T07:12:41.802957380Z time="2024-07-18T07:12:41Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=BackupVolume, namespace: longhorn-system, name: longhorn-testvol-xza8p1, operation: UPDATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [\"longhorn.io\"]}]" func="admission.(*Handler).admit" file="admission.go:115" service=admissionWebhook
# I think this is the dummy backup which needs to be pulled
2024-07-18T07:12:41.841410074Z time="2024-07-18T07:12:41Z" level=info msg="Found 1 backups in the backup target that do not exist in the cluster and need to be pulled" func="controller.(*BackupVolumeController).reconcile" file="backup_volume_controller.go:317" backupVolume=longhorn-testvol-xza8p1 controller=longhorn-backup-volume node=ip-10-0-2-112
2024-07-18T07:12:41.875645932Z time="2024-07-18T07:12:41Z" level=warning msg="Failed to get backupInfo from remote backup target" func="controller.(*BackupVolumeController).reconcile" file="backup_volume_controller.go:327" backup=backup-dummy backupVolume=longhorn-testvol-xza8p1 backuptarget="s3://backupbucket@us-east-1/backupstore?backup=backup-dummy&volume=longhorn-testvol-xza8p1" backupvolume=longhorn-testvol-xza8p1 controller=longhorn-backup-volume error="error getting backup config s3://backupbucket@us-east-1/backupstore?backup=backup-dummy&volume=longhorn-testvol-xza8p1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.7.x-head/longhorn [/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.7.x-head/longhorn backup inspect s3://backupbucket@us-east-1/backupstore?backup=backup-dummy&volume=longhorn-testvol-xza8p1], output invalid character '\\'' looking for beginning of object key string\n, stderr time=\"2024-07-18T07:12:41Z\" level=info msg=\"Loaded driver for s3://backupbucket@us-east-1/backupstore\" func=s3.initFunc file=\"s3.go:73\" pkg=s3\ntime=\"2024-07-18T07:12:41Z\" level=info msg=\"Loading config in backupstore\" func=backupstore.LoadConfigInBackupStore file=\"config.go:56\" filepath=backupstore/volumes/96/04/longhorn-testvol-xza8p1/volume.cfg kind=s3 object=config pkg=backupstore reason=start\ntime=\"2024-07-18T07:12:41Z\" level=info msg=\"Loaded config in backupstore\" func=backupstore.LoadConfigInBackupStore file=\"config.go:67\" filepath=backupstore/volumes/96/04/longhorn-testvol-xza8p1/volume.cfg kind=s3 object=config pkg=backupstore reason=complete\ntime=\"2024-07-18T07:12:41Z\" level=info msg=\"Loading config in backupstore\" func=backupstore.LoadConfigInBackupStore file=\"config.go:56\" filepath=backupstore/volumes/96/04/longhorn-testvol-xza8p1/backups/backup_backup-dummy.cfg kind=s3 object=config pkg=backupstore reason=start\ntime=\"2024-07-18T07:12:41Z\" level=info msg=\"Failed to load backup in backupstore\" func=backupstore.InspectBackup file=\"inspect.go:55\" backup=backup-dummy event=list object=backup pkg=backupstore reason=fallback volume=longhorn-testvol-xza8p1\ntime=\"2024-07-18T07:12:41Z\" level=error msg=\"invalid character '\\\\'' looking for beginning of object key string\" func=main.ResponseLogAndError file=\"main.go:47\"\n: exit status 1" node=ip-10-0-2-112
2024-07-18T07:12:41.880293705Z time="2024-07-18T07:12:41Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=Backup, namespace: longhorn-system, name: backup-dummy, operation: CREATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [\"longhorn.io\"]},{\"op\": \"replace\", \"path\": \"/spec/labels\", \"value\": {\"longhorn.io/volume-access-mode\":\"rwo\"}},{\"op\": \"replace\", \"path\": \"/spec/backupMode\", \"value\": \"incremental\"}]" func="admission.(*Handler).admit" file="admission.go:115" service=admissionWebhook
2024-07-18T07:12:41.884040059Z W0718 07:12:41.883937 1 warnings.go:70] metadata.finalizers: "longhorn.io": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers
# I don't know what this is, but this backup deletion might introduce the lock
2024-07-18T07:12:41.884199865Z time="2024-07-18T07:12:41Z" level=info msg="Found 1 backups in the backup target (typo: should be "cluster") that do not exist in the backup target and need to be deleted" func="controller.(*BackupVolumeController).reconcile" file="backup_volume_controller.go:354" backupVolume=longhorn-testvol-xza8p1 controller=longhorn-backup-volume node=ip-10-0-2-112
# webhook got `backup-3f1dfc6f3d2e46f8` created, it failed later at "2024-07-18T07:12:56Z"
2024-07-18T07:12:44.055002058Z time="2024-07-18T07:12:44Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=Backup, namespace: longhorn-system, name: backup-3f1dfc6f3d2e46f8, operation: CREATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [\"longhorn.io\"]},{\"op\": \"replace\", \"path\": \"/spec/labels\", \"value\": {\"longhorn.io/volume-access-mode\":\"rwo\"}},{\"op\": \"replace\", \"path\": \"/spec/backupMode\", \"value\": \"incremental\"}]" func="admission.(*Handler).admit" file="admission.go:115" service=admissionWebhook
2024-07-18T07:12:44.168350925Z time="2024-07-18T07:12:44Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=VolumeAttachment, namespace: longhorn-system, name: longhorn-testvol-xza8p1, operation: UPDATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [\"longhorn.io\"]}]" func="admission.(*Handler).admit" file="admission.go:115" service=admissionWebhook
2024-07-18T07:12:44.995616805Z time="2024-07-18T07:12:44Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=Snapshot, namespace: longhorn-system, name: e47ebae9-8b74-4266-b835-6dbe90c0e0d4, operation: CREATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [\"longhorn.io\"]},{\"op\": \"replace\", \"path\": \"/spec/dataEngine\", \"value\": \"v1\"},{\"op\": \"replace\", \"path\": \"/metadata/labels\", \"value\": {\"longhornvolume\":\"longhorn-testvol-xza8p1\"}},{\"op\": \"replace\", \"path\": \"/metadata/ownerReferences\", \"value\": [{\"apiVersion\":\"longhorn.io/v1beta2\",\"kind\":\"Volume\",\"name\":\"longhorn-testvol-xza8p1\",\"uid\":\"ca989265-0c44-416a-b85c-96e6dd7c30fd\"}]}]" func="admission.(*Handler).admit" file="admission.go:115" service=admissionWebhook
2024-07-18T07:12:45.107208069Z time="2024-07-18T07:12:45Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=Snapshot, namespace: longhorn-system, name: 82fcc2bb-d1d0-462d-8c11-d9b299c9fe4e, operation: CREATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [\"longhorn.io\"]},{\"op\": \"replace\", \"path\": \"/spec/dataEngine\", \"value\": \"v1\"},{\"op\": \"replace\", \"path\": \"/metadata/labels\", \"value\": {\"longhornvolume\":\"longhorn-testvol-xza8p1\"}},{\"op\": \"replace\", \"path\": \"/metadata/ownerReferences\", \"value\": [{\"apiVersion\":\"longhorn.io/v1beta2\",\"kind\":\"Volume\",\"name\":\"longhorn-testvol-xza8p1\",\"uid\":\"ca989265-0c44-416a-b85c-96e6dd7c30fd\"}]}]" func="admission.(*Handler).admit" file="admission.go:115" service=admissionWebhook
2024-07-18T07:12:45.118545467Z time="2024-07-18T07:12:45Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=Snapshot, namespace: longhorn-system, name: 15e86acc-daa4-44c5-91eb-302a249c36d0, operation: CREATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [\"longhorn.io\"]},{\"op\": \"replace\", \"path\": \"/spec/dataEngine\", \"value\": \"v1\"},{\"op\": \"replace\", \"path\": \"/metadata/labels\", \"value\": {\"longhornvolume\":\"longhorn-testvol-xza8p1\"}},{\"op\": \"replace\", \"path\": \"/metadata/ownerReferences\", \"value\": [{\"apiVersion\":\"longhorn.io/v1beta2\",\"kind\":\"Volume\",\"name\":\"longhorn-testvol-xza8p1\",\"uid\":\"ca989265-0c44-416a-b85c-96e6dd7c30fd\"}]}]" func="admission.(*Handler).admit" file="admission.go:115" service=admissionWebhook
2024-07-18T07:12:54.895440547Z time="2024-07-18T07:12:54Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=BackupVolume, namespace: longhorn-system, name: longhorn-testvol-xza8p1, operation: UPDATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [\"longhorn.io\"]}]" func="admission.(*Handler).admit" file="admission.go:115" service=admissionWebhook
# Same as above, but triggered again
# I don't know what this is, but this backup deletion might introduce the lock
2024-07-18T07:12:54.936339227Z time="2024-07-18T07:12:54Z" level=info msg="Found 1 backups in the backup target (typo: should be "cluster") that do not exist in the backup target and need to be deleted" func="controller.(*BackupVolumeController).reconcile" file="backup_volume_controller.go:354" backupVolume=longhorn-testvol-xza8p1 controller=longhorn-backup-volume node=ip-10-0-2-112
Note: from another controller
# Jack: Failed to acquire lock: longhorn-testvol-xza8p1 backup-26c512b1819948f7
2024-07-18T07:12:56.622235224Z time="2024-07-18T07:12:56Z" level=info msg="backupstore volume longhorn-testvol-xza8p1 contains locks [{ volume: , name: lock-396f125ede724eaf, type: 2, acquired: false, serverTime: 2024-07-18 07:12:54 +0000 UTC } { volume: , name: lock-d3a98544fcfd47c0, type: 1, acquired: false, serverTime: 2024-07-18 07:12:54 +0000 UTC }]" func="backupstore.(*FileLock).canAcquire" file="lock.go:66" pkg=backupstore
2024-07-18T07:12:56.623890999Z [longhorn-testvol-xza8p1-r-1d8e048c] time="2024-07-18T07:12:56Z" level=info msg="Removed lock backupstore/volumes/96/04/longhorn-testvol-xza8p1/locks/lock-d3a98544fcfd47c0.lck type 1 on backupstore" func=backupstore.removeLock file="lock.go:180" pkg=backupstore
2024-07-18T07:12:56.625484900Z [longhorn-testvol-xza8p1-r-1d8e048c] time="2024-07-18T07:12:56Z" level=info msg="Removed lock backupstore/volumes/96/04/longhorn-testvol-xza8p1/locks/lock-d3a98544fcfd47c0.lck type 1 on backupstore" func=backupstore.removeLock file="lock.go:180" pkg=backupstore
2024-07-18T07:12:56.625498855Z time="2024-07-18T07:12:56Z" level=error msg="Failed to create delta block backup" func=backupstore.CreateDeltaBlockBackup.func1 file="deltablock.go:142" destURL="s3://backupbucket@us-east-1/backupstore" error="failed lock backupstore/volumes/96/04/longhorn-testvol-xza8p1/locks/lock-d3a98544fcfd47c0.lck type 1 acquisition" snapshot="&{e47ebae9-8b74-4266-b835-6dbe90c0e0d4 2024-07-18T07:12:54Z}" volume="&{longhorn-testvol-xza8p1 16777216 map[VolumeRecurringJobInfo:{} longhorn.io/volume-access-mode:rwo] 2024-07-18T07:12:54Z 0 lz4 }"
2024-07-18T07:12:56.625503159Z time="2024-07-18T07:12:56Z" level=error msg="Failed to create backup backup-3f1dfc6f3d2e46f8" func="rpc.(*SyncAgentServer).BackupCreate" file="server.go:789" error="failed lock backupstore/volumes/96/04/longhorn-testvol-xza8p1/locks/lock-d3a98544fcfd47c0.lck type 1 acquisition"
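The `canAcquire` failure above can be sketched as a lock-compatibility check: backup locks (type 1) and deletion locks (type 2) are mutually exclusive, and an older lock of the incompatible type blocks acquisition. This is an illustrative sketch of the general idea, not backupstore's exact implementation:

```python
BACKUP_LOCK = 1  # held while a backup is being created or restored
DELETE_LOCK = 2  # held while a backup is being deleted

def can_acquire(candidate, existing_locks):
    """Decide whether `candidate` could be acquired.

    Each lock is a dict with "type" and "server_time". A lock cannot be
    acquired while a lock of a different (incompatible) type that is at
    least as old is still present on the backupstore.
    """
    for lock in existing_locks:
        if lock["type"] != candidate["type"] and \
                lock["server_time"] <= candidate["server_time"]:
            return False
    return True
```

In the log, the type 1 (backup) lock `lock-d3a98544fcfd47c0` loses to the coexisting type 2 (deletion) lock `lock-396f125ede724eaf` with the same server time, so the backup creation fails.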
I think there is a race, but I'm not sure how and why:

- 2024-07-18T07:12:41Z: the `backup-26c512b1819948f7` backup was deleted by the binary command (it should release the lock after completion).
- 2024-07-18T07:12:44Z: `backup-3f1dfc6f3d2e46f8` was created, observed from the webhook "CREATE".
- 2024-07-18T07:12:54Z: the backup-volume-controller found 1 backup (don't know what it is) in the cluster but not in the backupstore, and decided to delete it.
- 2024-07-18T07:12:56Z: `backup-3f1dfc6f3d2e46f8` failed to complete because it hit the deletion lock issue.

I am thinking, is it possible that
cc @derekbit
I think I found the root cause of this issue, from the test case's steps.
Steps:
...
=> 9. Create an artificial in progress backup.cfg file.
This cfg file will convince the longhorn manager that there is a
backup being created. Then all subsequent backup block cleanup will be
skipped.
=> 10. Delete backup(1).
11. Verify backup block count == 3 (because of the in progress backup).
12. Verify DR volume last backup is empty.
=> 13. Delete the artificial in progress backup.cfg file.
14. Overwrite backup(0) 1st blocks of data on the volume.
(Since backup(0) contains 2 blocks of data, the updated data is
data2["content"] + data0["content"][BACKUP_BLOCK_SIZE:])
=> 15. Create backup(2) of the volume.
16. Verify DR volume last backup is backup(2).
17. Activate and verify DR volume data is
data2["content"] + data0["content"][BACKUP_BLOCK_SIZE:].
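The artificial in-progress backup in step 9 amounts to writing a backup config whose fields mark it as not yet complete, so block cleanup treats its blocks as potentially in use. A minimal sketch of such a config; the field names are illustrative, modeled loosely on the `backup_*.cfg` layout, not the exact schema:

```python
import json

def make_in_progress_backup_cfg(name, volume_name):
    """Build the content of an artificial backup_<name>.cfg.

    An empty CreatedTime marks the backup as still in progress, which is
    what makes the manager skip backup block cleanup. (Field names are
    illustrative, not backupstore's exact schema.)
    """
    return json.dumps({
        "Name": name,
        "VolumeName": volume_name,
        "CreatedTime": "",   # empty -> backup considered in progress
        "SnapshotName": "",
        "Blocks": [],
    })
```

The test uploads such a file directly into the backupstore, bypassing Longhorn, which is exactly why the cluster and the backupstore later disagree about which backups exist.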
In step 9, since we add an artificial `backup.cfg` named `dummy` to the backupstore, the `backup-volume-controller` will create the backup CR `dummy` in the cluster.

In step 13, we delete the artificial `backup.cfg` in the backupstore directly and consider it deleted. However, the CR is still there, so the `backup-volume-controller` will scan the cluster and the backupstore and try to delete the CR.

Thus, in the `backup-controller`, the `dummy` CR will be deleted, which calls the binary to delete the backup in the backupstore again, and that deletion introduces the lock.
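The reconcile behavior described above can be sketched as a set difference between the backup names in the cluster and in the backupstore (illustrative pseudologic, not the controller's actual code):

```python
def reconcile_backup_crs(cluster_backups, backupstore_backups):
    """Return (to_pull, to_delete) the way the backup-volume-controller would.

    cluster_backups: names of Backup CRs present in the cluster.
    backupstore_backups: backup names found in the remote backupstore.
    """
    cluster = set(cluster_backups)
    store = set(backupstore_backups)
    to_pull = store - cluster    # exist remotely, missing in cluster -> create CRs
    to_delete = cluster - store  # CRs with no remote backup -> delete CRs; each
                                 # deletion invokes the binary delete, which takes
                                 # a deletion lock on the backupstore
    return to_pull, to_delete
```

Because the `dummy` cfg was removed out-of-band, `dummy` lands in `to_delete`, and the resulting binary deletion holds the lock that blocks `backup-3f1dfc6f3d2e46f8`.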
We can see this in the logs:
// dummy config was deleted, backup-volume-controller tried to delete the CR resource
2024-07-18T07:12:54.936339227Z time="2024-07-18T07:12:54Z" level=info msg="Found 1 backups in the backup target (typo: should be "cluster") that do not exist in the backup target and need to be deleted" func="controller.(*BackupVolumeController).reconcile" file="backup_volume_controller.go:354" backupVolume=longhorn-testvol-xza8p1 controller=longhorn-backup-volume node=ip-10-0-2-112
// failed to create backup because of the lock
2024-07-18T07:12:56.626248414Z time="2024-07-18T07:12:56Z" level=warning msg="Cannot take snapshot backup" func=engineapi.NewBackupMonitor file="backup_monitor.go:95" backup=backup-3f1dfc6f3d2e46f8 controller=longhorn-backup error="proxyServer=10.42.2.9:8501 destination=10.42.1.10:10000: failed to backup snapshot e47ebae9-8b74-4266-b835-6dbe90c0e0d4 to backup-3f1dfc6f3d2e46f8: rpc error: code = Internal desc = failed to create backup: failed to create backup to s3://backupbucket@us-east-1/backupstore for volume longhorn-testvol-xza8p1: rpc error: code = Unknown desc = failed lock backupstore/volumes/96/04/longhorn-testvol-xza8p1/locks/lock-d3a98544fcfd47c0.lck type 1 acquisition" node=ip-10-0-2-193
2024-07-18T07:12:56.626431595Z time="2024-07-18T07:12:56Z" level=warning msg="Failed to enable the backup monitor for backup backup-3f1dfc6f3d2e46f8" func="controller.(*BackupController).reconcile" file="backup_controller.go:416" backup=backup-3f1dfc6f3d2e46f8 controller=longhorn-backup error="proxyServer=10.42.2.9:8501 destination=10.42.1.10:10000: failed to backup snapshot e47ebae9-8b74-4266-b835-6dbe90c0e0d4 to backup-3f1dfc6f3d2e46f8: rpc error: code = Internal desc = failed to create backup: failed to create backup to s3://backupbucket@us-east-1/backupstore for volume longhorn-testvol-xza8p1: rpc error: code = Unknown desc = failed lock backupstore/volumes/96/04/longhorn-testvol-xza8p1/locks/lock-d3a98544fcfd47c0.lck type 1 acquisition" node=ip-10-0-2-193
// at the same time, the dummy backup CR was deleted.
2024-07-18T07:12:57.024169490Z time="2024-07-18T07:12:57Z" level=info msg="Complete deleting backup s3://backupbucket@us-east-1/backupstore?backup=backup-dummy&volume=longhorn-testvol-xza8p1" func="engineapi.(*BackupTargetClient).BackupDelete" file="backups.go:304"
`test_dr_volume_with_backup_block_deletion_abort_during_backup_in_progress` becomes stable.

PRs:
Describe the bug

Test case `test_dr_volume_with_backup_block_deletion_abort_during_backup_in_progress` failed on `v1.7.x-head` with reproducibility ~1/50 due to `failed lock *.lck type 1 acquisition`:
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/7258/testReport/junit/tests/test_basic/test_dr_volume_with_backup_block_deletion_abort_during_backup_in_progress_s3_6_50_/

To Reproduce

Run test case `test_dr_volume_with_backup_block_deletion_abort_during_backup_in_progress` repeatedly.

Expected behavior

Support bundle for troubleshooting

longhorn-tests-regression-7258-bundle.zip

Environment

Additional context