longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG][v1.7.x-head] Test case `test_dr_volume_with_backup_block_deletion_abort_during_backup_in_progress` failed due to `failed lock *.lck type 1 acquisition` #9037

Closed — yangchiu closed this issue 1 month ago

yangchiu commented 1 month ago

Describe the bug

Test case test_dr_volume_with_backup_block_deletion_abort_during_backup_in_progress failed on v1.7.x-head (reproducible in roughly 1 of 50 runs) due to `failed lock *.lck type 1 acquisition`:

https://ci.longhorn.io/job/private/job/longhorn-tests-regression/7258/testReport/junit/tests/test_basic/test_dr_volume_with_backup_block_deletion_abort_during_backup_in_progress_s3_6_50_/

set_random_backupstore = None
client = <longhorn.Client object at 0x7fede0263d50>
core_api = <kubernetes.client.api.core_v1_api.CoreV1Api object at 0x7fede0241f90>
volume_name = 'longhorn-testvol-xza8p1'

    def test_dr_volume_with_backup_block_deletion_abort_during_backup_in_progress(set_random_backupstore, client, core_api, volume_name):  # NOQA
        """
        Test DR volume last backup after block deletion aborted. This will set the
        last backup to be empty.

        Context:

        We want to make sure that when the block deletion for the last backup is
        aborted by operations such as backups in progress, the DR volume will still
        pick up the correct last backup.

        Steps:

        1.  Create a volume and attach to the current node.
        2.  Write 4 MB to the beginning of the volume (2 x 2MB backup blocks).
        3.  Create backup(0) of the volume.
        4.  Overwrite backup(0) 1st blocks of data on the volume.
            (Since backup(0) contains 2 blocks of data, the updated data is
            data1["content"] + data0["content"][BACKUP_BLOCK_SIZE:])
        5.  Create backup(1) of the volume.
        6.  Verify backup block count == 3.
        7.  Create DR volume from backup(1).
        8.  Verify DR volume last backup is backup(1).
        9.  Create an artificial in progress backup.cfg file.
            This cfg file will convince the longhorn manager that there is a
            backup being created. Then all subsequent backup block cleanup will be
            skipped.
        10. Delete backup(1).
        11. Verify backup block count == 3 (because of the in progress backup).
        12. Verify DR volume last backup is empty.
        13. Delete the artificial in progress backup.cfg file.
        14. Overwrite backup(0) 1st blocks of data on the volume.
            (Since backup(0) contains 2 blocks of data, the updated data is
            data2["content"] + data0["content"][BACKUP_BLOCK_SIZE:])
        15. Create backup(2) of the volume.
        16. Verify DR volume last backup is backup(2).
        17. Activate and verify DR volume data is
            data2["content"] + data0["content"][BACKUP_BLOCK_SIZE:].
        """
        backupstore_cleanup(client)

        host_id = get_self_host_id()

        vol = create_and_check_volume(client, volume_name,
                                      num_of_replicas=2,
                                      size=SIZE)
        vol.attach(hostId=host_id)
        vol = common.wait_for_volume_healthy(client, volume_name)

        data0 = {'pos': 0, 'len': 2 * BACKUP_BLOCK_SIZE,
                 'content': common.generate_random_data(2 * BACKUP_BLOCK_SIZE)}
        create_backup(client, volume_name, data0)

        data1 = {'pos': 0, 'len': BACKUP_BLOCK_SIZE,
                 'content': common.generate_random_data(BACKUP_BLOCK_SIZE)}
        _, backup1, _, data1 = create_backup(
            client, volume_name, data1)

        backup_blocks_count = backupstore_count_backup_block_files(client,
                                                                   core_api,
                                                                   volume_name)
        assert backup_blocks_count == 3

        dr_vol_name = "dr-" + volume_name
        client.create_volume(name=dr_vol_name, size=SIZE,
                             numberOfReplicas=2, fromBackup=backup1.url,
                             frontend="", standby=True)
        check_volume_last_backup(client, dr_vol_name, backup1.name)
        wait_for_backup_restore_completed(client, dr_vol_name, backup1.name)

        backupstore_create_dummy_in_progress_backup(client, core_api, volume_name)
        delete_backup(client, volume_name, backup1.name)
        assert backupstore_count_backup_block_files(client,
                                                    core_api,
                                                    volume_name) == 3
        check_volume_last_backup(client, dr_vol_name, "")
        backupstore_delete_dummy_in_progress_backup(client, core_api, volume_name)

        data2 = {'pos': 0,
                 'len': BACKUP_BLOCK_SIZE,
                 'content': common.generate_random_data(BACKUP_BLOCK_SIZE)}
>       _, backup2, _, _ = create_backup(client, volume_name, data2)

test_basic.py:973: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
common.py:459: in create_backup
    wait_for_backup_completion(client, volname, snap.name)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

client = <longhorn.Client object at 0x7fede0263d50>
volume_name = 'longhorn-testvol-xza8p1'
snapshot_name = 'e47ebae9-8b74-4266-b835-6dbe90c0e0d4', retry_count = 300

    def wait_for_backup_completion(client, volume_name, snapshot_name=None,
                                   retry_count=RETRY_BACKUP_COUNTS):
        completed = False
        for _ in range(retry_count):
            v = client.by_id_volume(volume_name)
            for b in v.backupStatus:
                if snapshot_name is not None and b.snapshot != snapshot_name:
                    continue
                if b.state == "Completed":
                    assert b.progress == 100
                    assert b.error == ""
                    completed = True
                    break
            if completed:
                break
            time.sleep(RETRY_BACKUP_INTERVAL)
>       assert completed is True, f" Backup status = {b.state}," \
                                  f" Backup Progress = {b.progress}, Volume = {v}"
E       AssertionError:  Backup status = Error, Backup Progress = 0, Volume = {'accessMode': 'rwo', 'backingImage': '', 'backupCompressionMethod': 'lz4', 'backupStatus': [{'backupURL': 's3://backupbucket@us-east-1/backupstore?backup=backup-23d15bfdecc141b9&volume=longhorn-testvol-xza8p1', 'error': '', 'progress': 100, 'replica': 'longhorn-testvol-xza8p1-r-1d8e048c', 'size': '4194304', 'snapshot': '3017665e-0b3c-469d-a4e3-f6762d68ab45', 'state': 'Completed'}, {'backupURL': '', 'error': 'proxyServer=10.42.2.9:8501 destination=10.42.1.10:10000: failed to backup snapshot e47ebae9-8b74-4266-b835-6dbe90c0e0d4 to backup-3f1dfc6f3d2e46f8: rpc error: code = Internal desc = failed to create backup: failed to create backup to s3://backupbucket@us-east-1/backupstore for volume longhorn-testvol-xza8p1: rpc error: code = Unknown desc = failed lock backupstore/volumes/96/04/longhorn-testvol-xza8p1/locks/lock-d3a98544fcfd47c0.lck type 1 acquisition', 'progress': 0, 'replica': '', 'size': '', 'snapshot': '', 'state': 'Error'}], 'cloneStatus': {'attemptCount': 0, 'nextAllowedAttemptAt': '', 'snapshot': '', 'sourceVolume': '', 'state': ''}, 'conditions': {'Restore': {'lastProbeTime': '', 'lastTransitionTime': '2024-07-18T07:11:25Z', 'message': '', 'reason': '', 'status': 'False'}, 'Scheduled': {'lastProbeTime': '', 'lastTransitionTime': '2024-07-18T07:11:25Z', 'message': '', 'reason': '', 'status': 'True'}, 'TooManySnapshots': {'lastProbeTime': '', 'lastTransitionTime': '2024-07-18T07:11:25Z', 'message': '', 'reason': '', 'status': 'False'}, 'WaitForBackingImage': {'lastProbeTime': '', 'lastTransitionTime': '2024-07-18T07:11:25Z', 'message': '', 'reason': '', 'status': 'False'}}, 'controllers': [{'actualSize': '8388608', 'address': '10.42.1.10', 'currentImage': 'longhornio/longhorn-engine:v1.7.x-head', 'endpoint': '/dev/longhorn/longhorn-testvol-xza8p1', 'hostId': 'ip-10-0-2-106', 'image': 'longhornio/longhorn-engine:v1.7.x-head', 'instanceManagerName': 'instance-manager-1332f4efd18f3c091453a2b2bb0db662', 'isExpanding': False, 'lastExpansionError': '', 'lastExpansionFailedAt': '', 'lastRestoredBackup': '', 'name': 'longhorn-testvol-xza8p1-e-0', 'requestedBackupRestore': '', 'running': True, 'size': '16777216', 'unmapMarkSnapChainRemovedEnabled': False}], 'created': '2024-07-18 07:11:24 +0000 UTC', 'currentImage': 'longhornio/longhorn-engine:v1.7.x-head', 'dataEngine': 'v1', 'dataLocality': 'disabled', 'dataSource': '', 'disableFrontend': False, 'diskSelector': [], 'encrypted': False, 'freezeFilesystemForSnapshot': 'ignored', 'fromBackup': '', 'frontend': 'blockdev', 'image': 'longhornio/longhorn-engine:v1.7.x-head', 'kubernetesStatus': {'lastPVCRefAt': '', 'lastPodRefAt': '', 'namespace': '', 'pvName': '', 'pvStatus': '', 'pvcName': '', 'workloadsStatus': None}, 'lastAttachedBy': '', 'lastBackup': '', 'lastBackupAt': '', 'migratable': False, 'name': 'longhorn-testvol-xza8p1', 'nodeSelector': [], 'numberOfReplicas': 2, 'offlineReplicaRebuilding': 'disabled', 'offlineReplicaRebuildingRequired': False, 'purgeStatus': [{'error': '', 'isPurging': False, 'progress': 0, 'replica': 'longhorn-testvol-xza8p1-r-284f5567', 'state': ''}, {'error': '', 'isPurging': False, 'progress': 0, 'replica': 'longhorn-testvol-xza8p1-r-1d8e048c', 'state': ''}], 'ready': True, 'rebuildStatus': [], 'recurringJobSelector': None, 'replicaAutoBalance': 'ignored', 'replicaDiskSoftAntiAffinity': 'ignored', 'replicaSoftAntiAffinity': 'ignored', 'replicaZoneSoftAntiAffinity': 'ignored', 'replicas': [{'address': '10.42.4.10', 'currentImage': 
'longhornio/longhorn-engine:v1.7.x-head', 'dataEngine': 'v1', 'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-xza8p1-81c8cbaf', 'diskID': '47c4b299-0cec-4db4-ae65-fdf7e04980da', 'diskPath': '/var/lib/longhorn/', 'failedAt': '', 'hostId': 'ip-10-0-2-112', 'image': 'longhornio/longhorn-engine:v1.7.x-head', 'instanceManagerName': 'instance-manager-4bbfb99c67b4c7bfc38197f9b7634b07', 'mode': 'RW', 'name': 'longhorn-testvol-xza8p1-r-1d8e048c', 'running': True}, {'address': '10.42.2.9', 'currentImage': 'longhornio/longhorn-engine:v1.7.x-head', 'dataEngine': 'v1', 'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-xza8p1-b8d12238', 'diskID': '104bad17-e269-4ea1-a643-5d3bd6681376', 'diskPath': '/var/lib/longhorn/', 'failedAt': '', 'hostId': 'ip-10-0-2-193', 'image': 'longhornio/longhorn-engine:v1.7.x-head', 'instanceManagerName': 'instance-manager-31adf39dba7fee19b32ecbe3f34dd732', 'mode': 'RW', 'name': 'longhorn-testvol-xza8p1-r-284f5567', 'running': True}], 'restoreInitiated': False, 'restoreRequired': False, 'restoreStatus': [{'backupURL': '', 'error': '', 'filename': '', 'isRestoring': False, 'lastRestored': '', 'progress': 0, 'replica': 'longhorn-testvol-xza8p1-r-284f5567', 'state': ''}, {'backupURL': '', 'error': '', 'filename': '', 'isRestoring': False, 'lastRestored': '', 'progress': 0, 'replica': 'longhorn-testvol-xza8p1-r-1d8e048c', 'state': ''}], 'restoreVolumeRecurringJob': 'ignored', 'revisionCounterDisabled': False, 'robustness': 'healthy', 'shareEndpoint': '', 'shareState': '', 'size': '16777216', 'snapshotDataIntegrity': 'ignored', 'snapshotMaxCount': 250, 'snapshotMaxSize': '0', 'staleReplicaTimeout': 0, 'standby': False, 'state': 'attached', 'unmapMarkSnapChainRemoved': 'ignored', 'volumeAttachment': {'attachments': {'': {'attachmentID': '', 'attachmentType': 'longhorn-api', 'conditions': [{'lastProbeTime': '', 'lastTransitionTime': '2024-07-18T07:11:29Z', 'message': '', 'reason': '', 'status': 'True'}], 'nodeID': 'ip-10-0-2-106', 'parameters': {'disableFrontend': 'false', 'lastAttachedBy': ''}, 'satisfied': True}}, 'volume': 'longhorn-testvol-xza8p1'}}

common.py:3220: AssertionError

To Reproduce

Run test case test_dr_volume_with_backup_block_deletion_abort_during_backup_in_progress repeatedly until the failure reproduces (roughly 1 in 50 runs); see the reproduction loop sketched below.
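
A minimal reproduction loop, as a sketch only: it reruns the single case via pytest's Python API until it fails. The test file path is an assumption based on the longhorn-tests repository layout and is not taken from this report.

    # Sketch: rerun the flaky case until the lock-acquisition failure reproduces.
    # TEST path is an assumption (longhorn-tests layout); adjust it to your checkout.
    import pytest

    TEST = ("manager/integration/tests/test_basic.py::"
            "test_dr_volume_with_backup_block_deletion_abort_during_backup_in_progress")

    for i in range(1, 51):
        rc = pytest.main(["-x", TEST])
        if rc != 0:
            print(f"Failure reproduced on run {i}")
            break
    else:
        print("No failure observed in 50 runs")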

Expected behavior

Support bundle for troubleshooting

longhorn-tests-regression-7258-bundle.zip

Environment

Additional context

derekbit commented 1 month ago

Is it reproducible in v1.6.2?

UPDATE:
v1.6.2: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/7269/console
master-head: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/7273/console

ChanYiLin commented 1 month ago

I would like to fix this issue with this ticket: make backup creation wait until there is no backup being deleted, and add the progress time.

This should resolve the whole class of deletion-lock timing issues that make the e2e tests flaky. (A test-level sketch of the waiting idea appears below.)
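
As an illustration only (the real change belongs in the manager/backupstore code, not in the tests): the same waiting idea can be approximated at the test level by retrying backup creation while the backupstore still reports the lock error. create_backup() is the common.py helper shown in the traceback above; the retry wrapper itself is hypothetical.

    # Hypothetical test-side analogue of "wait until no backup is being deleted":
    # retry create_backup() while the failure is the backupstore lock error.
    # create_backup() comes from common.py in longhorn-tests (assumed import below);
    # this wrapper is a sketch, not part of the proposed fix.
    import time
    from common import create_backup  # assumed, mirroring test_basic.py usage

    def create_backup_waiting_for_deletion(client, volume_name, data,
                                           retries=5, interval=10):
        last_err = None
        for _ in range(retries):
            try:
                return create_backup(client, volume_name, data)
            except AssertionError as e:
                if "failed lock" not in str(e):
                    raise  # unrelated failure, surface it immediately
                last_err = e
                time.sleep(interval)
        raise last_err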

derekbit commented 1 month ago

The issue is rare and needs more time to investigate. Move it to v1.8.0. cc @innobead

ChanYiLin commented 1 month ago

# Completed deleting backup `backup-26c512b1819948f7` (volume=longhorn-testvol-xza8p1) in the backupstore using the binary command
2024-07-18T07:12:41.799481473Z time="2024-07-18T07:12:41Z" level=info msg="Complete deleting backup s3://backupbucket@us-east-1/backupstore?backup=backup-26c512b1819948f7&volume=longhorn-testvol-xza8p1" func="engineapi.(*BackupTargetClient).BackupDelete" file="backups.go:304"
2024-07-18T07:12:41.802957380Z time="2024-07-18T07:12:41Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=BackupVolume, namespace: longhorn-system, name: longhorn-testvol-xza8p1, operation: UPDATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [\"longhorn.io\"]}]" func="admission.(*Handler).admit" file="admission.go:115" service=admissionWebhook

# I think this is the dummy backup which needs to be pulled
2024-07-18T07:12:41.841410074Z time="2024-07-18T07:12:41Z" level=info msg="Found 1 backups in the backup target that do not exist in the cluster and need to be pulled" func="controller.(*BackupVolumeController).reconcile" file="backup_volume_controller.go:317" backupVolume=longhorn-testvol-xza8p1 controller=longhorn-backup-volume node=ip-10-0-2-112
2024-07-18T07:12:41.875645932Z time="2024-07-18T07:12:41Z" level=warning msg="Failed to get backupInfo from remote backup target" func="controller.(*BackupVolumeController).reconcile" file="backup_volume_controller.go:327" backup=backup-dummy backupVolume=longhorn-testvol-xza8p1 backuptarget="s3://backupbucket@us-east-1/backupstore?backup=backup-dummy&volume=longhorn-testvol-xza8p1" backupvolume=longhorn-testvol-xza8p1 controller=longhorn-backup-volume error="error getting backup config s3://backupbucket@us-east-1/backupstore?backup=backup-dummy&volume=longhorn-testvol-xza8p1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.7.x-head/longhorn [/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.7.x-head/longhorn backup inspect s3://backupbucket@us-east-1/backupstore?backup=backup-dummy&volume=longhorn-testvol-xza8p1], output invalid character '\\'' looking for beginning of object key string\n, stderr time=\"2024-07-18T07:12:41Z\" level=info msg=\"Loaded driver for s3://backupbucket@us-east-1/backupstore\" func=s3.initFunc file=\"s3.go:73\" pkg=s3\ntime=\"2024-07-18T07:12:41Z\" level=info msg=\"Loading config in backupstore\" func=backupstore.LoadConfigInBackupStore file=\"config.go:56\" filepath=backupstore/volumes/96/04/longhorn-testvol-xza8p1/volume.cfg kind=s3 object=config pkg=backupstore reason=start\ntime=\"2024-07-18T07:12:41Z\" level=info msg=\"Loaded config in backupstore\" func=backupstore.LoadConfigInBackupStore file=\"config.go:67\" filepath=backupstore/volumes/96/04/longhorn-testvol-xza8p1/volume.cfg kind=s3 object=config pkg=backupstore reason=complete\ntime=\"2024-07-18T07:12:41Z\" level=info msg=\"Loading config in backupstore\" func=backupstore.LoadConfigInBackupStore file=\"config.go:56\" filepath=backupstore/volumes/96/04/longhorn-testvol-xza8p1/backups/backup_backup-dummy.cfg kind=s3 object=config pkg=backupstore reason=start\ntime=\"2024-07-18T07:12:41Z\" level=info msg=\"Failed to load backup in backupstore\" func=backupstore.InspectBackup file=\"inspect.go:55\" backup=backup-dummy event=list object=backup pkg=backupstore reason=fallback volume=longhorn-testvol-xza8p1\ntime=\"2024-07-18T07:12:41Z\" level=error msg=\"invalid character '\\\\'' looking for beginning of object key string\" func=main.ResponseLogAndError file=\"main.go:47\"\n: exit status 1" node=ip-10-0-2-112
2024-07-18T07:12:41.880293705Z time="2024-07-18T07:12:41Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=Backup, namespace: longhorn-system, name: backup-dummy, operation: CREATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [\"longhorn.io\"]},{\"op\": \"replace\", \"path\": \"/spec/labels\", \"value\": {\"longhorn.io/volume-access-mode\":\"rwo\"}},{\"op\": \"replace\", \"path\": \"/spec/backupMode\", \"value\": \"incremental\"}]" func="admission.(*Handler).admit" file="admission.go:115" service=admissionWebhook
2024-07-18T07:12:41.884040059Z W0718 07:12:41.883937       1 warnings.go:70] metadata.finalizers: "longhorn.io": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers

# don't know what this is, but this backup deletion might introduce the lock
2024-07-18T07:12:41.884199865Z time="2024-07-18T07:12:41Z" level=info msg="Found 1 backups in the backup target (typo: should be "cluster") that do not exist in the backup target and need to be deleted" func="controller.(*BackupVolumeController).reconcile" file="backup_volume_controller.go:354" backupVolume=longhorn-testvol-xza8p1 controller=longhorn-backup-volume node=ip-10-0-2-112

# webhook got `backup-3f1dfc6f3d2e46f8` created, it failed later at "2024-07-18T07:12:56Z"
2024-07-18T07:12:44.055002058Z time="2024-07-18T07:12:44Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=Backup, namespace: longhorn-system, name: backup-3f1dfc6f3d2e46f8, operation: CREATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [\"longhorn.io\"]},{\"op\": \"replace\", \"path\": \"/spec/labels\", \"value\": {\"longhorn.io/volume-access-mode\":\"rwo\"}},{\"op\": \"replace\", \"path\": \"/spec/backupMode\", \"value\": \"incremental\"}]" func="admission.(*Handler).admit" file="admission.go:115" service=admissionWebhook
2024-07-18T07:12:44.168350925Z time="2024-07-18T07:12:44Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=VolumeAttachment, namespace: longhorn-system, name: longhorn-testvol-xza8p1, operation: UPDATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [\"longhorn.io\"]}]" func="admission.(*Handler).admit" file="admission.go:115" service=admissionWebhook
2024-07-18T07:12:44.995616805Z time="2024-07-18T07:12:44Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=Snapshot, namespace: longhorn-system, name: e47ebae9-8b74-4266-b835-6dbe90c0e0d4, operation: CREATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [\"longhorn.io\"]},{\"op\": \"replace\", \"path\": \"/spec/dataEngine\", \"value\": \"v1\"},{\"op\": \"replace\", \"path\": \"/metadata/labels\", \"value\": {\"longhornvolume\":\"longhorn-testvol-xza8p1\"}},{\"op\": \"replace\", \"path\": \"/metadata/ownerReferences\", \"value\": [{\"apiVersion\":\"longhorn.io/v1beta2\",\"kind\":\"Volume\",\"name\":\"longhorn-testvol-xza8p1\",\"uid\":\"ca989265-0c44-416a-b85c-96e6dd7c30fd\"}]}]" func="admission.(*Handler).admit" file="admission.go:115" service=admissionWebhook
2024-07-18T07:12:45.107208069Z time="2024-07-18T07:12:45Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=Snapshot, namespace: longhorn-system, name: 82fcc2bb-d1d0-462d-8c11-d9b299c9fe4e, operation: CREATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [\"longhorn.io\"]},{\"op\": \"replace\", \"path\": \"/spec/dataEngine\", \"value\": \"v1\"},{\"op\": \"replace\", \"path\": \"/metadata/labels\", \"value\": {\"longhornvolume\":\"longhorn-testvol-xza8p1\"}},{\"op\": \"replace\", \"path\": \"/metadata/ownerReferences\", \"value\": [{\"apiVersion\":\"longhorn.io/v1beta2\",\"kind\":\"Volume\",\"name\":\"longhorn-testvol-xza8p1\",\"uid\":\"ca989265-0c44-416a-b85c-96e6dd7c30fd\"}]}]" func="admission.(*Handler).admit" file="admission.go:115" service=admissionWebhook
2024-07-18T07:12:45.118545467Z time="2024-07-18T07:12:45Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=Snapshot, namespace: longhorn-system, name: 15e86acc-daa4-44c5-91eb-302a249c36d0, operation: CREATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [\"longhorn.io\"]},{\"op\": \"replace\", \"path\": \"/spec/dataEngine\", \"value\": \"v1\"},{\"op\": \"replace\", \"path\": \"/metadata/labels\", \"value\": {\"longhornvolume\":\"longhorn-testvol-xza8p1\"}},{\"op\": \"replace\", \"path\": \"/metadata/ownerReferences\", \"value\": [{\"apiVersion\":\"longhorn.io/v1beta2\",\"kind\":\"Volume\",\"name\":\"longhorn-testvol-xza8p1\",\"uid\":\"ca989265-0c44-416a-b85c-96e6dd7c30fd\"}]}]" func="admission.(*Handler).admit" file="admission.go:115" service=admissionWebhook
2024-07-18T07:12:54.895440547Z time="2024-07-18T07:12:54Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=BackupVolume, namespace: longhorn-system, name: longhorn-testvol-xza8p1, operation: UPDATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/finalizers\", \"value\": [\"longhorn.io\"]}]" func="admission.(*Handler).admit" file="admission.go:115" service=admissionWebhook

# Same as above, but triggered again
# don't know what this is, but this backup deletion might introduce the lock
2024-07-18T07:12:54.936339227Z time="2024-07-18T07:12:54Z" level=info msg="Found 1 backups in the backup target (typo: should be "cluster") that do not exist in the backup target and need to be deleted" func="controller.(*BackupVolumeController).reconcile" file="backup_volume_controller.go:354" backupVolume=longhorn-testvol-xza8p1 controller=longhorn-backup-volume node=ip-10-0-2-112

Note: from another controller
# Jack: Failed to acquire lock: longhorn-testvol-xza8p1   backup-26c512b1819948f7
2024-07-18T07:12:56.622235224Z time="2024-07-18T07:12:56Z" level=info msg="backupstore volume longhorn-testvol-xza8p1 contains locks [{ volume: , name: lock-396f125ede724eaf, type: 2, acquired: false, serverTime: 2024-07-18 07:12:54 +0000 UTC } { volume: , name: lock-d3a98544fcfd47c0, type: 1, acquired: false, serverTime: 2024-07-18 07:12:54 +0000 UTC }]" func="backupstore.(*FileLock).canAcquire" file="lock.go:66" pkg=backupstore
2024-07-18T07:12:56.623890999Z [longhorn-testvol-xza8p1-r-1d8e048c] time="2024-07-18T07:12:56Z" level=info msg="Removed lock backupstore/volumes/96/04/longhorn-testvol-xza8p1/locks/lock-d3a98544fcfd47c0.lck type 1 on backupstore" func=backupstore.removeLock file="lock.go:180" pkg=backupstore
2024-07-18T07:12:56.625484900Z [longhorn-testvol-xza8p1-r-1d8e048c] time="2024-07-18T07:12:56Z" level=info msg="Removed lock backupstore/volumes/96/04/longhorn-testvol-xza8p1/locks/lock-d3a98544fcfd47c0.lck type 1 on backupstore" func=backupstore.removeLock file="lock.go:180" pkg=backupstore
2024-07-18T07:12:56.625498855Z time="2024-07-18T07:12:56Z" level=error msg="Failed to create delta block backup" func=backupstore.CreateDeltaBlockBackup.func1 file="deltablock.go:142" destURL="s3://backupbucket@us-east-1/backupstore" error="failed lock backupstore/volumes/96/04/longhorn-testvol-xza8p1/locks/lock-d3a98544fcfd47c0.lck type 1 acquisition" snapshot="&{e47ebae9-8b74-4266-b835-6dbe90c0e0d4 2024-07-18T07:12:54Z}" volume="&{longhorn-testvol-xza8p1 16777216 map[VolumeRecurringJobInfo:{} longhorn.io/volume-access-mode:rwo] 2024-07-18T07:12:54Z   0   lz4  }"
2024-07-18T07:12:56.625503159Z time="2024-07-18T07:12:56Z" level=error msg="Failed to create backup backup-3f1dfc6f3d2e46f8" func="rpc.(*SyncAgentServer).BackupCreate" file="server.go:789" error="failed lock backupstore/volumes/96/04/longhorn-testvol-xza8p1/locks/lock-d3a98544fcfd47c0.lck type 1 acquisition"

I think there is a race, but I am not sure how or why it happens.

I am thinking, is it possible that

cc @derekbit

ChanYiLin commented 1 month ago

I think I found the root cause of this issue. From the steps

 Steps:
...
        => 9.  Create an artificial in progress backup.cfg file.
            This cfg file will convince the longhorn manager that there is a
            backup being created. Then all subsequent backup block cleanup will be
            skipped.
        => 10. Delete backup(1).
        11. Verify backup block count == 3 (because of the in progress backup).
        12. Verify DR volume last backup is empty.
        => 13. Delete the artificial in progress backup.cfg file.
        14. Overwrite backup(0) 1st blocks of data on the volume.
            (Since backup(0) contains 2 blocks of data, the updated data is
            data2["content"] + data0["content"][BACKUP_BLOCK_SIZE:])
        => 15. Create backup(2) of the volume.
        16. Verify DR volume last backup is backup(2).
        17. Activate and verify DR volume data is
            data2["content"] + data0["content"][BACKUP_BLOCK_SIZE:].

In step 9, because we add an artificial backup.cfg named dummy to the backupstore, the backup-volume-controller creates the corresponding backup-dummy CR in the cluster. In step 13, we delete the artificial backup.cfg directly from the backupstore and consider it gone. However, the CR is still in the cluster, so when the backup-volume-controller next compares the cluster against the backupstore, it tries to delete that leftover CR. The backup-controller then processes the CR deletion and calls the binary to delete the backup in the backupstore again, which introduces the deletion lock that blocks the new backup.

We can see this in the logs:


// dummy config was deleted, backup-volume-controller tried to delete the CR resource
2024-07-18T07:12:54.936339227Z time="2024-07-18T07:12:54Z" level=info msg="Found 1 backups in the backup target (typo: should be "cluster") that do not exist in the backup target and need to be deleted" func="controller.(*BackupVolumeController).reconcile" file="backup_volume_controller.go:354" backupVolume=longhorn-testvol-xza8p1 controller=longhorn-backup-volume node=ip-10-0-2-112

// failed to create backup because of the lock
2024-07-18T07:12:56.626248414Z time="2024-07-18T07:12:56Z" level=warning msg="Cannot take snapshot backup" func=engineapi.NewBackupMonitor file="backup_monitor.go:95" backup=backup-3f1dfc6f3d2e46f8 controller=longhorn-backup error="proxyServer=10.42.2.9:8501 destination=10.42.1.10:10000: failed to backup snapshot e47ebae9-8b74-4266-b835-6dbe90c0e0d4 to backup-3f1dfc6f3d2e46f8: rpc error: code = Internal desc = failed to create backup: failed to create backup to s3://backupbucket@us-east-1/backupstore for volume longhorn-testvol-xza8p1: rpc error: code = Unknown desc = failed lock backupstore/volumes/96/04/longhorn-testvol-xza8p1/locks/lock-d3a98544fcfd47c0.lck type 1 acquisition" node=ip-10-0-2-193
2024-07-18T07:12:56.626431595Z time="2024-07-18T07:12:56Z" level=warning msg="Failed to enable the backup monitor for backup backup-3f1dfc6f3d2e46f8" func="controller.(*BackupController).reconcile" file="backup_controller.go:416" backup=backup-3f1dfc6f3d2e46f8 controller=longhorn-backup error="proxyServer=10.42.2.9:8501 destination=10.42.1.10:10000: failed to backup snapshot e47ebae9-8b74-4266-b835-6dbe90c0e0d4 to backup-3f1dfc6f3d2e46f8: rpc error: code = Internal desc = failed to create backup: failed to create backup to s3://backupbucket@us-east-1/backupstore for volume longhorn-testvol-xza8p1: rpc error: code = Unknown desc = failed lock backupstore/volumes/96/04/longhorn-testvol-xza8p1/locks/lock-d3a98544fcfd47c0.lck type 1 acquisition" node=ip-10-0-2-193

// at the same time, the dummy backup CR was deleted.
2024-07-18T07:12:57.024169490Z time="2024-07-18T07:12:57Z" level=info msg="Complete deleting backup s3://backupbucket@us-east-1/backupstore?backup=backup-dummy&volume=longhorn-testvol-xza8p1" func="engineapi.(*BackupTargetClient).BackupDelete" file="backups.go:304"
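
Given that sequence, one possible test-side guard (a sketch under assumptions, not the agreed fix) is to wait, after step 13, until the leftover backup-dummy CR has actually been cleaned up before creating backup(2), so its deletion cannot race with the new backup's lock. The client calls by_id_backupVolume() and backupList() are assumptions about the longhorn Python client, not something shown in this report.

    # Hypothetical guard between step 13 and step 15: wait until the dummy backup
    # CR created from the artificial backup.cfg is gone before creating backup(2).
    # by_id_backupVolume()/backupList() are assumed client calls, not from this log.
    import time

    def wait_for_dummy_backup_cleanup(client, volume_name,
                                      retries=60, interval=2):
        for _ in range(retries):
            bv = client.by_id_backupVolume(volume_name)
            backups = bv.backupList().data if bv is not None else []
            if all(b.name != "backup-dummy" for b in backups):
                return
            time.sleep(interval)
        assert False, "backup-dummy was not cleaned up in time"
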
longhorn-io-github-bot commented 1 month ago

Pre Ready-For-Testing Checklist

PRs:

chriscchien commented 1 month ago

Closing this issue as the test case is stable now (50 runs on both amd64 and arm64 all passed).